Saptak's Diary: Sep 11, 2017

Monday, September 11, 2017

Why XGBoost? And Why Is It So Powerful in Machine Learning?

#MachineLearning #Algorithms #Boosting #XGBoost #MLAlgorithms #DataScience

Why XGBoost?

XGBoost is short for eXtreme Gradient Boosting.

BTW, what is boosting?

Quick Explanation

Two common terms used in ML are bagging and boosting.

Bagging: an approach where you draw random samples of the data (with replacement), train a model on each sample, and take the simple mean of the models' predictions to get the bagged probabilities.

Boosting: similar, but the samples are selected more intelligently. Models are trained one after another, and each round gives more weight to the observations that the earlier models found hard to classify.
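To make the difference concrete, here is a minimal sketch in plain Python. It is only an illustration of the two ideas, not how XGBoost is implemented. The helper names (bootstrap_sample, bagged_probability, boost_weights), the toy "models" and the fixed reweighting factor are all assumptions made for this example; real boosting algorithms such as AdaBoost derive the factor from the error rate.

```python
import random

def bootstrap_sample(data, rng):
    """Bagging step: draw len(data) items at random, with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_probability(models, x):
    """Bagging step: the final probability is the simple mean over models."""
    return sum(model(x) for model in models) / len(models)

def boost_weights(weights, was_misclassified, factor=2.0):
    """Boosting step: multiply the weight of every misclassified
    observation by `factor` (an illustrative constant), then
    renormalise so the weights sum to 1 again."""
    raised = [w * factor if miss else w
              for w, miss in zip(weights, was_misclassified)]
    total = sum(raised)
    return [w / total for w in raised]

# Bagging: one toy "model" per bootstrap sample. Each model simply
# predicts the fraction of positive labels it saw in its own sample.
rng = random.Random(0)
labels = [1, 1, 0, 1, 0, 0, 1, 1]
models = []
for _ in range(5):
    sample = bootstrap_sample(labels, rng)
    rate = sum(sample) / len(sample)
    models.append(lambda x, rate=rate: rate)
bagged = bagged_probability(models, x=None)

# Boosting: start from equal weights; suppose the second and fourth
# observations were misclassified in the previous round.
weights = boost_weights([0.25, 0.25, 0.25, 0.25],
                        [False, True, False, True])
```

After the update, the two misclassified observations carry twice the weight of the others, so the next model in the sequence pays more attention to them.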

Now, coming back to XGBoost: why is it so important?

In broad terms, it’s the efficiency, accuracy and feasibility of this algorithm.

It includes both a linear model solver and tree learning algorithms. What makes it fast is its ability to do parallel computation on a single machine.

It also has additional features for doing cross-validation and finding important variables.

Features - XGBoost

Speed: it can automatically do parallel computation on Windows and Linux with OpenMP. It is generally over 10 times faster than the classical gbm package.
Input type: it takes several types of input data:

Dense matrix: R's dense matrix, i.e. matrix;
Sparse matrix: R's sparse matrix, i.e. Matrix::dgCMatrix;
Data file: local data files;
xgb.DMatrix: its own class (recommended).

Sparsity: it accepts sparse input for both the tree booster and the linear booster, and is optimized for sparse input.
Customization: it supports customized objective functions and evaluation functions.
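As a rough idea of what a customized objective involves: in XGBoost such a function typically returns, for each prediction, the first derivative (gradient) and second derivative (hessian) of the loss. The sketch below computes both for the logistic loss in plain Python. The function name logistic_grad_hess and the plain-list interface are assumptions made for this example; the library's own callback works on its own data classes.

```python
import math

def logistic_grad_hess(raw_scores, labels):
    """For each observation, return the first and second derivative of
    the log loss with respect to the raw (pre-sigmoid) score."""
    grads, hess = [], []
    for score, label in zip(raw_scores, labels):
        p = 1.0 / (1.0 + math.exp(-score))  # predicted probability
        grads.append(p - label)             # gradient of the log loss
        hess.append(p * (1.0 - p))          # second derivative
    return grads, hess

# Hypothetical raw scores and true labels for two observations.
grads, hess = logistic_grad_hess([0.0, 2.0], [1, 0])
```

A negative gradient means the score should go up (the first observation is a positive that was scored at 0), and a positive gradient means it should go down.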

Numeric vs. categorical variables

XGBoost works only with numeric vectors.

What to do when you have categorical data?

A simple way to convert a categorical variable into a numeric vector is one-hot encoding.
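A minimal sketch of one-hot encoding in plain Python is shown here. The function name one_hot_encode and the sample column are made up for this illustration; in practice you would normally use a library routine rather than rolling your own.

```python
def one_hot_encode(values):
    """Return (categories, rows): the sorted distinct categories, and one
    0/1 vector per input value with a single 1 marking its category."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows

colors = ["red", "green", "blue", "green"]  # hypothetical categorical column
categories, encoded = one_hot_encode(colors)
# categories is ['blue', 'green', 'red']
# encoded is [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Each category becomes its own 0/1 column, so the model never sees an artificial ordering between categories.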

Tree Boosting in a Nutshell

We first briefly review the learning objective in tree boosting. For a given data set with n examples and m features, a tree ensemble model (shown in the figure above) uses K additive functions to predict the output.
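In other words, the prediction for an input x is the sum f_1(x) + f_2(x) + ... + f_K(x), where each f_k is one tree. The sketch below illustrates this with one-split trees ("stumps") on a single feature, each fitted to the residuals left by the trees before it. This is a simplified squared-error boosting loop written for illustration, not XGBoost's actual training procedure; the names fit_stump, fit_ensemble and predict, the learning rate and the toy data are all assumptions.

```python
def fit_stump(xs, residuals):
    """Fit a one-split tree: choose the threshold that minimises squared
    error and predict the mean residual on each side of the split."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: no split is possible
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_ensemble(xs, ys, K=3, learning_rate=0.5):
    """Build K additive functions, each fitted to the current residuals."""
    trees = []
    predictions = [0.0] * len(xs)
    for _ in range(K):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        tree = lambda x, stump=stump: learning_rate * stump(x)
        trees.append(tree)
        predictions = [p + tree(x) for p, x in zip(predictions, xs)]
    return trees

def predict(trees, x):
    """The ensemble output is the sum of the K individual functions."""
    return sum(f(x) for f in trees)

xs = [1.0, 2.0, 3.0, 4.0]   # toy feature values
ys = [1.0, 1.0, 3.0, 3.0]   # toy targets
trees = fit_ensemble(xs, ys, K=5)
```

Each extra tree closes part of the remaining gap, so predict(trees, 1.0) moves closer to the target 1.0 as K grows.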

Industry Usage?

XGBoost has been widely adopted by industry users, including Google, Alibaba and Tencent, as well as various startup companies. According to a popular article in Forbes, XGBoost can scale smoothly to hundreds of workers (with each worker using multiple processors) and solve machine learning problems involving terabytes of real-world data.
