#MachineLearning #Algorithms #Boosting #XGBoost #MLAlgorithms #DataScience
Features - XGBoost
Numeric VS categorical variables
Tree Boosting in a Nutshell
Industry Usage?
Why XGBoost?
XGBoost is short for eXtreme Gradient Boosting.
BTW what is boosting?
Quick Explanation
Two common terms used in ML are bagging and boosting.
Bagging: an approach where you draw random samples of the data (with replacement), train a model on each sample, and average the models' predictions to get the final prediction.
Boosting: similar, but the samples are chosen more intelligently. Each successive model gives more weight to the observations that earlier models found hard to classify.
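As a rough illustration of the "more weight to hard-to-classify observations" idea, here is a minimal sketch of AdaBoost-style reweighting with one-dimensional decision stumps. Note that this is classic AdaBoost, not XGBoost's own algorithm (gradient boosting fits each new tree to gradients of the loss rather than reweighting samples explicitly), and all function names below are made up for this example.

```python
import math

def stump_predict(x, threshold, polarity):
    # Predict `polarity` above the threshold and `-polarity` at or below it.
    return polarity if x > threshold else -polarity

def fit_stump(xs, ys, weights):
    """Return (threshold, polarity, weighted_error) of the best 1-D stump."""
    values = sorted(set(xs))
    # Candidate thresholds: one below all points, plus midpoints between neighbours.
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in thresholds:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, t, polarity) != y)
            if best is None or err < best[2]:
                best = (t, polarity, err)
    return best

def adaboost(xs, ys, rounds):
    """Labels must be +1 or -1. Returns the ensemble and the final sample weights."""
    n = len(xs)
    weights = [1.0 / n] * n          # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        t, p, err = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, p))
        # Misclassified points (y * prediction == -1) are scaled up,
        # correctly classified points are scaled down.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, p))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble, weights

def predict(ensemble, x):
    # Weighted vote of all stumps.
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5, 6]
    ys = [1, 1, 1, -1, -1, 1]        # the last point breaks the simple pattern
    ensemble, weights = adaboost(xs, ys, rounds=1)
    print(weights)                   # the misclassified point now carries the most weight
```

After a single round on this toy data, the one point the first stump gets wrong ends up with a much larger weight than the others, so the next stump is pushed to classify it correctly.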
Now, coming back to XGBoost: why is it so important?
In broad terms, it comes down to the efficiency, accuracy and feasibility of the algorithm.
It includes both a linear model solver and tree learning algorithms.
What makes it fast is its capacity to do parallel computation on a single machine.
It also has additional features for doing cross-validation and finding important variables.
Features - XGBoost
- Speed: it can automatically do parallel computation on Windows and Linux, with OpenMP. It is generally over 10 times faster than the classical gbm.
- Input Type: it takes several types of input data:
  - Dense Matrix: R's dense matrix, i.e. matrix;
  - Sparse Matrix: R's sparse matrix, i.e. Matrix::dgCMatrix;
  - Data File: local data files;
  - xgb.DMatrix: its own class (recommended).
- Sparsity: it accepts sparse input for both tree booster and linear booster, and is optimized for sparse input;
- Customization: it supports customized objective functions and evaluation functions.
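To give a feel for the customization point, here is a minimal sketch of what a custom objective and a custom evaluation function typically compute. A custom objective for XGBoost supplies, for each example, the first and second derivatives (gradient and Hessian) of the loss with respect to the raw prediction. The sketch uses plain Python lists and hypothetical function names; the exact signature the R or Python package expects is not shown here and should be checked against the package documentation.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def logistic_objective(raw_scores, labels):
    """Per-example gradient and Hessian of log loss w.r.t. the raw score.

    raw_scores: untransformed model outputs (log-odds)
    labels: 0 or 1
    """
    grads, hessians = [], []
    for score, label in zip(raw_scores, labels):
        p = sigmoid(score)
        grads.append(p - label)          # first derivative of log loss
        hessians.append(p * (1.0 - p))   # second derivative of log loss
    return grads, hessians

def error_rate(raw_scores, labels):
    """Custom evaluation metric: fraction of examples classified incorrectly."""
    wrong = sum(1 for s, y in zip(raw_scores, labels) if (s > 0.0) != (y == 1))
    return wrong / len(labels)

if __name__ == "__main__":
    scores = [1.5, -0.3, 0.2]
    labels = [1, 0, 0]
    print(logistic_objective(scores, labels))
    print(error_rate(scores, labels))
```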
Numeric VS categorical variables
XGBoost handles only numeric vectors.
What should you do when you have categorical data?
A simple way to convert a categorical variable into a numeric vector is one-hot encoding.
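A minimal sketch of one-hot encoding, using only plain Python: each distinct category becomes its own 0/1 column. The function name and the toy data are made up for this example; in practice a library routine would usually do this step.

```python
def one_hot_encode(values):
    """Turn a list of category labels into 0/1 indicator rows.

    Returns (categories, rows), where categories gives the column order.
    """
    categories = sorted(set(values))                  # fixed, reproducible column order
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1                         # exactly one "hot" column per row
        rows.append(row)
    return categories, rows

if __name__ == "__main__":
    colours = ["red", "green", "red", "blue"]
    categories, rows = one_hot_encode(colours)
    print(categories)   # ['blue', 'green', 'red']
    for row in rows:
        print(row)
```

The resulting matrix is mostly zeros, which is one reason sparse input support matters for categorical data with many levels.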
Tree Boosting in a Nutshell
We first briefly review the learning objective in tree boosting. For a given data set with n examples and m features, a tree ensemble model (shown in Fig. above) uses K additive functions to predict the output.
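Written out, the additive model described above looks as follows. The notation is an assumption, following the usual presentation of tree boosting: x_i is the i-th example (a vector of m features), each f_k is one regression tree from the space of trees F, l is a differentiable loss, and Omega penalizes a tree with T leaves and leaf weights w.

```latex
% Prediction: the sum of K tree outputs
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}, \quad i = 1, \dots, n

% Regularized learning objective minimized during training
\mathcal{L} = \sum_{i=1}^{n} l(\hat{y}_i, y_i) + \sum_{k=1}^{K} \Omega(f_k),
\qquad \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2
```

The first sum measures how well the predictions fit the training labels; the second discourages overly complex trees.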
Industry Usage?
XGBoost has also been widely adopted by industry users, including Google, Alibaba and Tencent, and various startup companies.
According to a popular article in Forbes, XGBoost can scale smoothly to hundreds of workers (with each worker utilizing multiple processors) and solve machine learning problems involving terabytes of real-world data.