Basic algorithms for data regression and classification prediction, and their Python implementation



This article is about regression and classification of data and the analysis of predictions. It discusses and compares several relatively simple machine learning algorithms that serve as common baselines.

1. KNN Algorithm

The k-nearest neighbors (KNN) algorithm can be used for both regression analysis and classification analysis. The main idea is to take the k training points nearest to a query point and average their target values (or take a majority vote) to produce the regression or classification output. In general, the larger the value of K, the smoother the output will be and the lower its variance, but the bias will become larger; conversely, too small a K may lead to overfitting. Therefore, a reasonable selection of the value of K is a very important step in the KNN algorithm.

Advantages

First, it is simple and effective. Second, the cost of retraining is low (changes in the category system and in the training set are common in Web environments and e-commerce applications). Third, its time and space cost is linear in the size of the training set (acceptable when the training set is not too large). Fourth, since the KNN method depends mainly on the limited number of neighboring samples, rather than on a discriminant function over the class domain, it is better suited than other methods to sample sets whose class domains cross or overlap heavily. Fifth, this algorithm is more suitable for the automatic classification of class domains with large sample sizes; class domains with smaller sample sizes are more prone to misclassification with it.

Disadvantages

• The estimate of the regression function can be highly unstable, as it is an average of only a few points. This is the price we pay for flexibility.

• The curse of dimensionality.

• Generating predictions is computationally expensive.

In summary, the nearest-neighbor algorithm is easy to understand, but its computational cost grows considerably with large data volumes and high dimensionality, so it is not recommended in those cases.

KNN's Python implementation:

(Select data)

(Or, with two predictor variables:)

(Take K values 2 and 50 for example)

(Finally, evaluation of the model)
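A minimal sketch of what these steps might look like, assuming scikit-learn and a synthetic two-variable dataset (both the estimator choice and the data are assumptions, not the original's):

```python
# KNN regression sketch: make synthetic data with two predictor
# variables, fit the model with K = 2 and K = 50, and evaluate by RMSE.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 2))          # two predictor variables
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (2, 50):                              # the two example K values
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print(f"K={k}: test RMSE = {rmse:.3f}")
```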


It is worth mentioning that the value of K is a choice to make.

You can find the K that minimizes the RMSE on the test dataset by enumerating various values of K. (The RMSE on the training dataset increases as K increases.)
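Continuing the sketch above, a hedged example of that enumeration (same assumed data and imports):

```python
# Enumerate K, record train and test RMSE, and pick the K with the
# smallest test RMSE; train RMSE grows with K, as noted above.
ks = range(1, 101)
train_rmse, test_rmse = [], []
for k in ks:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    train_rmse.append(mean_squared_error(y_train, model.predict(X_train)) ** 0.5)
    test_rmse.append(mean_squared_error(y_test, model.predict(X_test)) ** 0.5)

best_k = list(ks)[int(np.argmin(test_rmse))]
print("K with minimum test RMSE:", best_k)
```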

2. Regularization of Regression

When we do regression, the main things to consider are the variance and the bias. When we take the OLS approach, we only consider the bias: as the number of predictors and the model complexity increase, the bias becomes smaller but the variance increases correspondingly, which has a great impact on our predictions. How to balance the two is the question most worth considering.

Hence the concept of regularization.

2.1 Ridge Regression

Ridge regression is also called L2 regularization.
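For reference, ridge regression minimizes the ordinary least-squares loss plus an L2 penalty on the coefficient vector (a standard formulation, not reproduced from the original):

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2$$

where $\lambda \ge 0$ controls the strength of the shrinkage: a larger $\lambda$ gives smaller coefficients, lower variance, and higher bias.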

Advantages

Solving multicollinearity is one of the advantages of ridge regression, and using the ridge model can improve predictive performance. Another advantage is that the ridge model can significantly alleviate the overfitting problem by introducing a penalty term: the regularization shrinks the coefficients of unimportant features until they are very close to zero, efficiently reducing the variance and improving the performance of the prediction model.

Disadvantages

Since the penalty can shrink coefficients infinitely close to zero but never exactly to zero, there may still be many features that cannot be removed from the model entirely.

In conclusion, ridge regression can solve multicollinearity problems (multicollinearity means that the model estimates are distorted or hard to estimate accurately because of exact or highly correlated relationships between the explanatory variables of a linear regression model). It also penalizes overfitting; in practical applications, try it and check the actual size of the error.

Python implementation:
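A minimal sketch of what this step might look like with scikit-learn's `RidgeCV`, reusing the synthetic `X_train`/`X_test` split from the KNN sketch above (the estimator and data are assumptions, not the original's code):

```python
# RidgeCV fits ridge regression and picks the penalty strength alpha
# (the lambda in the formula above) by cross-validation.
import numpy as np
from sklearn.linear_model import RidgeCV

ridge = RidgeCV(alphas=np.logspace(-3, 3, 50), cv=5)
ridge.fit(X_train, y_train)            # X_train/y_train from the KNN sketch
print("chosen alpha:", ridge.alpha_)
print("test R^2:", ridge.score(X_test, y_test))
```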

2.2 Lasso Regression

Lasso regression is also called L1 regularization.
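For reference, the lasso replaces the L2 penalty with an L1 penalty (again a standard formulation, not from the original):

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1$$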

The pros and cons are similar to those of ridge regression, with the difference that the L1 penalty can shrink coefficients exactly to zero, so the lasso also performs feature selection.

Python implementation:

It is worth mentioning that a convenient feature in Python is that, given the candidate values and a CV setting, the estimator automatically chooses the best lambda by cross-validation.
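A minimal sketch of that automatic selection with scikit-learn's `LassoCV`, again on the assumed synthetic split (illustrative, not the original's code):

```python
# LassoCV tries each candidate alpha (lambda) and keeps the one with
# the best cross-validated error.
import numpy as np
from sklearn.linear_model import LassoCV

lasso = LassoCV(alphas=np.logspace(-3, 1, 50), cv=5)
lasso.fit(X_train, y_train)
print("chosen alpha:", lasso.alpha_)
print("nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))
```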

3. XGBoost

XGBoost is currently among the most effective methods for classification and regression prediction. Its accuracy is among the best of all the methods discussed here, so it has great practical significance.

The principle of XGBoost is based on decision trees: the final prediction is obtained by summing the outputs of the different decision trees. An ordinary decision tree is first grown as large as possible and then pruned with a greedy algorithm. The point of XGBoost is that every new tree it adds is the best possible addition at that step, so that the final result approaches an optimal solution. In addition, XGBoost adds a penalty for complexity, that is, a regularization term, which involves the number of leaf nodes in the tree and the sum of the squares (L2 norm) of the scores output on each leaf node.
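For reference, the complexity penalty just described is usually written as (the standard XGBoost formulation, not reproduced from the original):

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

where $T$ is the number of leaves in the tree and $w_j$ is the score on leaf $j$.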

Algorithm

Advantages

1. Compared with gradient boosting, XGBoost is faster, since each weight update in XGBoost is a Newton step, which does not need a line search; the step length is naturally 1.

2. It has an advantage in feature ranking: XGBoost sorts the data and stores the result in a block structure before training, and this block data can be reused in further boosting rounds.

3. XGBoost deals with the bias-variance tradeoff: the regularization term controls the complexity of the model and avoids overfitting.

In summary, XGB is a very useful algorithm.

Python implementation:

(Select parameters)

How to tune the parameters is a very important step; within a given range of candidate values, Python will automatically select the best combination.
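A minimal sketch of such tuning with `GridSearchCV` and the `xgboost` scikit-learn wrapper, on the assumed synthetic split; the parameter grid below is illustrative, not the original's:

```python
# Grid search over a small, illustrative XGBoost parameter grid;
# GridSearchCV picks the combination with the best cross-validated RMSE.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1, 0.3],
    "n_estimators": [100, 300],
}
search = GridSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)
print("test RMSE:", -search.score(X_test, y_test))
```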

Here you can refer to related blogs (52557382)

Data source: (52557382)

4. LightGBM

LightGBM is an algorithm introduced by Microsoft in 2016 as an improvement over XGBoost. It mainly improves the speed of the XGBoost algorithm, and the corresponding cost is some loss of precision.

Algorithm

The algorithm is similar to XGBoost, except for the direction of tree growth: LightGBM grows trees leaf-wise, while the other traditional algorithms grow trees depth-wise. Its feature-parallel learning, which differs most from the other algorithms, works as follows:

1. Each worker finds the local best split point {feature, threshold} on its local feature set.

2. The workers communicate their local best splits with each other and select the best one.

3. The optimal split is performed.

Advantages

1. Optimized in speed and memory usage, especially when training on large amounts of data.

2. Optimized in accuracy: unlike most tree learning algorithms, LightGBM does not grow trees depth-wise; it grows trees leaf-wise, which can overfit when the data is small, so the tree size may need to be limited.

3. Optimal splits for categorical features: LightGBM sorts the histogram by accumulated values and then finds the best split on the sorted histogram.

Python implementation:
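A minimal sketch with the `lightgbm` scikit-learn API, on the assumed synthetic split (illustrative, not the original's code); `num_leaves` bounds the leaf-wise growth discussed above:

```python
# Train a LightGBM regressor; num_leaves limits the leaf-wise tree growth.
from lightgbm import LGBMRegressor

lgbm = LGBMRegressor(num_leaves=31, learning_rate=0.1, n_estimators=200)
lgbm.fit(X_train, y_train, eval_set=[(X_test, y_test)])
print("test R^2:", lgbm.score(X_test, y_test))
```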

(All the English parts of this article are from a report completed together with team members during the learning process.)

