A ramble on machine learning

A data mining / machine learning project typically consists of four key stages: data analysis, feature engineering, model building, and validation.

1 Data Analysis

In a broad sense, data analysis includes data collection, data processing and cleansing, exploratory data analysis, modeling and algorithm design, data visualization, and so on [1]. In the narrow sense, data analysis refers to exploratory data analysis (EDA).

Exploratory data analysis (hereinafter EDA) refers to analyzing existing data (especially raw data from surveys or observations) with as few prior assumptions as possible, exploring the structure and regularities of the data through plotting, tabulation, equation fitting, and the calculation of summary statistics [2].

Commonly used data analysis tools include Excel, SPSS, Python, and R. The tools I use most often are Excel and Python.

What can data analysis do?

1. Compute the range, quartiles, percentiles, and other summary statistics of each attribute in the data.

2. Sort by a single attribute or by multiple attributes and take the top n or bottom n records.

3. Filter by conditions, combine multiple conditions, and take intersections and differences of the filtered sets.

4. Use charts to view the distribution of an attribute over a given data set, such as box plots, histograms, and line charts.

5. Use scatter plots to view the correlation between two attributes.

6. Cluster analysis: find similar objects through visualization of the data. Clustering groups similar objects together so that objects within a group are highly similar while objects in different groups differ greatly [3].

7. Use scatter plots to spot outliers.
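
A minimal sketch of points 1-7 with pandas and scikit-learn, assuming a DataFrame loaded from a hypothetical file "data.csv" with hypothetical columns "price" and "sales" (none of these names come from the original text):

# Minimal EDA sketch; file and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_csv("data.csv")

print(df["price"].describe())                                        # range, quartiles, mean, std (point 1)
print(df.sort_values(["price", "sales"], ascending=False).head(10))  # multi-attribute sort, top n (point 2)
subset = df[(df["price"] > 100) & (df["sales"] < 50)]                # combined conditional filter (point 3)

df["price"].plot(kind="hist")                                        # distribution of one attribute (point 4)
df.plot(kind="box", y="price")                                       # box plot (point 4)
df.plot(kind="scatter", x="price", y="sales")                        # correlation and outliers (points 5 and 7)
plt.show()

labels = KMeans(n_clusters=3, n_init=10).fit_predict(df[["price", "sales"]])  # clustering (point 6)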

2 Feature Engineering

Feature engineering is closely related to domain knowledge and requires an in-depth understanding of the current business. Features can be divided into two categories: sparse features and dense features. A sparse feature is one for which very few attribute values are nonzero.

2.1 Design features

Taking product recommendation as an example, three kinds of basic features are constructed first.

1. The user's behavioral features toward the item: the number of times the user clicked (favorited / added to cart / purchased) the item in the last 3 days (7 days / 14 days / 30 days / all time), the time of the last click, the number of days on which the user clicked (favorited / added to cart / purchased) the item ...

2. Features of the item (brand) itself: the number of clicks (favorites / add-to-carts / purchases) in the last 7 days (30 days / all time), the number of distinct users who clicked (favorited / added to cart / purchased) the item in the last 7 days (30 days / all time), the number of repeat customers ...

3. The user's own features: the number of items purchased, the time of the first (last) visit, the time of the first (last) purchase ...
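
A hedged sketch of how such counting features could be computed with pandas; the log format and the column names user_id, item_id, action, and timestamp are assumptions for illustration, not from the original article:

# Hypothetical interaction log: one row per (user, item, action, timestamp).
import pandas as pd

log = pd.read_csv("user_item_log.csv", parse_dates=["timestamp"])
end = log["timestamp"].max()

# User-item behavioral feature: click counts in the last 7 days.
recent = log[log["timestamp"] >= end - pd.Timedelta(days=7)]
clicks_7d = (recent[recent["action"] == "click"]
             .groupby(["user_id", "item_id"]).size()
             .rename("user_item_clicks_7d"))

# User-item behavioral feature: time of the last click.
last_click = (log[log["action"] == "click"]
              .groupby(["user_id", "item_id"])["timestamp"].max()
              .rename("last_click_time"))

# Item feature: number of distinct buyers in the last 30 days.
item_buyers_30d = (log[(log["action"] == "buy") &
                       (log["timestamp"] >= end - pd.Timedelta(days=30))]
                   .groupby("item_id")["user_id"].nunique()
                   .rename("item_buyers_30d"))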

Some features are derived from the basic features and contain many strongly correlated features. For example, conversion rate: the number of times the user clicked (purchased) this item in the last month divided by the number of clicks (purchases) across all items ...

Features are usually expanded by combining basic features pairwise with division, multiplication, intersection, summation, and so on to obtain new features. One of the most common techniques is to expand the attribute values of a single feature into multiple 0-1 encoded features, commonly known as "dummy variables". It is also possible to combine the attribute values of multiple features with certain weights to form a new feature.
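
A minimal sketch of 0-1 ("dummy variable") encoding and simple pairwise feature expansion with pandas; the column names are made up for illustration:

import pandas as pd

df = pd.DataFrame({"city": ["bj", "sh", "bj"],
                   "clicks": [10, 4, 7],
                   "buys": [2, 1, 0]})

dummies = pd.get_dummies(df["city"], prefix="city")             # expand one categorical feature into 0-1 columns
df = pd.concat([df, dummies], axis=1)

df["buy_per_click"] = df["buys"] / (df["clicks"] + 1)           # pairwise division (smoothed to avoid division by zero)
df["clicks_x_buys"] = df["clicks"] * df["buys"]                 # pairwise multiplication
df["weighted_action"] = 0.3 * df["clicks"] + 0.7 * df["buys"]   # weighted combination of two features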

2.2 Normalization of features

Typically, SVM and GBDT models require features to be normalized beforehand, while RF does not. There are three common feature normalization methods.

1. Min-max normalization: x' = (x - min) / (max - min)

2. Z-score normalization: x' = (x - μ) / σ

3. Logarithmic normalization: x' = log(1 + x)
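
A short sketch of the three normalizations with NumPy (scikit-learn's MinMaxScaler and StandardScaler implement the first two):

import numpy as np

x = np.array([1.0, 5.0, 10.0, 50.0])

x_minmax = (x - x.min()) / (x.max() - x.min())   # min-max normalization, maps values into [0, 1]
x_zscore = (x - x.mean()) / x.std()              # z-score normalization, zero mean and unit variance
x_log = np.log1p(x)                              # log(1 + x), useful for long-tailed features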

2.3 Feature Selection

Informally, feature selection means choosing a subset from a large set of original features so that the model stays simple and effective. Feature selection has three advantages: 1. it enhances the generalization ability of the model and improves predictor performance; 2. it reduces the space and time consumed by the algorithm; 3. the model becomes easier to interpret.

Feature selection algorithms are divided into three categories.

1. Feature ranking, also known as filter methods. Ignoring dependencies between features, each feature is scored according to some criterion and features are selected from the highest score down. For example, compute the correlation coefficient between each feature and the target variable, and keep the top n features with the largest absolute values. Common criteria include the chi-square test (the larger the chi-square value, the more relevant the feature), information gain (the larger, the better), the Gini index (the smaller, the better), and the correlation coefficient (the larger the absolute value, the better). The main difference among the three classic decision tree models lies in the feature selection criterion: ID3 uses information gain, C4.5 uses the information gain ratio, and CART uses the Gini index.

2. Wrapper methods. There are three main wrapper methods: the forward greedy algorithm, the backward greedy algorithm, and the forward-backward algorithm. The forward greedy algorithm starts from an empty feature set and adds one feature to the set at a time until model performance no longer improves. The backward greedy algorithm starts from the full feature set and removes one feature at a time until model performance no longer improves. This kind of method has two disadvantages: it overfits easily and takes a long time to compute.

3. Embedded methods. Embedded methods are similar to wrapper methods, but they are less prone to overfitting and take less time. As an example of an embedded method, introduce an L1 regularization term during training; the features whose weights are 0 after training are the discarded features.

The feature selection methods introduced in the scikit-learn documentation [5] include removing features with low variance, the chi-square test, recursive feature elimination with cross-validation, training a linear model with an L1 regularization term and keeping the features with nonzero weights, tree-based feature selection, and so on.
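
A hedged sketch of the scikit-learn methods listed above, run on a synthetic data set; it is only an illustration of the API, not the original author's pipeline:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import (VarianceThreshold, SelectKBest, chi2,
                                       RFECV, SelectFromModel)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)
X_pos = np.abs(X)                                                             # chi2 requires non-negative features

X_var = VarianceThreshold(threshold=0.1).fit_transform(X)                     # drop low-variance features
X_chi2 = SelectKBest(chi2, k=10).fit_transform(X_pos, y)                      # chi-square test
X_rfe = RFECV(LogisticRegression(max_iter=1000), cv=5).fit_transform(X, y)    # recursive feature elimination with CV

l1_model = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
X_l1 = SelectFromModel(l1_model, prefit=True).transform(X)                    # keep features with nonzero L1 weights

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
X_tree = SelectFromModel(rf, prefit=True).transform(X)                        # tree-based feature importance selection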

3 Building a model

The four common models are LR (linear regression / logistic regression), SVM, RF, and GBDT. Each model has its own loss function, which consists of two parts: a loss term and a regularization term. Linear regression uses the squared error loss, logistic regression uses the log loss, and SVM uses the hinge loss. For classification problems, RF usually uses the Gini index as its splitting criterion (also called the evaluation criterion), and sometimes the information gain ratio. For regression problems, RF usually uses the mean squared error. For classification problems, GBDT usually uses the binomial negative log-likelihood as its loss function; for regression problems, the loss functions commonly used by GBDT include the squared error loss, the Huber loss (insensitive to outliers), the exponential loss, the log loss, and so on.
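
As a rough illustration (not from the original article), these losses and criteria map onto scikit-learn estimators roughly as follows; option names assume a recent scikit-learn release and may differ slightly across versions:

# Hedged mapping of the loss functions above onto scikit-learn estimators.
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import (RandomForestClassifier, RandomForestRegressor,
                              GradientBoostingClassifier, GradientBoostingRegressor)

lin_reg = LinearRegression()                          # squared error loss
log_reg = LogisticRegression()                        # log loss
svm_clf = LinearSVC(loss="hinge", dual=True)          # hinge loss
rf_clf = RandomForestClassifier(criterion="gini")     # Gini index as split criterion
rf_reg = RandomForestRegressor()                      # mean squared error split criterion by default
gbdt_clf = GradientBoostingClassifier()               # binomial negative log-likelihood by default
gbdt_reg = GradientBoostingRegressor(loss="huber")    # Huber loss, robust to outliers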

3.1 LR (linear regression/logistic regression)

Generally speaking, linear regression suits regression problems and logistic regression suits classification problems. The LR model is very simple, not prone to overfitting, and makes a good baseline. The quality of a linear fit is usually judged by R²: the closer R² is to 1, the better. LR is fast to compute and, combined with L1 regularization, can handle massive data with thousands of feature dimensions. The LR model is highly interpretable, widely used, and is the cornerstone of other models.
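
A minimal baseline sketch of logistic regression with L1 regularization on synthetic data (illustrative only):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# The L1 penalty drives many weights to exactly zero, which doubles as feature selection.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("nonzero weights:", (clf.coef_ != 0).sum())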

3.2 SVM

Put simply, a linear SVM finds a hyperplane on a given data set such that the distance from the support vectors (the points closest to the hyperplane) to the hyperplane is maximized. Linear SVM is suitable not only for large samples but also for small-sample classification problems. Kernel functions allow SVM to solve nonlinear problems, and the most common kernel function is the radial basis function (RBF).
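
A minimal SVC sketch with an RBF kernel; features are standardized first, in line with section 2.2 (data and parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# The RBF kernel lets the SVM learn a nonlinear decision boundary; C and gamma usually need tuning.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))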

3.3 RF

Before learning the random forest model, you must first understand the decision tree model. The deeper the tree, the more complex the model.

The advantages of the decision tree model are as follows.

1. Easy to understand and interpret; the tree can be visualized.

2. Requires little data preprocessing: no need for normalization, dummy-variable creation, and similar operations.

3. Implicitly creates many joint features and can solve nonlinear problems.

The biggest disadvantage of the decision tree model is that it overfits easily.

A random forest is made up of many different decision trees. For a given object to predict, each decision tree outputs a label, and the final result is decided by "voting": the label with the most votes wins. Random forest is an ensemble method and can also be viewed as a kind of nearest-neighbor predictor. An ensemble method combines a set of weak classifiers in a certain way to form a strong classifier.

Steps to build a single tree:

1. Sample with replacement (bootstrap); the sampled examples cover roughly 2/3 of the original data.

2. For each node, randomly select m features, and from them choose the feature and split point that give the best partition; repeat this step for subsequent nodes until all training samples at a node belong to the same class.

The error rate of a random forest depends on two things.

1. The greater the correlation between trees, the higher the overall error rate.

2. The higher the error rate of a single tree, the higher the overall error rate.

Advantages of random forests:

1. Easy to understand and interpret; the trees can be visualized.

2. Requires little data preprocessing: no need for normalization, dummy-variable creation, and similar operations.

3. Implicitly creates many joint features and can solve nonlinear problems.

4. Compared with the decision tree and GBDT models, the random forest model is less prone to overfitting.

5. Comes with a free out-of-bag (OOB) error estimate.

6. Easy to parallelize.

Disadvantages of random forests:

1. Not suitable for small samples; only suitable for large samples.

2. In most cases, the accuracy of the RF model is slightly lower than that of the GBDT model.

3. Works well when the decision boundary is rectangular (axis-aligned), but poorly when it is diagonal.
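
A minimal random forest sketch showing the built-in out-of-bag error estimate and parallel training mentioned above (synthetic data, illustrative parameters):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# bootstrap=True resamples with replacement for each tree; max_features limits the features
# considered at each split; n_jobs=-1 trains the trees in parallel.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", bootstrap=True,
                            oob_score=True, n_jobs=-1, random_state=0)
rf.fit(X, y)
print("OOB accuracy estimate:", rf.oob_score_)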

3.4 GBDT

Advantages of GBDT:

1. Can solve nonlinear problems.

2. High accuracy; especially on regression problems, GBDT usually performs better than RF.

Disadvantages of GBDT:

1. Requires some data preprocessing, such as feature normalization.

2. Compared with the RF model, the GBDT model has more parameters and is more sensitive to them.

3. GBDT models are more prone to overfitting than RF models.

4. Not easy to parallelize.
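
A minimal GBDT sketch on a synthetic regression problem; the number of trees, learning rate, and tree depth are among the parameters it is most sensitive to (values here are illustrative):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Huber loss is less sensitive to outliers than squared error; shallow trees plus a small
# learning rate are the usual way to keep GBDT from overfitting.
gbdt = GradientBoostingRegressor(loss="huber", n_estimators=300,
                                 learning_rate=0.05, max_depth=3, random_state=0)
gbdt.fit(X_train, y_train)
print("test R^2:", gbdt.score(X_test, y_test))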

4 Validation

The most common validation method is cross-validation. Sometimes, for convenience, we can also perform a simple validation: randomly split the original data into two parts, one as the training set and the other as the validation set. Train a model on the training set, then apply it to the validation set, and compute accuracy, recall, or other metrics from the predictions and the "ground truth" of the validation set.
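
A minimal sketch of both approaches, simple hold-out validation and k-fold cross-validation, on synthetic data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Simple validation: random split into a training set and a validation set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_val)
print("precision:", precision_score(y_val, pred), "recall:", recall_score(y_val, pred))

# Cross-validation: 5-fold accuracy estimate on the whole data set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("5-fold mean accuracy:", scores.mean())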

References

"1" https://en.wikipedia.org/wiki/Data_analysis

"2" http://blog.sciencenet.cn/blog-350729-662859.html

"3" https://en.wikipedia.org/wiki/Cluster_analysis

"4" http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf

"5" http://scikit-learn.org/stable/supervised_learning.html#supervised-learning

Copyright notice: This is the blogger's original article; please do not reproduce it without the blogger's permission.
