Machine Learning Practice One

Source: Internet
Author: User
Tags: numeric, relative, svm, vars, rbf kernel

Machine learning problems are divided into supervised learning problems (labeled data) and unsupervised learning problems (unlabeled data), depending on whether the data carries labels.
Supervised learning is further divided into regression problems (the predicted value is continuous) and classification problems (the predicted value is discrete), depending on whether the predicted result is continuous.
Common supervised learning algorithms: linear regression, logistic regression, KNN, decision trees, SVM, naive Bayes.
Common unsupervised learning algorithms: association rules, clustering.
Semi-supervised learning: part of the data is labeled and part is not.
Using a diagram (the algorithm cheat sheet) to choose a machine learning algorithm


When there is very little data, hand-written rules work better, because no machine learning algorithm can learn much from a handful of samples. If there is enough data, the problem falls into regression, classification, or clustering, depending on whether the target value is continuous, discrete, or absent (no labels); and if the dimensionality of the data is high, dimensionality reduction is needed first. Within classification, the choice between LinearSVC and SGD-based classifiers depends on whether all samples can be loaded into memory at once (a sketch of this decision logic appears after the outline below).

A rough outline for attacking a machine learning problem:
What do I learn from the data when I first get it? (visualization)
Choosing the most appropriate machine learning algorithm
Locating the model state (over-fitting or under-fitting) and the corresponding remedies
Feature analysis and visualization on data with many samples and dimensions
The advantages and disadvantages of the various loss functions and how to choose between them
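As a rough illustration of this selection logic (just a sketch, not scikit-learn's official cheat sheet; the thresholds and estimator names below are assumptions for illustration):

# a toy sketch of the algorithm-selection logic described above
def pick_estimator(n_samples, has_labels, target_is_continuous):
    if n_samples < 50:
        return "too little data: use hand-written rules or collect more"
    if not has_labels:
        return "clustering (e.g. KMeans)"
    if target_is_continuous:
        return "regression (e.g. Ridge or SGDRegressor)"
    # classification: the choice depends on whether all samples fit in memory at once
    return "LinearSVC" if n_samples < 100000 else "SGDClassifier"

print(pick_estimator(1000, has_labels=True, target_is_continuous=False))  # -> LinearSVC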

Data and visualization

# numpy: scientific computing toolkit
import numpy as np
# use make_classification to build 1000 samples, each with 20 features
from sklearn.datasets import make_classification
X, y = make_classification(1000, n_features=20, n_informative=2,
                           n_redundant=2, n_classes=2, random_state=0)
# store the data as a DataFrame
from pandas import DataFrame
df = DataFrame(np.hstack((X, y[:, None])), columns=list(range(20)) + ["class"])
df[:6]

import matplotlib.pyplot as plt
import seaborn as sns
# use pairplot to inspect how the data is distributed over pairs of dimensions
_ = sns.pairplot(df[:50], vars=[8, 11, 12, 14, 19], hue="class", size=1.5)
plt.show()


From the scatter plots and histograms we can see that some dimensions separate the classes better than others; for example, dimensions 11 and 14 look quite discriminative, and viewed in those two dimensions the data appears roughly separable. Dimensions 12 and 19 show a strong negative correlation.
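As a quick numeric check of these visual impressions (an addition, not in the original post), we can print the pairwise correlations of the dimensions singled out above:

# correlation matrix for the inspected dimensions plus the class label
print(df[[11, 14, 12, 19, "class"]].corr())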

seaborn.pairplot(data, hue=None, hue_order=None, palette=None, vars=None, x_vars=None, y_vars=None, kind='scatter', diag_kind='hist', markers=None, size=2.5, aspect=1, dropna=True, plot_kws=None, diag_kws=None, grid_kws=None)

Data parameters
data: the DataFrame containing the data.
vars: list of variable names (numeric) to use from data; by default all variables are used.
{x, y}_vars: lists of variable names (numeric) to use from data for the columns and rows of the grid.
dropna: boolean, optional; drop missing values before plotting.
Special parameters
kind: {'scatter', 'reg'}, optional; kind of plot for the non-identity relationships.
diag_kind: {'hist', 'kde'}, optional; kind of plot for the diagonal subplots.
Basic parameters
size: the height (in inches) of each facet (the subplots are square).
hue: name of the variable in data used to color points by category (string).
hue_order: list of strings; order of the hue levels in the palette.
palette: the color palette.
markers: list of marker shapes, one per hue level.
aspect: scalar, optional; aspect * size gives the width (in inches) of each facet.
{plot, diag, grid}_kws: dicts of additional keyword arguments passed through to the underlying plotting functions.
Returns: a PairGrid object.
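As a small illustration of the x_vars/y_vars parameters described above (this example is not in the original post; it reuses the synthetic df built earlier):

# restrict the grid: two feature columns on the x axis, two on the y axis
_ = sns.pairplot(df[:100], x_vars=[11, 14], y_vars=[12, 19], hue="class", size=2)
plt.show()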

1. Scatter plot

from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style='ticks', color_codes=True)
iris = sns.load_dataset("iris")
sns.pairplot(iris)
plt.show()

2. Coloring the scatter plot by a categorical variable

sns.pairplot(iris, hue="species")
plt.show()

Using a color palette

sns.pairplot(iris, hue="species", palette="husl")
plt.show()

Using different marker shapes

sns.pairplot(iris, hue="species", markers=["o", "s", "D"])
plt.show()

3. Changing the diagonal plots

Using KDE

sns.pairplot(iris, diag_kind="kde")
plt.show()

Using regression

sns.pairplot(iris, kind="reg")
plt.show()

Changing the appearance of the points by passing extra keyword arguments, for example setting edgecolor through plot_kws (a sketch follows below)
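A minimal sketch of this (the specific keyword values below are just an illustration):

# pass extra keyword arguments through to the underlying scatter calls via plot_kws
sns.pairplot(iris, hue="species", plot_kws={"edgecolor": "white", "linewidth": 0.5})
plt.show()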

Machine learning algorithm selection

We have only 1000 samples, it is a classification problem, and it is supervised learning, so following the algorithm cheat sheet we use LinearSVC (support vector classification with a linear kernel). Note that LinearSVC needs a regularization method to mitigate overfitting; here we choose the most common L2 regularization and set the penalty coefficient C to 10. Let's adapt the learning-curve plotting function from the sklearn examples and draw the scores on the training set and the cross-validation set.

from sklearn.svm import LinearSVC
from sklearn.learning_curve import learning_curve

# plot the learning curve to diagnose the state of the model
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        train_sizes=np.linspace(.1, 1.0, 5)):
    """Plot the learning curve of an estimator on a dataset.

    Parameters
    ----------
    estimator : the classifier to evaluate
    title : title of the chart
    X : training vectors, shape (n_samples, n_features)
    y : target vector relative to X (class labels or regression targets)
    ylim : tuple (ymin, ymax), limits for the y axis of the plot
    cv : number of folds used for cross-validation; one fold acts as the
        validation set and the remaining n-1 folds as the training set
        (default 3)
    train_sizes : array-like, shape (n_ticks,), dtype float or int;
        absolute or relative sizes of the training subsets used to generate
        the learning curve; floats are treated as fractions of the maximum
        training-set size
    n_jobs : number of jobs run in parallel

    learning_curve returns
    ----------------------
    train_sizes_abs : numbers of training samples actually used
        (duplicates removed), shape (n_unique_ticks,)
    train_scores : scores on the training sets
    test_scores : scores on the test sets

    Notes
    -----
    np.linspace(1, 10) splits the interval 1-10 into 50 evenly spaced values;
    np.linspace(1, 10, 10) splits it into 10.
    np.mean(a) with no axis averages all m*n values; axis=0 averages each
    column (returns 1*n); axis=1 averages each row (returns m*1).
    plt.fill_between fills the area between two curves.
    """
    plt.figure()
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=5, n_jobs=1, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std,
                     alpha=0.5, color='r')
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std,
                     alpha=0.5, color='g')
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r',
             label='Training score')
    plt.plot(train_sizes, test_scores_mean, 'o-', color='g',
             label='Cross-validation score')
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    plt.legend(loc='best')   # place the legend at the best position
    plt.grid('on')           # show the grid
    if ylim:
        plt.ylim(ylim)       # set the displayed range of the y axis
    plt.title(title)
    plt.show()

# plot the learning curve for a small number of training samples
plot_learning_curve(LinearSVC(C=10.0), "LinearSVC(C=10.0)", X, y,
                    ylim=(0.8, 1.01), train_sizes=np.linspace(.05, 0.2, 5))


Although the cross-validation score rises as the training set grows, the gap between the training score and the cross-validation score is still very large. This means the model is in an over-fitted state. How can overfitting be relieved? 1. Increase the sample size.
Overfitting happens because the model tries too hard to memorize the distribution of the training samples; increasing the sample size makes the training set more representative of the true distribution and reduces the influence of noise on the whole.

# increase the size of the training set
plot_learning_curve(LinearSVC(C=10.0), "LinearSVC(C=10.0)", X, y,
                    ylim=(0.8, 1.01), train_sizes=np.linspace(.1, 1.0, 5))


Increasing the sample size brings the training score and the cross-validation score much closer together. Although the training accuracy is lower than in the overfitted case, the cross-validation accuracy now exceeds 90% (versus below 90% when overfitting), so the model generalizes better and is closer to what we would see in practice. The most direct way to increase the sample size is to collect new data in the same application scenario; if that is not possible, new data can be generated artificially from the existing data (for example, in image recognition we can rotate or mirror the images), though this carries some risk, so collecting real data is strongly recommended. 2. Reduce the number of features.
For example, the earlier visualization showed that dimensions 11 and 14 are very useful for telling the classes apart, so we could use only those two.

# reduce the number of features
plot_learning_curve(LinearSVC(C=10.0), "LinearSVC(C=10.0) Features 11&14",
                    X[:, [11, 14]], y, ylim=(0.8, 1.01),
                    train_sizes=np.linspace(0.2, 1.0, 5))


As the plot shows, the overfitting has been alleviated. But here we picked dimensions 11 and 14 by hand after looking at the data. Can this be done automatically? Yes: we can also traverse candidate features to select the best ones (feasible only when the dimensionality is not very high, otherwise it gets very time-consuming), for example with SelectKBest:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
# SelectKBest(f_classif, k=2) keeps the best k=2 features according to the ANOVA F-value
plot_learning_curve(
    Pipeline([("fs", SelectKBest(f_classif, k=2)),  # select 2 features
              ("svc", LinearSVC(C=10.0))]),
    "SelectKBest(f_classif, k=2) + LinearSVC(C=10.0)",
    X, y, ylim=(0.8, 1.0),
    train_sizes=np.linspace(0.05, 0.2, 5))


Feature selection works here because it reduces the complexity of the model, making it harder for the model to fit the noise. From the same angle, we could also (1) reduce the degree of the polynomial in a polynomial model, (2) reduce the number of layers and the number of nodes per layer in a neural network, or (3) increase the bandwidth of the RBF kernel in an SVM (see the sketch below).
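As a minimal illustration of point (3): in scikit-learn's SVC the RBF bandwidth is controlled through gamma, and a larger bandwidth corresponds to a smaller gamma, so a less complex RBF model uses a smaller gamma (the specific values below are arbitrary):

from sklearn.svm import SVC

# narrow bandwidth (large gamma): a very flexible, more complex decision boundary
complex_rbf = SVC(kernel='rbf', gamma=10.0)
# wide bandwidth (small gamma): a smoother, less complex decision boundary
simple_rbf = SVC(kernel='rbf', gamma=0.01)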
That said, we do not particularly recommend reducing the feature count by arbitrarily dropping or merging dimensions.
We generally prefer the following approach instead: strengthen the regularization (for LinearSVC this means decreasing the value of C).
Regularization is the most effective way to reduce overfitting.

plot_learning_curve(LinearSVC(C=0.1), "LinearSVC(C=0.1)", X, y,
                    ylim=(0.8, 1.0), train_sizes=np.linspace(.05, 0.2, 5))


Tuning the regularization coefficient relieves the overfitting to some degree, but there is still a problem: we chose the coefficient by hand. Can the parameter be selected automatically? Yes: we can run a grid search over a cross-validation set to find the best regularization coefficient (for large datasets we still need to think about running time, as this can be a bit slow):

from sklearn.grid_search import GridSearchCV
estm = GridSearchCV(LinearSVC(),
                    param_grid={"C": [0.001, 0.01, 0.1, 1, 10]})
plot_learning_curve(estm, "LinearSVC(AUTO)",
                    X, y, ylim=(0.8, 1.0),
                    train_sizes=np.linspace(.05, 0.2, 5))
print "Chosen params on datapoints: %s" % estm.fit(X[:500], y[:500]).best_params_

Chosen params on datapoints: {'C': 0.001}

For feature selection, we used SelectKBest from sklearn.feature_selection, and we also mentioned that in high-dimensional settings that process can be slow. Is there another way to select features, for example letting the classifier itself identify which features contribute to the final result? Here is a small trick used in practical work.
Recall that L2 regularization tends to spread the weight across all dimensions rather than letting a few dimensions carry particularly large weights, whereas L1 regularization drives the weight vector to be sparse: features that contribute little to the result simply receive zero weight.
Based on this, we can replace the L2 regularization in LinearSVC with L1 regularization and let the model automatically decide which features should keep non-zero weights.

plot_learning_curve(LinearSVC(C=0.1, penalty="l1", dual=False), "LinearSVC(C=0.1)",
                    X, y, ylim=(0.8, 1.0), train_sizes=np.linspace(0.05, 0.2, 5))


Let's take a look at the weights we end up with:

est = LinearSVC(C=0.1, penalty="l1", dual=False)
est.fit(X[:450], y[:450])  # train on 450 points
print "Coefficients learned: %s" % est.coef_
print "Non-zero coefficients: %s" % np.nonzero(est.coef_)[1]

The output:

Coefficients learned: [[ 0.00000000e+00  0.00000000e+00  0.00000000e+00 -3.22356818e-02
  -1.66067083e-02  4.41395568e-03 -4.32411821e-02  3.85080374e-02
   0.00000000e+00  0.00000000e+00  6.27285423e-02  1.22238201e+00
   1.18925402e-01 -9.43028923e-04  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  9.27597250e-02  0.00000000e+00]]
Non-zero coefficients: [ 3  4  5  6  7 10 11 12 13 18]

Dimensions 3, 4, 5, 6, 7, 10, 11, 12, 13 and 18 receive non-zero weights, and the weight on dimension 11 (about 1.22) is by far the largest, indicating that it has the greatest influence.
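To confirm this programmatically rather than by eye (a small addition, not in the original post):

# rank feature indices by the absolute value of their learned weights, largest first
ranked = np.argsort(np.abs(est.coef_).ravel())[::-1]
print "features ranked by |weight|: %s" % ranked[:5]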
Locating and fixing under-fitting
We now randomly generate another dataset of 1000 samples (with a distribution different from before) and again use LinearSVC to classify it.

# build a ring-shaped dataset
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=1000, random_state=2)
# draw the learning curve
plot_learning_curve(LinearSVC(C=0.25), "LinearSVC(C=0.25)", X, y,
                    ylim=(.5, 1.0), train_sizes=np.linspace(.1, 1.0, 5))


For a binary classification problem, random guessing already gives an accuracy of about 0.5, and our model is barely doing better than that.
Do not blindly collect more data or tune the regularization parameter. The learning curve shows that both the training accuracy and the cross-validation accuracy are very low, which corresponds to an under-fitted state.
Let's go back to the data and visualize it:

df = DataFrame(np.hstack((X, y[:, None])), columns=list(range(2)) + ["class"])
_ = sns.pairplot(df, vars=[0, 1], hue="class", size=3.5)
plt.show()


You can see that this data cannot be separated by a linear boundary at all, so collecting more data or adjusting the regularization parameter is useless.
How do we fix under-fitting? 1. Adjust the features (find more informative features).
Let's first map the data into a new feature space:

# add the sum of squares of the original features as a new feature
X_extra = np.hstack((X, X[:, [0]]**2 + X[:, [1]]**2))
plot_learning_curve(LinearSVC(C=0.25), "LinearSVC(C=0.25)", X_extra, y,
                    ylim=(0.5, 1.0), train_sizes=np.linspace(.1, 1.0, 5))


This shows that the choice of features has a huge influence on the accuracy of the result, so proper feature engineering is well worth the effort. 2. Use a more complex model (for example, a non-linear kernel function).
We tweak the model a little and use a non-linear RBF kernel instead:

from sklearn.svm import SVC
# note: we use the original X without the extra feature
plot_learning_curve(SVC(C=2.5, kernel='rbf', gamma=1.0),
                    "SVC(C=2.5, kernel='rbf', gamma=1.0)",
                    X, y, ylim=(0.8, 1.0), train_sizes=np.linspace(.1, 1.0, 5))


It works great.
About large sample sets and high-dimensional feature spaces

This time we generate a new dataset, but with more samples, higher-dimensional features, and more classes.
Model selection and learning curves on big data
On data of this size LinearSVC may be a bit slow, so the algorithm cheat sheet recommends SGDClassifier instead. It is essentially still a linear model, but it is trained with stochastic gradient descent, so each update only touches part of the data and it converges much faster. SGDClassifier is quite sensitive to the scale of the features, so the data should be rescaled before it is handed to the model; scikit-learn makes this very easy (see the sketch below).
Because SGDClassifier trains on only one mini-batch of data at a time, ordinary cross-validation is not a good fit here. Instead we use the corresponding progressive validation: each time the estimator receives a batch it has not been trained on, it is evaluated on that batch first and then trained on it, which lets us check whether the model keeps improving from batch to batch.
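A minimal sketch of that rescaling step (this snippet is an illustration added here, not part of the original listing):

from sklearn.preprocessing import StandardScaler

# standardize each feature to zero mean and unit variance before handing the
# data to SGDClassifier, which is sensitive to feature scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)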

# generate a large, high-dimensional sample
X, y = make_classification(200000, n_features=200, n_informative=25,
                           n_redundant=0, n_classes=10, class_sep=2,
                           random_state=0)
# train with SGDClassifier and compare each batch's score before and after
# training on that batch
from sklearn.linear_model import SGDClassifier
est = SGDClassifier(penalty="l2", alpha=0.001)
progressive_validation_score = []
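The original listing breaks off at this point. A minimal sketch of the progressive-validation loop described above, using SGDClassifier.partial_fit, might look like the following (the batch size of 1000 and the plotting details are assumptions):

train_score = []
batch = 1000
for start in range(0, X.shape[0] - batch, batch):
    X_batch, y_batch = X[start:start + batch], y[start:start + batch]
    if start > 0:
        # evaluate on the new batch *before* training on it
        progressive_validation_score.append(est.score(X_batch, y_batch))
    est.partial_fit(X_batch, y_batch, classes=np.unique(y))
    if start > 0:
        # evaluate again *after* training on the same batch
        train_score.append(est.score(X_batch, y_batch))

plt.plot(train_score, label="train score")
plt.plot(progressive_validation_score, label="progressive validation score")
plt.xlabel("Mini-batch")
plt.ylabel("Score")
plt.legend(loc='best')
plt.show()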
