Note: This tutorial is based on my experience trying out scikit-learn. Scikit-learn is really easy to get started with, simple and practical; 30 minutes should be enough to learn to call the basic regression methods and the ensemble methods.
This article mainly refers to the official website of Scikit-learn.
Preface: This tutorial mainly uses the most basic features of numpy (to generate the data), matplotlib (for drawing), and Scikit-learn (to call the machine learning methods). If you're not familiar with them (I'm not very familiar with them myself), it's fine to skim the simplest NumPy and Matplotlib tutorials first. The program for this tutorial does not exceed 50 lines.
1. Data Preparation
For the experiment, I wrote a bivariate function, y = 0.5*np.sin(x1) + 0.5*np.cos(x2) + 0.1*x1 + 3. The range of x1 is 0~50 and the range of x2 is -10~10; the training set has 500 (x1, x2) pairs in total and the test set has 100. Noise in the range -0.5~0.5 is added to the training set's y values. The code for generating the data is as follows:
import numpy as np

def f(x1, x2):
    y = 0.5 * np.sin(x1) + 0.5 * np.cos(x2) + 0.1 * x1 + 3
    return y

def load_data():
    x1_train = np.linspace(0, 50, 500)
    x2_train = np.linspace(-10, 10, 500)
    # training targets carry uniform noise in [-0.5, 0.5)
    data_train = np.array([[x1, x2, f(x1, x2) + (np.random.random() - 0.5)]
                           for x1, x2 in zip(x1_train, x2_train)])
    x1_test = np.linspace(0, 50, 100) + 0.5 * np.random.random(100)
    x2_test = np.linspace(-10, 10, 100) + 0.02 * np.random.random(100)
    # test targets are the exact function values (no noise)
    data_test = np.array([[x1, x2, f(x1, x2)]
                          for x1, x2 in zip(x1_test, x2_test)])
    return data_train, data_test
The images of the training set (with random noise of -0.5~0.5 on y) and the test set (no noise) are as follows:
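The original figures are not reproduced here, so below is a minimal sketch (my addition, not part of the original post) that plots both data sets in 3D, assuming the f() and load_data() defined above:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # needed on older matplotlib for 3D axes

train, test = load_data()
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(train[:, 0], train[:, 1], train[:, 2], c='b', marker='.', label='train (noisy y)')
ax.scatter(test[:, 0], test[:, 1], test[:, 2], c='r', marker='^', label='test (exact y)')
ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_zlabel('y')
ax.legend()
plt.show()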
2. The simplest introduction to Scikit-learn
Scikit-learn is very simple to use: instantiate an estimator object, call its fit() function to fit the training data, then use the predict() function to predict, and finally use the score() function to evaluate how close the predicted values are to the true values (for regressors this returns the R² coefficient of determination). For example, a decision tree is called as follows:
In [6]: from sklearn.tree import DecisionTreeRegressor
In [7]: clf = DecisionTreeRegressor()
In [8]: clf.fit(x_train, y_train)
Out[8]: DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
            max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
In [9]: result = clf.predict(x_test)
In [10]: score = clf.score(x_test, y_test)
In [11]: score
Out[11]: 0.96352052312508396
In [12]: result
Out[12]: array([ 2.44996735,  2.79065744,  3.21866981,  3.20188779,  3.04219101,
         2.60239551,  3.35783805,  2.40556647,  3.12082094,  2.79870458,
         2.79049667,  3.62826131,  3.66788213,  4.07241195,  4.27444808,
         4.75036169,  4.3854911 ,  4.52663074,  4.19299748,  4.42235821,
         4.48263415,  4.16192621,  4.40477767,  3.76067775,  4.35353213,
         4.6554961 ,  4.99228199,  4.29504731,  4.55211437,  5.08229167, ...
Next, we can draw an image comparing the predicted values and the true values. The drawing code is as follows:
plt.figure()
plt.plot(np.arange(len(result)), y_test, 'go-', label='true value')
plt.plot(np.arange(len(result)), result, 'ro-', label='predict value')
plt.title('score: %f' % score)
plt.legend()
plt.show()
The image is then displayed as follows:
3. Start experimenting with various regression methods
To speed up testing, I wrote a function that takes the object of any regression class, fits it, draws the image, and reports the score. The function is as follows:
def try_different_method(clf):
    clf.fit(x_train, y_train)
    score = clf.score(x_test, y_test)
    result = clf.predict(x_test)
    plt.figure()
    plt.plot(np.arange(len(result)), y_test, 'go-', label='true value')
    plt.plot(np.arange(len(result)), result, 'ro-', label='predict value')
    plt.title('score: %f' % score)
    plt.legend()
    plt.show()
train, test = load_data()
x_train, y_train = train[:, :2], train[:, 2]  # the first two columns are x1 and x2; the third column is y, which carries random noise here
x_test, y_test = test[:, :2], test[:, 2]      # same as above, but y has no noise here
3.1 General regression methods
Conventional regression methods include linear regression, decision tree regression, SVM, and K-nearest neighbors (KNN).
3.1.1 Linear regression
In [4]: from sklearn import linear_model
In [5]: linear_reg = linear_model.LinearRegression()
In [6]: try_different_method(linear_reg)
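As a small aside (my addition, not in the original post), a fitted LinearRegression exposes the weights it learned, which makes for a quick sanity check:

# Inspect the fitted linear model's parameters.
linear_reg.fit(x_train, y_train)
print(linear_reg.coef_)       # one weight per feature (x1, x2)
print(linear_reg.intercept_)  # the bias term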
3.1.2 Decision tree regression
from sklearn import tree
tree_reg = tree.DecisionTreeRegressor()
try_different_method(tree_reg)
The image of the decision tree regression is then displayed:
3.1.3 SVM regression
In [7]: from sklearn import svm
In [8]: svr = svm.SVR()
In [9]: try_different_method(svr)
The resulting image is as follows:
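As an optional aside (my addition, not in the original post): SVR depends heavily on its kernel and regularization settings, so the default fit can often be improved by tuning. The parameter values below are purely illustrative:

from sklearn import svm

# kernel, C and gamma are standard SVR parameters; these values are illustrative.
try_different_method(svm.SVR(kernel='rbf', C=100, gamma=0.1))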
3.1.4 KNN
In [10]: from sklearn import neighbors
In [11]: knn = neighbors.KNeighborsRegressor()
In [12]: try_different_method(knn)
Surprisingly, KNN, the most naive method here, works best.
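To back up that observation with numbers rather than eyeballed plots, here is a small comparison loop (my addition, not in the original post); it reuses the x_train/y_train/x_test/y_test arrays defined above:

from sklearn import linear_model, neighbors, svm, tree

# Fit each conventional regressor and print its R^2 score on the test set.
regressors = {
    'linear regression': linear_model.LinearRegression(),
    'decision tree': tree.DecisionTreeRegressor(),
    'SVR': svm.SVR(),
    'KNN': neighbors.KNeighborsRegressor(),
}
for name, reg in regressors.items():
    reg.fit(x_train, y_train)
    print('%s: %f' % (name, reg.score(x_test, y_test)))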
3.2 Ensemble methods (random forest, AdaBoost, GBRT)
3.2.1 Random Forest
In [13]: from sklearn import ensemble
In [14]: rf = ensemble.RandomForestRegressor(n_estimators=20)  # use 20 decision trees here
In [15]: try_different_method(rf)
3.2.2 AdaBoost
In [16]: ada = ensemble.AdaBoostRegressor(n_estimators=50)
In [17]: try_different_method(ada)
The image is as follows:
3.2.3 GBRT
In [18]: gbrt = ensemble.GradientBoostingRegressor(n_estimators=100)
In [19]: try_different_method(gbrt)
The image is as follows:
4. Scikit-learn has many other regression methods; you can refer to the user manual and test them in the same way.
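For example, any other regressor can be dropped into try_different_method() in exactly the same way; the sketch below (my addition, not in the original post) uses two further classes from sklearn.ensemble:

from sklearn import ensemble

# ExtraTreesRegressor and BaggingRegressor are standard scikit-learn classes;
# the n_estimators value is illustrative.
try_different_method(ensemble.ExtraTreesRegressor(n_estimators=20))
try_different_method(ensemble.BaggingRegressor())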
5. Complete code
I wrote this code in PyCharm, but graphics would not display in my PyCharm, so you can copy the code into IPython, using the %paste magic to paste the code block. Then import the algorithms as shown in each of the sections above and call try_different_method() to plot. The complete code is as follows:
import numpy as np
import matplotlib.pyplot as plt

def f(x1, x2):
    y = 0.5 * np.sin(x1) + 0.5 * np.cos(x2) + 3 + 0.1 * x1
    return y

def load_data():
    x1_train = np.linspace(0, 50, 500)
    x2_train = np.linspace(-10, 10, 500)
    data_train = np.array([[x1, x2, f(x1, x2) + (np.random.random() - 0.5)]
                           for x1, x2 in zip(x1_train, x2_train)])
    x1_test = np.linspace(0, 50, 100) + 0.5 * np.random.random(100)
    x2_test = np.linspace(-10, 10, 100) + 0.02 * np.random.random(100)
    data_test = np.array([[x1, x2, f(x1, x2)]
                          for x1, x2 in zip(x1_test, x2_test)])
    return data_train, data_test

train, test = load_data()
x_train, y_train = train[:, :2], train[:, 2]  # the first two columns are x1 and x2; the third is y, with random noise
x_test, y_test = test[:, :2], test[:, 2]      # same as above, but y has no noise here

def try_different_method(clf):
    clf.fit(x_train, y_train)
    score = clf.score(x_test, y_test)
    result = clf.predict(x_test)
    plt.figure()
    plt.plot(np.arange(len(result)), y_test, 'go-', label='true value')
    plt.plot(np.arange(len(result)), result, 'ro-', label='predict value')
    plt.title('score: %f' % score)
    plt.legend()
    plt.show()
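After pasting the block above into IPython, import any regressor and hand it to the helper, for example:

from sklearn import tree

try_different_method(tree.DecisionTreeRegressor())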