With the data in hand, the rest is assembly-line work: use a machine learning algorithm to learn a model, use the model to predict, and evaluate the model's performance.
1 Splitting the training set and the test set
Python's machine learning package sklearn (scikit-learn) is very powerful: it includes not only supervised and unsupervised learning algorithms but also functions for common preprocessing and other steps. Splitting the data into a training set and a test set is simple, but sklearn provides a function for that too.
By convention, X holds the feature data for a number of samples. It can be a Python list, in which case X is a nested list, len(X) is the number of samples, and X[0] is the feature vector of one sample.
y usually holds the target values. In this problem we need to predict whether a small square picture contains a number, so each target value is True or False.
from sklearn.model_selection import train_test_split  # sklearn.cross_validation before version 0.18
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
The code above uses 20% of the data for testing and the remaining 80% for training the model.
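To make the split concrete, here is a minimal sketch on made-up data. The toy X and y below are hypothetical stand-ins for the real picture features, and the import assumes a recent scikit-learn release in which train_test_split lives in sklearn.model_selection. The random_state argument is only there to make the shuffle repeatable.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with 4 features each, as a nested list.
X = [[i, i + 1, i + 2, i + 3] for i in range(100)]
# Hypothetical targets: True for even-numbered samples, False otherwise.
y = [i % 2 == 0 for i in range(100)]

# Hold out 20% of the samples for testing; the rest is used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(len(X_train), len(X_test))
```

With 100 samples this leaves 80 for training and 20 for testing.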
2 Training the model
The learning algorithms in sklearn share a very uniform interface, so replacing one learning algorithm with another takes only a small change.
from sklearn import linear_model
clf = linear_model.LogisticRegression()
Here we use the simplest option, logistic regression. After importing the package, the class has to be instantiated to get a classifier. Some learning algorithms take important parameters at initialization; for a support vector machine, for example, you typically choose the kernel type. Logistic regression works with its defaults.
clf.fit(X_train, y_train)
This trains the model. Different algorithms do completely different things behind this step, but the interface in sklearn is the same. With this call, model fitting (learning) is complete.
y_predict = clf.predict(X_test)
The classifier's predict method predicts the samples of the test set and returns the predictions as y_predict.
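The uniform interface can be sketched as follows: the same fit and predict calls work for logistic regression and for a support vector machine. The random data and the target rule are hypothetical, and the import assumes a recent scikit-learn release; this is an illustration, not the setup used for the real pictures.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical data: 200 samples, 5 features, target decided by a simple rule.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + X[:, 1] > 0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

results = {}
# Only the constructor differs; fit and predict are called the same way.
for clf in (LogisticRegression(), SVC(kernel="linear")):
    clf.fit(X_train, y_train)
    y_predict = clf.predict(X_test)
    # Fraction of test samples predicted correctly.
    results[type(clf).__name__] = float(np.mean(y_predict == y_test))

print(results)
```

Swapping in another classifier only means changing the constructor in the loop.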
3 Evaluating the performance of a model
You don't need to worry about this step either; sklearn has it all arranged.
from sklearn import metrics
print('confusion matrix:\n%s' % metrics.confusion_matrix(y_test, y_predict))
A systematic evaluation of model performance uses a confusion matrix, which contains the numbers of true-positive, true-negative, false-positive, and false-negative samples.
The output looks like this:
                  | Predicted False    | Predicted True
Real False        | True Negative 76   | False Positive 1
Real True         | False Negative 4   | True Positive 56
Of course, I added the headers and labels to the table myself.
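When reading the raw output, it helps to know that scikit-learn puts the true labels on the rows and the predicted labels on the columns. A small sketch with made-up labels (the two lists are hypothetical) shows how to unpack the four cells:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for six samples.
y_true = [False, False, False, True, True, True]
y_pred = [False, False, True, True, True, False]

# Rows are true labels, columns are predicted labels, in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=[False, True])

# Flattening row by row yields TN, FP, FN, TP for a two-class problem.
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)
```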
For a long time I was puzzled by what true and false, positive and negative mean here; the definitions seem to be considered too simple to write down in books. But I think many people still don't understand them, and perhaps the Chinese translation adds another layer of confusion.
Positive and negative refer to the results of the predictions.
Positive means the prediction is True. In a two-class problem the positive class is usually the rarer one, such as having a disease or, here, the picture containing a number. Negative means the prediction is False.
True means the prediction is correct: a true positive is predicted positive and really is positive, and a true negative is predicted negative and really is negative. False means the prediction is wrong: a false positive is predicted positive but is really negative, and a false negative is predicted negative but is really positive.
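These definitions can be written down directly as a counting rule. The helper name confusion_counts and the sample labels are hypothetical, chosen only to illustrate the naming: the second letter comes from the prediction, the first from whether the prediction was right.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for boolean labels."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        correct = "T" if truth == pred else "F"   # was the prediction right?
        sign = "P" if pred else "N"               # what was predicted?
        counts[correct + sign] += 1
    return counts


# Hypothetical labels for six samples.
y_true = [True, True, False, False, True, False]
y_pred = [True, False, False, True, True, False]

counts = confusion_counts(y_true, y_pred)
print(counts)
```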
I didn't expect that with a simple logistic regression most predictions would be "true", that is, correct.
The raw counts of true positives, true negatives, false positives, and false negatives are not intuitive enough for evaluating a model, so we go a step further and compute precision and recall (I call recall the detection rate).
print('Classification report for classifier %s:\n%s\n' % (clf, metrics.classification_report(y_test, y_predict)))
             precision    recall  f1-score   support
      False       0.95      0.99      0.97        77
       True       0.98      0.93      0.96        60
avg / total       0.96      0.96      0.96       137
Precision is the fraction of predictions that are correct. It is computed separately for predictions of True and predictions of False.
Precision(True) = true positive / (true positive + false positive)
Recall is the fraction of the samples that really belong to a class that are predicted correctly.
Recall(True) = true positive / (true positive + false negative)
F1-score combines precision and recall into a single number; if either recall or precision is close to 0, the F1-score will be very small.
F1-score = 2 * precision * recall / (precision + recall)
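As a worked check, the formulas can be applied by hand. The counts below (56 true positives, 1 false positive, 4 false negatives, 76 true negatives) are the ones consistent with the classification report above; the function name precision_recall_f1 is a hypothetical helper for illustration.

```python
def precision_recall_f1(hits, wrong_predictions, missed):
    """Apply the three formulas to one class.

    hits: samples of the class that were predicted as the class
    wrong_predictions: samples predicted as the class that belong to the other
    missed: samples of the class that were predicted as the other class
    """
    precision = hits / (hits + wrong_predictions)
    recall = hits / (hits + missed)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


tp, fp, fn, tn = 56, 1, 4, 76

# Class True: hits are true positives, wrong predictions are false positives.
true_scores = precision_recall_f1(tp, fp, fn)
# Class False: hits are true negatives, wrong predictions are false negatives.
false_scores = precision_recall_f1(tn, fn, fp)

print("True:  %.2f %.2f %.2f" % true_scores)
print("False: %.2f %.2f %.2f" % false_scores)
```

Rounded to two decimals, both rows agree with the report.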
Overall, logistic regression performed better than I expected: precision and recall (the detection rate) both average 96%.
4 The magic behind the magic
Logistic regression is a very simple linear model. Like regression analysis, it multiplies each feature value by a corresponding coefficient and adds the results up (plus an intercept). If the sum is greater than 0, the square is considered to contain a number; if the sum is less than 0, it is considered not to.
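The decision rule is short enough to sketch in plain Python. The function name contains_number, the 4-pixel squares, and the coefficients are all made up for illustration; a real model has one coefficient per pixel of the picture.

```python
def contains_number(pixels, coefficients, intercept=0.0):
    """Return True when the weighted sum of the pixel values is above 0."""
    score = intercept + sum(w * x for w, x in zip(coefficients, pixels))
    return score > 0


# Hypothetical 4-pixel squares and made-up coefficients.
coefficients = [0.5, -0.25, 0.1, -0.4]
bright_square = [1.0, 0.0, 1.0, 0.0]   # score = 0.5 + 0.1 = 0.6
dark_square = [0.0, 1.0, 0.0, 1.0]     # score = -0.25 - 0.4 = -0.65

print(contains_number(bright_square, coefficients))
print(contains_number(dark_square, coefficients))
```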
clf.coef_
This returns the coefficients of the model. Each coefficient corresponds to one feature of the image being analyzed, that is, one pixel.
array([[-2.56139945e-03, -2.75258012e-03, -5.31968495e-04, ..., 3.62493853e-03, 5.82427193e-03, -7.32347050e-03]])
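To check that these coefficients really are the whole model, one can recompute the predictions from coef_ and intercept_ by hand. This sketch uses hypothetical random data with 16 features standing in for pixels, not the real pictures.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 200 squares of 16 "pixels" each, with a made-up target.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 16))
y = X[:, 0] - X[:, 5] > 0

clf = LogisticRegression().fit(X, y)

# For a two-class problem coef_ has one row with one coefficient per feature.
print(clf.coef_.shape)

# Weighted sum of the features plus the intercept, one score per sample.
scores = (X @ clf.coef_.T + clf.intercept_).ravel()
manual_predict = scores > 0

# The sign of the score reproduces the classifier's own predictions.
print(np.array_equal(manual_predict, clf.predict(X)))
```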
If you don't believe the results are that good, look at the predictions for a verification code image that has been split into small squares.
A red border indicates that the small square contains a number.
I have to say, the first day in the department felt really good.
Identification code: Where to look for numbers (ii)