Data in scikit-learn
Data format: 2-D array or matrix of shape [n_samples, n_features]
Built-in datasets include: iris, digits, Boston (housing prices), and diabetes. For example:
from sklearn.datasets import load_iris
>>> iris = load_iris()  # contains iris.data and iris.target
We can call print(iris.DESCR) to view more information about the dataset.
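A minimal sketch of loading and inspecting the iris data (the shapes in the comments are those of the standard dataset):

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)    # (150, 4): [n_samples, n_features]
print(iris.target.shape)  # (150,): one label per sample
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']
print(iris.DESCR)         # full description of the dataset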
Basic principles of machine learning in scikit-learn
Linear regression:
from sklearn.linear_model import LinearRegression
Model parameters can be set at initialization, for example:
model = LinearRegression(normalize=True)
Given training data X and y, fitting the model only requires calling:
model.fit(X, y)
In addition, you can inspect the learned coefficients through the model's coef_ attribute.
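A minimal sketch, using a tiny made-up dataset (the values are illustrative only):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])  # [n_samples, n_features]
y = np.array([2.0, 4.1, 5.9, 8.0])  # targets
model = LinearRegression()
model.fit(X, y)
print(model.coef_)       # learned slope(s)
print(model.intercept_)  # learned intercept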
Nearest neighbors algorithm:
from sklearn import neighbors
The neighbors module contains the KNN model, which is created with the following call (n_neighbors sets the number of nearest neighbors):
knn = neighbors.KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
Because KNN requires no real training step, prediction simply finds the stored sample(s) nearest to the given sample and classifies accordingly:
knn.predict(x), for example x = [[3, 5, 4, 2]]
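Putting these lines together on the iris data (the predicted label in the comment is indicative, not guaranteed):

from sklearn import neighbors
from sklearn.datasets import load_iris

iris = load_iris()
knn = neighbors.KNeighborsClassifier(n_neighbors=1)
knn.fit(iris.data, iris.target)
x = [[3, 5, 4, 2]]                        # one sample with 4 features
print(iris.target_names[knn.predict(x)])  # e.g. ['versicolor']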
Linear SVM classification:
from sklearn.svm import LinearSVC
LinearSVC(loss='l1') or loss='l2' (recent scikit-learn versions name these losses 'hinge' and 'squared_hinge')
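A minimal sketch of the same call pattern, fit on the iris data for illustration:

from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris

iris = load_iris()
svc = LinearSVC()  # default loss; see the version note above on 'l1'/'l2'
svc.fit(iris.data, iris.target)
print(svc.predict([[3, 5, 4, 2]]))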
As the two examples above show, we assign an estimator of some algorithm type to a model variable; for the model to learn from the training samples, we only need to call model.fit(X, y).
For a supervised estimator, new data is predicted with model.predict(X_new).
For classification problems, some estimators provide a model.predict_proba() method that returns the probability of each class; the most likely class corresponds to the output of model.predict().
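For instance, a minimal sketch with GaussianNB (an estimator introduced further below), which supports predict_proba:

from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris

iris = load_iris()
clf = GaussianNB().fit(iris.data, iris.target)
probs = clf.predict_proba([[3, 5, 4, 2]])  # one probability per class
print(probs)                 # each row sums to 1
print(probs.argmax(axis=1))  # matches clf.predict([[3, 5, 4, 2]])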
For an unsupervised estimator, features can be transformed. An unsupervised transformation uses only statistics of the features themselves (mean, standard deviation, bounds, and so on); examples are standardization and PCA dimensionality reduction.
For example, the difference between model.transform() and model.fit_transform(X, y=None):
fit_transform needs to fit the data first. The fitting described here is not the kind that involves a target y; rather, it computes the relevant statistics (such as mean and standard deviation) from the given data, and then transforms it.
transform is generally used on test data. It does no fitting of its own; instead it applies the statistics already fitted on the training data (such as mean and standard deviation) to process the test data.
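A minimal sketch with StandardScaler as one concrete transformer (the choice of transformer is ours; the text above names none):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 25.0]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # fits mean/std on training data, then transforms it
X_test_std = scaler.transform(X_test)        # reuses the training mean/std; no refitting
print(scaler.mean_)                          # statistics learned during fit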
Other model methods can be looked up in the documentation as needed.
Data dimensionality reduction: PCA
PCA (principal component analysis) can reduce the dimensionality of the data. Taking the handwritten digits as an example:
from sklearn.decomposition import PCA
>>> pca = PCA(n_components=2)  # reduce to 2 dimensions
>>> proj = pca.fit_transform(digits.data)
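Made self-contained with the digits data, this becomes:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
pca = PCA(n_components=2)                    # reduce 64 pixel features to 2 components
proj = pca.fit_transform(digits.data)
print(digits.data.shape, '->', proj.shape)   # (1797, 64) -> (1797, 2)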
Gaussian naive Bayes classification
Gaussian naive Bayes is a simple and fast classification method; if a simple, fast method is enough to produce satisfactory results, there is no need to waste CPU resources designing a complex algorithm: sklearn.naive_bayes.GaussianNB
Gaussian naive Bayes fits a Gaussian to the data of each label, and then classifies the test data accordingly. Although this fit is a rough approximation of the real world, it still performs quite well, especially for text data.
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
train_test_split automatically splits the data at random into a training set and a test set:
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)
This algorithm is invoked the same way as the ones above; for concrete use, study its parameters:
clf = GaussianNB()
clf.fit(X_train, y_train)
When testing:
predicted = clf.predict(X_test)
Quantitative analysis of the results
There are a number of mature metrics in the module sklearn.metrics:
from sklearn import metrics
>>> print(metrics.classification_report(expected, predicted))
For classification evaluation, this returns the precision, recall, f1-score, and support.
Another is the confusion matrix, which is invoked as follows:
metrics.confusion_matrix(expected, predicted)
It can help us see where each class is misclassified.
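A minimal end-to-end sketch combining the pieces above, where expected is simply the held-out y_test:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)
clf = GaussianNB().fit(X_train, y_train)
predicted = clf.predict(X_test)
expected = y_test

print(metrics.classification_report(expected, predicted))  # per-class precision/recall/f1/support
print(metrics.confusion_matrix(expected, predicted))       # rows: true class, columns: predicted class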
Sometimes we can plot the relationship between each feature dimension and the result, and manually select useful features.
Gradient boosting tree regression
Gradient boosted trees (GBT) are a very powerful tree-based regression method.
from sklearn.ensemble import GradientBoostingRegressor
clf = GradientBoostingRegressor()
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
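A self-contained sketch on the diabetes regression data (the dataset and the mean-squared-error metric are our choices for illustration):

from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics

data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)
clf = GradientBoostingRegressor()
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
print(metrics.mean_squared_error(y_test, predicted))  # lower is better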
In addition:
Decision tree regression:
from sklearn.tree import DecisionTreeRegressor
To be continued......