The general workflow of a traditional machine learning task, from raw data to a working model, is: get the data, preprocess it, train a model, evaluate the model, and use it for prediction (e.g., classification).
1. Get Data
1.1 Importing sklearn datasets
sklearn ships with a number of high-quality datasets. While learning machine learning, you can use them to try out different models, which builds hands-on skill and deepens your grasp of the theory. (I still need more practice here myself, so let's keep at it together! ^-^)
First, to use an sklearn dataset you must import the datasets module:
from sklearn import datasets
This module contains most of sklearn's built-in datasets and the functions that load them. Here we take the iris dataset as an example:
iris = datasets.load_iris()   # load the dataset
X = iris.data                 # feature matrix
y = iris.target               # sample labels
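As a quick sanity check (a minimal sketch that assumes only the code above), you can inspect the shapes: iris has 150 samples, 4 features, and 3 classes.

print(X.shape)               # (150, 4): 150 samples, 4 features
print(y.shape)               # (150,): one label per sample
print(iris.target_names)     # ['setosa' 'versicolor' 'virginica']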
1.2 Creating datasets
In addition to sklearn's built-in datasets, you can also generate your own training samples (see the Dataset loading utilities in the sklearn documentation for details). Here is a brief introduction: sklearn's sample generators provide a number of ways to create synthetic data.
Let's take an example of a sample generator for classification problems:
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=6, n_features=5, n_informative=2,
                           n_redundant=2, n_classes=2, n_clusters_per_class=2,
                           scale=1.0, random_state=20)
# n_samples:    number of samples
# n_features:   number of features
# n_classes:    number of classes
# random_state: random seed, so the generated data is reproducible
>>> for x_, y_ in zip(X, y):
...     print(y_, end=': ')
...     print(x_)
0: [-0.6600737  -0.0558978   0.82286793  1.1003977  -0.93493796]
1: [ 0.4113583   0.06249216 -0.90760075 -1.41296696  2.059838  ]
1: [ 1.52452016 -0.01867812  0.20900899  1.34422289 -1.61299022]
0: [-1.25725859  0.02347952 -0.28764782 -1.32091378 -0.88549315]
0: [-3.28323172  0.03899168 -0.43251277 -2.86249859 -1.10457948]
1: [ 1.68841011  0.06754955 -1.02805579 -0.83132182  0.93286635]
2. Data preprocessing
Data preprocessing is an indispensable step in machine learning: it puts the data into a form that the model or estimator can learn from effectively. Let's look at the preprocessing functions we commonly use in sklearn:
from sklearn import preprocessing
2.1 Data standardization (scaling)
To make sure the scaling rules learned from the training data are applied identically to the test data, the preprocessing module provides several scalers:
train_data = [[0, 0], [0, 0], [1, 1], [1, 1]]

# 1. Standardize based on mean and standard deviation
scaler = preprocessing.StandardScaler().fit(train_data)
scaler.transform(train_data)
scaler.transform(test_data)   # apply the same rule to the held-out test data

# 2. Scale each feature to a fixed range
scaler = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit(train_data)
scaler.transform(train_data)
scaler.transform(test_data)
# feature_range: the target range of the scaled values
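To make the snippet above fully runnable, here is a self-contained sketch with a hypothetical test_data array. The values in the comments follow directly from the training data's per-feature mean of 0.5 and standard deviation of 0.5.

from sklearn import preprocessing

train_data = [[0, 0], [0, 0], [1, 1], [1, 1]]
test_data = [[2, 2]]                      # hypothetical held-out sample

scaler = preprocessing.StandardScaler().fit(train_data)
print(scaler.transform(train_data))       # rows become [-1, -1], [-1, -1], [1, 1], [1, 1]
print(scaler.transform(test_data))        # [[3. 3.]]  i.e. (2 - 0.5) / 0.5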
2.2 Normalization (normalize)

When you want to compute the similarity of two samples, a necessary step is normalization. The idea is to first compute the p-norm of each sample and then divide every element of that sample by the norm, so that each sample ends up with norm 1.
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X_normalized = preprocessing.normalize(X, norm='l2')
>>> X_normalized
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])
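As a quick check of the claim above (a minimal sketch, assuming NumPy is available), the L2 norm of every normalized row should come out as 1:

import numpy as np
from sklearn import preprocessing

X = [[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l2')
print(np.linalg.norm(X_normalized, axis=1))   # [1. 1. 1.]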
2.3 One-hot Encoding
One-hot encoding is a way of encoding discrete (categorical) feature values; it is commonly used with the LR model to add nonlinear capacity to a linear model.
data = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
encoder = preprocessing.OneHotEncoder().fit(data)
encoder.transform(data).toarray()
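For reference, with the data above the encoder learns 2 categories for the first column, 3 for the second, and 4 for the third, so each row is expanded into 9 binary columns; the expected output is sketched below.

# encoder.transform(data).toarray() should give:
# [[1. 0. | 1. 0. 0. | 0. 0. 0. 1.]     <- row [0, 0, 3]
#  [0. 1. | 0. 1. 0. | 1. 0. 0. 0.]     <- row [1, 1, 0]
#  [1. 0. | 0. 0. 1. | 0. 1. 0. 0.]     <- row [0, 2, 1]
#  [0. 1. | 1. 0. 0. | 0. 0. 1. 0.]]    <- row [1, 0, 2]
# (the "|" marks are added here only to separate the blocks for columns 1, 2 and 3)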
3. Data set splitting
Once we have the training data, we often split it further into a training set and a validation set, which helps us select the model's parameters.
# Function: split the dataset into a training set and a test set
# Signature: train_test_split(*arrays, **options)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
"""
Parameters
---
arrays:       sample arrays containing the feature vectors and labels
test_size:    float - proportion of test samples (default: 0.25)
              int   - absolute number of test samples
train_size:   same as test_size
random_state: int - random seed (fixing the seed makes the experiment reproducible)
shuffle:      whether to shuffle the data before splitting (default: True)

Returns
---
the split arrays as a list of length 2 * len(arrays) (train-test split)
"""
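For example (a minimal sketch reusing the iris data from section 1.1), a 30% test split of 150 samples gives 105 training samples and 45 test samples:

from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)   # (105, 4) (45, 4)
print(y_train.shape, y_test.shape)   # (105,) (45,)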
4. Defining the Model
In this step we first analyze the type of our data and figure out which model to use; then we can define the model in sklearn. sklearn provides a very similar interface for all models, which makes it easy to become familiar with all of them. Before going through individual models, let's look at the attributes and methods they have in common:
# fit the model
model.fit(X_train, y_train)
# predict with the model
model.predict(X_test)
# get the parameters of the model
model.get_params()
# score the model
model.score(data_X, data_y)   # regression: R^2; classification: accuracy
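Putting the pieces together, here is a minimal end-to-end sketch on the iris data, using logistic regression (covered below in 4.2) purely as a placeholder estimator; any model from this section could be dropped in instead.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)   # raise max_iter so the solver converges
model.fit(X_train, y_train)                 # fit on the training set
print(model.predict(X_test[:5]))            # predicted labels for the first 5 test samples
print(model.score(X_test, y_test))          # accuracy on the test set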
4.1 Linear regression
from sklearn.linear_model import LinearRegression

# define the linear regression model
model = LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)
"""
Parameters
---
fit_intercept: whether to compute the intercept. False - the model has no intercept
normalize:     ignored when fit_intercept is set to False. If True, the regressors X
               are normalized before regression by subtracting the mean and dividing
               by the l2-norm
n_jobs:        number of parallel jobs
"""
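As a tiny worked example (a sketch with made-up data), fitting points that lie exactly on y = 2x recovers a slope of 2 and an intercept of 0:

from sklearn.linear_model import LinearRegression

X = [[1], [2], [3]]
y = [2, 4, 6]
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # [2.] and 0.0 (up to floating-point error)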
4.2 Logistic regression LR
from sklearn.linear_model import LogisticRegression

# define the logistic regression model
model = LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0,
                           fit_intercept=True, intercept_scaling=1, class_weight=None,
                           random_state=None, solver='liblinear', max_iter=100,
                           multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)
"""
Parameters
---
penalty:       regularization term to use (default: 'l2')
dual:          dual or primal formulation; prefer False when n_samples > n_features
               (default: False)
C:             inverse of the regularization strength; the smaller the value,
               the stronger the regularization
n_jobs:        number of parallel jobs
random_state:  random number seed
fit_intercept: whether a constant (intercept) term is needed
"""
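Besides hard class predictions, logistic regression can also return class probabilities; a short self-contained sketch on the iris data could look like this:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict(X[:3]))          # predicted class labels
print(model.predict_proba(X[:3]))    # per-class probabilities; each row sums to 1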
4.3 Naive Bayesian algorithm NB
from sklearn import naive_bayes

model = naive_bayes.GaussianNB()   # Gaussian naive Bayes
model = naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
model = naive_bayes.BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None)
"""
MultinomialNB is commonly used for text classification
Parameters
---
alpha:       smoothing parameter
fit_prior:   whether to learn class prior probabilities; False - use a uniform prior
class_prior: prior probabilities of the classes; if specified, the priors are not
             adjusted according to the data
binarize:    threshold for binarizing features (BernoulliNB); if None, the input is
             assumed to already consist of binary vectors
"""
4.4 Decision Tree DT
from sklearn import tree

model = tree.DecisionTreeClassifier(criterion='gini', max_depth=None, min_samples_split=2,
                                    min_samples_leaf=1, min_weight_fraction_leaf=0.0,
                                    max_features=None, random_state=None, max_leaf_nodes=None,
                                    min_impurity_decrease=0.0, min_impurity_split=None,
                                    class_weight=None, presort=False)
"""
Parameters
---
criterion:             feature selection criterion: gini / entropy
max_depth:             maximum depth of the tree; None - grow as deep as possible
min_samples_split:     minimum number of samples required to split an internal node
min_samples_leaf:      minimum number of samples required at a leaf node
max_features:          maximum number of features considered when searching for the best split
max_leaf_nodes:        grow a tree with at most this many leaf nodes, best-first
min_impurity_decrease: a node is split only if the split decreases the impurity by at
                       least this value
"""
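One handy attribute after fitting a tree is feature_importances_; a minimal sketch on the iris data:

from sklearn import tree
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
model = tree.DecisionTreeClassifier(random_state=0).fit(X, y)
print(model.feature_importances_)   # impurity-based importance of each of the 4 iris features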
4.5 Support Vector Machine SVM
from sklearn.svm import SVC

model = SVC(C=1.0, kernel='rbf', gamma='auto')
"""
Parameters
---
C:     penalty parameter of the error term
gamma: kernel coefficient (float); if gamma is 'auto', 1 / n_features is used instead
"""
4.6 k Nearest Neighbor algorithm KNN
from sklearn import neighbors

# define the KNN model
model = neighbors.KNeighborsClassifier(n_neighbors=5, n_jobs=1)   # classification
model = neighbors.KNeighborsRegressor(n_neighbors=5, n_jobs=1)    # regression
"""
Parameters
---
n_neighbors: number of neighbors
n_jobs:      number of parallel jobs
"""
4.7 Multi-layer perceptron (neural network)
from sklearn.neural_network import MLPClassifier

# define the multi-layer perceptron classifier
model = MLPClassifier(activation='relu', solver='adam', alpha=0.0001)
"""
Parameters
---
hidden_layer_sizes: tuple giving the size of each hidden layer
activation:         activation function
solver:             optimization algorithm {'lbfgs', 'sgd', 'adam'}
alpha:              L2 penalty (regularization) parameter
"""
5. Model evaluation and selection
5.1 Cross-validation
from sklearn.model_selection import cross_val_score

cross_val_score(model, X, y=None, scoring=None, cv=None, n_jobs=1)
"""
Parameters
---
model:   the model used to fit the data
cv:      k in k-fold cross-validation
scoring: scoring metric - 'accuracy', 'f1', 'precision', 'recall', 'roc_auc',
         'neg_log_loss', etc.
"""
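For example (a minimal sketch on the iris data), 5-fold cross-validation returns one accuracy score per fold:

from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
model = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(model, iris.data, iris.target, cv=5, scoring='accuracy')
print(scores)          # 5 accuracy values, one per fold
print(scores.mean())   # average cross-validated accuracy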
5.2 Validation curve
Using the validation curve, we can conveniently see how model performance changes as one of the model's parameters is varied.
from sklearn.model_selection import validation_curve

train_score, test_score = validation_curve(model, X, y, param_name, param_range,
                                           cv=None, scoring=None, n_jobs=1)
"""
Parameters
---
model:       the object used to fit and predict
X, y:        features and labels of the training set
param_name:  name of the parameter being varied
param_range: range of values for that parameter
cv:          k in k-fold cross-validation

Returns
---
train_score: training set scores (array)
test_score:  validation set scores (array)
"""
Example
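The original example is not reproduced here; a minimal sketch, varying the SVM parameter gamma on the iris data, could look like this:

import numpy as np
from sklearn import datasets
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

iris = datasets.load_iris()
param_range = np.logspace(-6, -1, 5)
train_score, test_score = validation_curve(
    SVC(), iris.data, iris.target,
    param_name='gamma', param_range=param_range,
    cv=5, scoring='accuracy', n_jobs=1)
print(train_score.shape, test_score.shape)   # (5, 5): one row per gamma value, one column per fold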
6. Save the Model
Finally, we can save the trained model locally, or deploy it online for users. So how do we save a trained model? There are two main ways:
6.1 Save as a pickle file
import pickle

# save the model
with open('model.pickle', 'wb') as f:
    pickle.dump(model, f)

# load the model
with open('model.pickle', 'rb') as f:
    model = pickle.load(f)
model.predict(X_test)
6.2 sklearn's built-in method: joblib
from sklearn.externals import joblib   # in newer sklearn versions, use: import joblib

# save the model
joblib.dump(model, 'model.pickle')
# load the model
model = joblib.load('model.pickle')