Use of Sklearn

The general workflow of a traditional machine learning task, from start to finish, is: get the data, preprocess it, train a model, evaluate the model, and then make predictions.
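As a preview, here is a minimal sketch of that workflow on the built-in iris data (the estimator and its parameters are illustrative choices, not part of the original text):

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 1. Get data
iris = datasets.load_iris()
X, y = iris.data, iris.target

# 2./3. Split the data, then train a model (SVC is just one possible choice)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = SVC(C=1.0, kernel='rbf', gamma='auto')
model.fit(X_train, y_train)

# 4. Evaluate on the held-out data and predict
print(model.score(X_test, y_test))
print(model.predict(X_test[:5]))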

1. Get Data

1.1 Import the Sklearn datasets

Sklearn ships with a number of high-quality datasets. While learning machine learning, you can use them to build different models, which improves your hands-on skills and deepens your grasp of the theory. (I still need to practice this step myself, so let's keep at it together! ^-^)

First of all, to use a dataset from Sklearn, you must import the datasets module:

from sklearn import datasets

This module contains most of Sklearn's built-in datasets. Here we take the iris data as an example:

iris = datasets.load_iris()  # import the dataset
X = iris.data                # get the feature vectors
y = iris.target              # get the sample labels
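For reference, the iris data has 150 samples, 4 features, and 3 classes, which you can confirm with:

print(X.shape)  # (150, 4)
print(y.shape)  # (150,)
print(set(y))   # {0, 1, 2}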
1.2 Creating datasets

In addition to using Sklearn's built-in datasets, you can also create your own training samples; see the dataset loading utilities for details. Here is a brief introduction: Sklearn's samples generator contains a number of functions for creating sample data.

Let's take an example of a sample generator for classification problems:

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=6, n_features=5, n_informative=2,
    n_redundant=2, n_classes=2, n_clusters_per_class=2, scale=1.0,
    random_state=20)

# n_samples: number of samples
# n_features: number of features
# n_classes: number of classes
# random_state: random seed, which makes the experiment reproducible

>>> for x_, y_ in zip(X, y):
...     print(y_, end=': ')
...     print(x_)
0: [-0.6600737  -0.0558978   0.82286793  1.1003977  -0.93493796]
1: [ 0.4113583   0.06249216 -0.90760075 -1.41296696  2.059838  ]
1: [ 1.52452016 -0.01867812  0.20900899  1.34422289 -1.61299022]
0: [-1.25725859  0.02347952 -0.28764782 -1.32091378 -0.88549315]
0: [-3.28323172  0.03899168 -0.43251277 -2.86249859 -1.10457948]
1: [ 1.68841011  0.06754955 -1.02805579 -0.83132182  0.93286635]
2. Data preprocessing

Data preprocessing is an indispensable part of machine learning; it puts the data into a form that the model or estimator can work with more effectively. Let's take a look at some of the functions commonly used in Sklearn:

from sklearn import preprocessing
2.1 Normalization of data

So that the test data can be standardized with the same rules learned from the training data, preprocessing provides several scalers:

train_data = [[0, 0], [0, 0], [1, 1], [1, 1]]
test_data = [[2, 2]]  # illustrative test samples, added here so the snippet runs

# 1. Standardization based on mean and std
scaler = preprocessing.StandardScaler().fit(train_data)
scaler.transform(train_data)
scaler.transform(test_data)

# 2. Scale each feature to a fixed range
scaler = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit(train_data)
scaler.transform(train_data)
scaler.transform(test_data)
# feature_range: the target range, given as a tuple
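As a quick sanity check, the fitted scaler exposes the statistics it learned (mean_ and scale_ are standard StandardScaler attributes):

scaler = preprocessing.StandardScaler().fit(train_data)
print(scaler.mean_)   # per-feature mean of the training data: [0.5 0.5]
print(scaler.scale_)  # per-feature standard deviation: [0.5 0.5]
print(scaler.transform(train_data))
# each value becomes (x - mean) / std, e.g. (0 - 0.5) / 0.5 = -1.0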
2.2 Normalization (normalize)

When you want to compute the similarity of two samples, one necessary operation is normalization. The idea is to first compute the p-norm of each sample and then divide every element of the sample by that norm, so that each sample ends up with a norm of 1.

>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X_normalized = preprocessing.normalize(X, norm='l2')
>>> X_normalized
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])
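To see where the first row comes from: its L2 norm is sqrt(1 + 1 + 4) = sqrt(6) ≈ 2.449, and dividing each element by that norm gives roughly [0.41, -0.41, 0.82], matching the output above. A minimal manual check (the use of NumPy here is an assumption, not part of the original snippet):

import numpy as np

row = np.array([1., -1., 2.])
norm = np.linalg.norm(row)  # L2 norm: sqrt(6) ≈ 2.449
print(row / norm)           # [ 0.408... -0.408...  0.816...]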
2.3 One-hot Encoding

One-hot encoding is a way of encoding discrete (categorical) feature values. It is commonly used with LR (logistic regression) models to give a linear model some nonlinear capability.

data = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
encoder = preprocessing.OneHotEncoder().fit(data)
encoder.transform(data).toarray()
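The fitted encoder learns the categories present in each column and expands every column into one indicator column per category. For example (categories_ is the attribute name in recent scikit-learn versions, so treat it as version-dependent):

print(encoder.categories_)
# [array([0, 1]), array([0, 1, 2]), array([0, 1, 2, 3])]
print(encoder.transform([[0, 1, 3]]).toarray())
# [[1. 0. 0. 1. 0. 0. 0. 0. 1.]]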
3. Data set splitting

Once we have a dataset, we often split it further into a training set and a validation/test set, which helps us select model parameters.

# Function: split the dataset into a training set and a test set
# Signature: train_test_split(*arrays, **options)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
"""Parameters
---
arrays: sample arrays containing the feature vectors and labels
test_size: float - proportion of test samples (default: 0.25)
           int - number of test samples
train_size: same as test_size
random_state: int - random seed (fixing the seed makes the experiment reproducible)
shuffle: whether to shuffle the data before splitting (default: True)

Returns
---
the split lists, of length 2 * len(arrays) (train-test split)
"""
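For classification tasks it is often worth passing stratify=y as well (a standard train_test_split parameter), so that the class proportions are preserved in both splits:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)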
4. Defining the Model

In this step we first analyze the type of data we have and decide which model to use, and then define that model in Sklearn. Sklearn provides a very similar interface for all models, which makes it easy to become familiar with all of them. Before going model by model, let's look at the attributes and methods they share:

# Fit the model
model.fit(X_train, y_train)
# Predict with the model
model.predict(X_test)
# Get the parameters of the model
model.get_params()
# Score the model
model.score(data_X, data_y)  # regression: R squared; classification: accuracy
4.1 Linear regression
from sklearn.linear_model import LinearRegression

# Define the linear regression model
model = LinearRegression(fit_intercept=True, normalize=False,
    copy_X=True, n_jobs=1)
"""Parameters
---
fit_intercept: whether to compute the intercept. False - the model has no intercept
normalize: ignored when fit_intercept is set to False. If True, the regressors X are
           normalized before regression by subtracting the mean and dividing by the l2-norm.
n_jobs: number of threads
"""

4.2 Logistic regression LR
from sklearn.linear_model import LogisticRegression

# Define the logistic regression model
model = LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0,
    fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None,
    solver='liblinear', max_iter=100, multi_class='ovr',
    verbose=0, warm_start=False, n_jobs=1)
"""Parameters
---
penalty: the regularization term to use (default: l2)
dual: use the dual formulation; False (default) when n_samples > n_features
C: inverse of the regularization strength; the smaller the value, the stronger the regularization
n_jobs: number of threads
random_state: random number generator seed
fit_intercept: whether a constant (intercept) is required
"""
4.3 Naive Bayesian algorithm NB
from sklearn import naive_bayes

model = naive_bayes.GaussianNB()  # Gaussian naive Bayes
model = naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
model = naive_bayes.BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None)
"""MultinomialNB is common for text classification problems.

Parameters
---
alpha: smoothing parameter
fit_prior: whether to learn class prior probabilities; False - use a uniform prior
class_prior: prior probabilities of the classes; if specified, the priors are not adjusted from the data
binarize: threshold for binarizing sample features; if None, the input is assumed to already consist of binary vectors
"""
4.4 Decision Tree DT
from sklearn import tree

model = tree.DecisionTreeClassifier(criterion='gini', max_depth=None,
    min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
    max_features=None, random_state=None, max_leaf_nodes=None,
    min_impurity_decrease=0.0, min_impurity_split=None,
    class_weight=None, presort=False)
"""Parameters
---
criterion: feature selection criterion, gini/entropy
max_depth: maximum depth of the tree; None - grow as deep as possible
min_samples_split: minimum number of samples required to split an internal node
min_samples_leaf: minimum number of samples required at a leaf node
max_features: maximum number of features considered when searching for the best split
max_leaf_nodes: grow a tree with at most this many leaf nodes, best-first
min_impurity_decrease: a node is split if the split decreases the impurity by at least this value
"""
4.5 Support Vector Machine SVM
from sklearn.svm import SVC

model = SVC(C=1.0, kernel='rbf', gamma='auto')
"""Parameters
---
C: penalty parameter of the error term
gamma: kernel coefficient (float); if gamma is 'auto', then 1/n_features will be used instead
"""
4.6 k Nearest Neighbor algorithm KNN
from sklearn import neighbors

# Define the KNN models
model = neighbors.KNeighborsClassifier(n_neighbors=5, n_jobs=1)  # classification
model = neighbors.KNeighborsRegressor(n_neighbors=5, n_jobs=1)   # regression
"""Parameters
---
n_neighbors: number of neighbors
n_jobs: number of parallel jobs
"""
4.7 Multi-layer perceptron (neural network)
from sklearn.neural_network import MLPClassifier

# Define the multi-layer perceptron classifier
model = MLPClassifier(activation='relu', solver='adam', alpha=0.0001)
"""Parameters
---
hidden_layer_sizes: tuple of hidden layer sizes
activation: activation function
solver: optimization algorithm {'lbfgs', 'sgd', 'adam'}
alpha: L2 penalty (regularization term) parameter
"""
5. Model evaluation and selection

5.1 Cross-validation
from sklearn.model_selection import cross_val_score

cross_val_score(model, X, y=None, scoring=None, cv=None, n_jobs=1)
"""Parameters
---
model: the model used to fit the data
cv: k-fold
scoring: scoring metric - 'accuracy', 'f1', 'precision', 'recall', 'roc_auc', 'neg_log_loss', etc.
"""
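For example, a minimal sketch (the estimator and fold count are illustrative choices) that 5-fold cross-validates an SVC on the iris data:

from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

iris = datasets.load_iris()
scores = cross_val_score(SVC(C=1.0, kernel='rbf', gamma='auto'),
                         iris.data, iris.target, cv=5, scoring='accuracy')
print(scores)         # one accuracy value per fold
print(scores.mean())  # average cross-validated accuracy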
5.2 Validation curve

With a validation curve we can conveniently see how the model's performance changes as one of its parameters is varied.

from sklearn.model_selection import validation_curve

train_score, test_score = validation_curve(model, X, y, param_name, param_range,
                                            cv=None, scoring=None, n_jobs=1)
"""Parameters
---
model: the object used to fit and predict
X, y: features and labels of the training set
param_name: name of the parameter being varied
param_range: range of values for that parameter
cv: k-fold

Returns
---
train_score: training set scores (array)
test_score: validation set scores (array)
"""

Example
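A minimal sketch of how validation_curve might be used (the estimator, parameter, and range below are illustrative choices, not part of the original text):

import numpy as np
from sklearn import datasets
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

iris = datasets.load_iris()
param_range = np.logspace(-6, -1, 5)
train_score, test_score = validation_curve(SVC(), iris.data, iris.target,
                                            param_name='gamma',
                                            param_range=param_range,
                                            cv=5, scoring='accuracy')
print(train_score.mean(axis=1))  # mean training accuracy per gamma value
print(test_score.mean(axis=1))   # mean validation accuracy per gamma value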

6. Save the Model

Finally, we can save the trained model locally, or deploy it online for users to use. So how do we save a trained model? There are two main ways:

6.1 Save As Pickle file
import pickle

# Save the model
with open('model.pickle', 'wb') as f:
    pickle.dump(model, f)

# Load the model
with open('model.pickle', 'rb') as f:
    model = pickle.load(f)
model.predict(X_test)
6.2 Sklearn's built-in method: joblib
from sklearn.externals import joblib

# Save the model
joblib.dump(model, 'model.pickle')

# Load the model
model = joblib.load('model.pickle')
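Note that recent scikit-learn releases have removed the sklearn.externals.joblib import, so with a modern installation you would install and import joblib directly:

import joblib

joblib.dump(model, 'model.pickle')
model = joblib.load('model.pickle')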
