Main modules and basic use of Scikit-learn


1. Loading data

Assuming the input is a feature matrix or a CSV file, the data must first be loaded into memory.

Scikit-learn operates on NumPy arrays, so use NumPy to load the CSV file. The data below is downloaded from the UCI Machine Learning Repository.

# Data loading
import numpy as np
from urllib.request import urlopen  # urllib.urlopen in Python 2

# URL of the Pima Indians diabetes dataset
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# download the file
raw_data = urlopen(url)
# load the CSV file as a NumPy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
# separate the features from the target attribute
X = dataset[:, 0:7]
y = dataset[:, 8]
2. Data normalization

Most machine learning algorithms based on gradient methods are sensitive to the scale of the features, so before running an algorithm the data should be normalized or standardized, for example by rescaling the features to a comparable range such as 0-1. Scikit-learn provides methods for both.

# Data normalization
from sklearn import preprocessing

# normalize: rescale each sample (row) to unit norm
normalized_X = preprocessing.normalize(X)
# standardize: shift/rescale each feature (column) to zero mean, unit variance
standardized_X = preprocessing.scale(X)
3. Feature selection

In solving a practical problem, choosing the right features, or being able to construct good features, is particularly important. This is called feature selection or feature engineering.
Feature selection is a creative process that relies heavily on intuition and domain expertise, but there are also many ready-made algorithms for it.
The following tree-based algorithm computes how informative each feature is:

# Feature selection
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(X, y)
# display the relative importance of each attribute
print(model.feature_importances_)

Results:

[ 0.12315529  0.25870914  0.11863867  0.08749797  0.08296516  0.1840623
  0.14497146]
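The ready-made selectors mentioned above include wrappers such as recursive feature elimination. A minimal sketch (my addition, not from the original article; the choice of a logistic-regression base model and of keeping 3 features is arbitrary):

# Recursive feature elimination: repeatedly drop the weakest feature
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe = RFE(LogisticRegression(), n_features_to_select=3)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # rank 1 marks a selected feature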

4. Using the algorithms
    • Logistic regression

Most problems can be reduced to a binary classification problem. An advantage of this algorithm is that it can also output the probability of each class for a data point.

# Logistic regression
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Results:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
             precision    recall  f1-score   support

        0.0       0.79      0.89      0.84       500
        1.0       0.74      0.55      0.63       268

avg / total       0.77      0.77      0.77       768

[[447  53]
 [120 148]]
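The class probabilities mentioned above come from the model's predict_proba method; a short follow-up sketch (my addition) using the model just fitted:

# class probabilities for each sample: one column per class
probabilities = model.predict_proba(X)
print(probabilities[:5])  # first five rows: P(class 0.0), P(class 1.0)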
    • Naive Bayes

This method estimates the distribution density of the training data, and it performs well on multi-class classification problems.

# Gaussian Naive Bayes
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Results:

GaussianNB()
             precision    recall  f1-score   support

        0.0       0.80      0.86      0.83       500
        1.0       0.69      0.60      0.64       268

avg / total       0.76      0.77      0.76       768

[[429  71]
 [108 160]]
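The "distribution density" the model recovers is visible in its fitted parameters; a small sketch (my addition; the attribute names are those of older scikit-learn releases, where sigma_ was later renamed var_):

print(model.class_prior_)  # estimated prior probability of each class
print(model.theta_)        # per-class mean of each feature
print(model.sigma_)        # per-class variance of each feature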
    • K Nearest Neighbor

The k-nearest neighbor algorithm is often used as a component of a larger method; for example, it can be used to evaluate features, which makes it useful inside feature selection (see the sketch after the results below).

# KNN
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Results:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
             precision    recall  f1-score   support

        0.0       0.82      0.90      0.86       500
        1.0       0.77      0.63      0.69       268

avg / total       0.80      0.80      0.80       768

[[448  52]
 [ 98 170]]
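To illustrate the feature-evaluation idea mentioned above, a hypothetical sketch (my addition; the column indices are arbitrary) that scores a candidate feature subset by KNN accuracy:

from sklearn.neighbors import KNeighborsClassifier

subset = X[:, [1, 5]]          # hypothetical candidate feature subset
knn = KNeighborsClassifier()
knn.fit(subset, y)
print(knn.score(subset, y))    # mean accuracy using only these features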
    • Decision Tree

Classification and Regression Trees (CART) are commonly used for classification or regression problems in which the features carry categorical information, and the method is well suited to multi-class problems.

# Decision tree
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Results:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       500
        1.0       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]
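Note that the perfect scores above come from predicting on the same data the tree was trained on; a tree with unlimited depth can memorize the training set. A minimal sketch (my addition) of the usual remedy, holding out a test set:

from sklearn.cross_validation import train_test_split  # sklearn.model_selection in scikit-learn >= 0.18
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on unseen data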
    • SVM

SVM (support vector machine) is a very popular machine learning algorithm, used mainly for classification problems. Like logistic regression, it can be extended to multi-class classification with a one-vs-rest approach (see the sketch after the results below).

# SVM
from sklearn import metrics
from sklearn.svm import SVC

model = SVC()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Results:

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       500
        1.0       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]
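The one-vs-rest strategy mentioned above can be made explicit with scikit-learn's multiclass wrapper; a minimal sketch (my addition; on this two-class dataset it fits just one binary SVM, but with k classes it would fit k of them):

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

ovr = OneVsRestClassifier(SVC())
ovr.fit(X, y)
print(len(ovr.estimators_))  # number of underlying binary classifiers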
5. How to optimize the algorithm parameters

A harder task is constructing an effective method for choosing the right parameters; this usually requires searching over candidate values, and Scikit-learn provides functions for exactly that.

The following example is a procedure for selecting the regularization parameter:

# Parameter selection
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in scikit-learn >= 0.18

# prepare a range of alpha values to test
alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)

Results:

GridSearchCV(cv=None, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
   max_iter=None, normalize=False, random_state=None, solver='auto',
   tol=0.001),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'alpha': array([  1.00000e+00,   1.00000e-01,   1.00000e-02,
         1.00000e-03,   1.00000e-04,   0.00000e+00])},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
0.282118955686
1.0

Sometimes it is just as effective to sample parameters at random from a given interval, evaluate the algorithm with each sampled value, and keep the best one.

# Randomized parameter search
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
from sklearn.grid_search import RandomizedSearchCV  # sklearn.model_selection in scikit-learn >= 0.18

# prepare a uniform distribution (over [0, 1)) to sample the alpha parameter from
param_grid = {'alpha': sp_rand()}
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(X, y)
print(rsearch)
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)

Results:

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
   max_iter=None, normalize=False, random_state=None, solver='auto',
   tol=0.001),
          fit_params={}, iid=True, n_iter=100, n_jobs=1,
          param_distributions={'alpha': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000000008739C18>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          scoring=None, verbose=0)
0.282118896925
0.997818886895
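As a usage note (my addition), the alpha found by either search can then be used to fit a final model:

# refit a Ridge model with the best alpha found by the search
best_alpha = rsearch.best_estimator_.alpha
final_model = Ridge(alpha=best_alpha)
final_model.fit(X, y)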
