Main modules and basic use of Scikit-learn


1. Loading data

Assuming the input is a feature matrix or a CSV file, the data must first be loaded into memory.

Scikit-learn operates on NumPy arrays, so use NumPy to load the CSV file. The data below is downloaded from the UCI Machine Learning Repository.

# Data loading
import numpy as np
from urllib.request import urlopen  # urllib.urlopen in Python 2

# URL of the Pima Indians diabetes dataset
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# download the file
raw_data = urlopen(url)
# load the CSV file as a NumPy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
# separate the features from the target attribute
X = dataset[:, 0:7]
y = dataset[:, 8]
2. Data normalization

Most machine learning algorithms based on gradient methods are sensitive to the scale of the features, so before running an algorithm the data should be normalized or standardized, for example by rescaling the features to a comparable range such as 0-1. Scikit-learn provides methods for both.

# Data normalization
from sklearn import preprocessing

# normalize: rescale each sample (row) to unit norm
normalized_X = preprocessing.normalize(X)
# standardize: shift/rescale each feature (column) to zero mean, unit variance
standardized_X = preprocessing.scale(X)
3. Feature selection

In solving a practical problem, choosing the right features, or being able to construct good features, is particularly important. This is called feature selection or feature engineering.
Feature selection is a creative process that relies heavily on intuition and domain expertise, but there are also many ready-made algorithms for it.
The following tree-based algorithm computes how informative each feature is:

# Feature selection
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(X, y)
# display the relative importance of each attribute
print(model.feature_importances_)

Results:

[ 0.12315529  0.25870914  0.11863867  0.08749797  0.08296516  0.1840623
  0.14497146]
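The ready-made selectors mentioned above include wrappers such as recursive feature elimination. A minimal sketch (my addition, not from the original article; the choice of a logistic-regression base model and of keeping 3 features is arbitrary):

# Recursive feature elimination: repeatedly drop the weakest feature
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe = RFE(LogisticRegression(), n_features_to_select=3)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # rank 1 marks a selected feature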

4. Using the algorithms
    • Logistic regression

Most problems can be reduced to a binary classification problem. An advantage of this algorithm is that it can also output the probability of each class for a data point.

# Logistic regression
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Results:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
             precision    recall  f1-score   support

        0.0       0.79      0.89      0.84       500
        1.0       0.74      0.55      0.63       268

avg / total       0.77      0.77      0.77       768

[[447  53]
 [120 148]]
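The class probabilities mentioned above come from the model's predict_proba method; a short follow-up sketch (my addition) using the model just fitted:

# class probabilities for each sample: one column per class
probabilities = model.predict_proba(X)
print(probabilities[:5])  # first five rows: P(class 0.0), P(class 1.0)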
    • Naive Bayes

This method estimates the distribution density of the training data, and it performs well on multi-class classification problems.

# Gaussian Naive Bayes
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Results:

GaussianNB()
             precision    recall  f1-score   support

        0.0       0.80      0.86      0.83       500
        1.0       0.69      0.60      0.64       268

avg / total       0.76      0.77      0.76       768

[[429  71]
 [108 160]]
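The "distribution density" the model recovers is visible in its fitted parameters; a small sketch (my addition; the attribute names are those of older scikit-learn releases, where sigma_ was later renamed var_):

print(model.class_prior_)  # estimated prior probability of each class
print(model.theta_)        # per-class mean of each feature
print(model.sigma_)        # per-class variance of each feature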
    • K Nearest Neighbor

The k-nearest neighbor algorithm is often used as a component of a larger method; for example, it can be used to evaluate features, which makes it useful inside feature selection (see the sketch after the results below).

# KNN
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Results:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
             precision    recall  f1-score   support

        0.0       0.82      0.90      0.86       500
        1.0       0.77      0.63      0.69       268

avg / total       0.80      0.80      0.80       768

[[448  52]
 [ 98 170]]
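To illustrate the feature-evaluation idea mentioned above, a hypothetical sketch (my addition; the column indices are arbitrary) that scores a candidate feature subset by KNN accuracy:

from sklearn.neighbors import KNeighborsClassifier

subset = X[:, [1, 5]]          # hypothetical candidate feature subset
knn = KNeighborsClassifier()
knn.fit(subset, y)
print(knn.score(subset, y))    # mean accuracy using only these features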
    • Decision Tree

Classification and Regression Trees (CART) are commonly used for classification or regression problems in which the features carry categorical information, and the method is well suited to multi-class problems.

# Decision tree
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Results:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       500
        1.0       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]
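Note that the perfect scores above come from predicting on the same data the tree was trained on; a tree with unlimited depth can memorize the training set. A minimal sketch (my addition) of the usual remedy, holding out a test set:

from sklearn.cross_validation import train_test_split  # sklearn.model_selection in scikit-learn >= 0.18
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on unseen data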
    • SVM

SVM (support vector machine) is a very popular machine learning algorithm, used mainly for classification problems. Like logistic regression, it can be extended to multi-class classification with a one-vs-rest approach (see the sketch after the results below).

# SVM
from sklearn import metrics
from sklearn.svm import SVC

model = SVC()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Results:

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       500
        1.0       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]
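The one-vs-rest strategy mentioned above can be made explicit with scikit-learn's multiclass wrapper; a minimal sketch (my addition; on this two-class dataset it fits just one binary SVM, but with k classes it would fit k of them):

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

ovr = OneVsRestClassifier(SVC())
ovr.fit(X, y)
print(len(ovr.estimators_))  # number of underlying binary classifiers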
5. How to optimize the algorithm parameters

A harder task is constructing an effective method for choosing the right parameters; this usually requires searching over candidate values, and Scikit-learn provides functions for exactly that.

The following example is a procedure for selecting the regularization parameter:

# Parameter selection
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in scikit-learn >= 0.18

# prepare a range of alpha values to test
alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)

Results:

GridSearchCV(cv=None, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
   max_iter=None, normalize=False, random_state=None, solver='auto',
   tol=0.001),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'alpha': array([  1.00000e+00,   1.00000e-01,   1.00000e-02,
         1.00000e-03,   1.00000e-04,   0.00000e+00])},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
0.282118955686
1.0

Sometimes it is just as effective to sample parameters at random from a given interval, evaluate the algorithm with each sampled value, and keep the best one.

# Randomized parameter search
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
from sklearn.grid_search import RandomizedSearchCV  # sklearn.model_selection in scikit-learn >= 0.18

# prepare a uniform distribution (over [0, 1)) to sample the alpha parameter from
param_grid = {'alpha': sp_rand()}
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(X, y)
print(rsearch)
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)

Results:

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
   max_iter=None, normalize=False, random_state=None, solver='auto',
   tol=0.001),
          fit_params={}, iid=True, n_iter=100, n_jobs=1,
          param_distributions={'alpha': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000000008739C18>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          scoring=None, verbose=0)
0.282118896925
0.997818886895
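As a usage note (my addition), the alpha found by either search can then be used to fit a final model:

# refit a Ridge model with the best alpha found by the search
best_alpha = rsearch.best_estimator_.alpha
final_model = Ridge(alpha=best_alpha)
final_model.fit(X, y)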
