1. Load data (Loading)
Assuming the input is a feature matrix or CSV file, the data is first loaded into memory.
Scikit-learn works on NumPy arrays, so we use NumPy to load the CSV file.
The data below is the Pima Indians Diabetes dataset, downloaded from the UCI Machine Learning Repository.
# Data loading
import numpy as np
import urllib
# URL for the dataset
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# download the file (Python 2; on Python 3 use urllib.request.urlopen)
raw_data = urllib.urlopen(url)
# load the CSV file as a numpy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
# separate the data from the target attribute
X = dataset[:, 0:7]
y = dataset[:, 8]
2. Normalization of data (normalization)
Most machine learning algorithms based on gradient methods are sensitive to the scale of the features, so before running an algorithm we should normalize or standardize the data, for example scaling each feature to the 0-1 range. Scikit-learn provides functions for this kind of preprocessing.
# Data normalization
from sklearn import preprocessing
# normalize the data attributes (each sample scaled to unit norm)
normalized_X = preprocessing.normalize(X)
# standardize the data attributes (zero mean, unit variance per feature)
standardized_X = preprocessing.scale(X)
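Note that normalize() rescales each sample (row) to unit norm, while scale() standardizes each feature. If you specifically want the 0-1 feature range mentioned above, MinMaxScaler is the usual tool; the sketch below is a minimal illustration (the variable name rescaled_X is chosen just for this example):

# Rescale each feature to the 0-1 range (feature_range defaults to (0, 1))
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
rescaled_X = scaler.fit_transform(X)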
3. Feature Selection (Feature Selection)
When solving a practical problem, choosing the right features, or constructing new ones, is particularly important. This is called feature selection or feature engineering.
Feature selection is a creative process that relies heavily on intuition and domain expertise, but there are also many ready-made algorithms for it.
The following example uses a tree-based algorithm (ExtraTreesClassifier) to compute the informativeness (importance) of each feature:
# Feature selection
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X, y)
# display the relative importance of each attribute
print(model.feature_importances_)
Results:
[ 0.12315529  0.25870914  0.11863867  0.08749797  0.08296516  0.1840623   0.14497146]
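Tree-based importances are only one of the ready-made options mentioned above. As another sketch, recursive feature elimination (RFE) from sklearn.feature_selection repeatedly fits an estimator and drops the weakest features; keeping 3 features here is an arbitrary choice for illustration:

# Recursive feature elimination with a logistic regression estimator
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
rfe = RFE(LogisticRegression(), n_features_to_select=3)  # 3 is arbitrary, for illustration
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # feature ranking (1 = selected)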
4. Use of the algorithms
Most problems can be reduced to a binary classification problem. An advantage of logistic regression is that it can also output the probability that a sample belongs to each class.
# Logistic regression
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
Results:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
             precision    recall  f1-score   support

        0.0       0.79      0.89      0.84       500
        1.0       0.74      0.55      0.63       268

avg / total       0.77      0.77      0.77       768

[[447  53]
 [120 148]]
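To make the probability output mentioned above concrete, the fitted model exposes predict_proba; this minimal sketch reuses the model and X from the block above:

# Class-membership probabilities for each sample
probabilities = model.predict_proba(X)
print(model.classes_)     # column order of the probability matrix
print(probabilities[:5])  # probabilities for the first five samples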
Naive Bayes. The task of this method is to estimate the density (distribution) of the training data for each class, and it works well for multi-class classification.
# Gaussian Naive Bayes
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
Results:
GaussianNB()
             precision    recall  f1-score   support

        0.0       0.80      0.86      0.83       500
        1.0       0.69      0.60      0.64       268

avg / total       0.76      0.77      0.76       768

[[429  71]
 [108 160]]
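To see the density estimate GaussianNB has learned, you can inspect the per-class Gaussian parameters of the fitted model from the block above. The attribute names below match the older scikit-learn release this article appears to use (sigma_ was later renamed var_):

# Per-class Gaussian parameters, shape (n_classes, n_features)
print(model.theta_)   # per-class feature means
print(model.sigma_)   # per-class feature variances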
The k-nearest neighbors (KNN) algorithm is often used as part of a larger classification scheme; for example, it can be used to score features, so it can also play a role in feature selection.
# KNN
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
Results:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
             precision    recall  f1-score   support

        0.0       0.82      0.90      0.86       500
        1.0       0.77      0.63      0.69       268

avg / total       0.80      0.80      0.80       768

[[448  52]
 [ 98 170]]
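As a rough sketch of the feature-evaluation idea mentioned above, one can cross-validate a KNN model on each feature in isolation and compare the scores. This assumes the older sklearn.cross_validation module, matching the sklearn.grid_search module used later in this article; on newer releases both live in sklearn.model_selection:

# Score each single feature with a KNN classifier under 5-fold cross-validation
from sklearn.cross_validation import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
for i in range(X.shape[1]):
    scores = cross_val_score(knn, X[:, [i]], y, cv=5)
    print(i, scores.mean())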
Classification and Regression Trees (CART) are commonly used for classification or regression problems where the features carry categorical information; they are well suited to multi-class problems.
# Decision tree (CART)
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
Results:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False,
            random_state=None, splitter='best')
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       500
        1.0       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]
The SVM (support vector machine) is a very popular machine learning algorithm used mainly for classification problems. Like logistic regression, it can handle multi-class classification via a one-vs-rest approach.
# SVM
from sklearn import metrics
from sklearn.svm import SVC
model = SVC()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
Results:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       500
        1.0       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]
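The Pima data used here is binary, so the one-vs-rest behaviour does not show up in the output. The following sketch wraps an SVC in OneVsRestClassifier on the bundled three-class iris dataset, purely to illustrate the approach:

# Explicit one-vs-rest multi-class classification with an SVM
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
iris = load_iris()
ovr = OneVsRestClassifier(SVC())
ovr.fit(iris.data, iris.target)
print(ovr.predict(iris.data[:5]))  # predicted classes for the first five samples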
5. How to optimize the algorithm parameters
A harder task is choosing the right parameters for an algorithm; this generally requires searching over the parameter space. Scikit-learn provides functions for exactly this purpose.
The example below shows how to select the regularization parameter (alpha) of ridge regression with a grid search:
# Parameter selection with a grid search
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer releases
# prepare a range of alpha values to test
alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)
Results:
GridSearchCV(cv=None, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'alpha': array([  1.00000e+00,   1.00000e-01,   1.00000e-02,   1.00000e-03,
         1.00000e-04,   0.00000e+00])},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
0.282118955686
1.0
Sometimes it is more effective to sample parameter values at random from a given interval, evaluate the algorithm with each sampled value, and keep the best one.
# Parameter selection with a randomized search
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
from sklearn.grid_search import RandomizedSearchCV  # sklearn.model_selection in newer releases
# prepare a uniform distribution to sample the alpha parameter from
param_grid = {'alpha': sp_rand()}
# create and fit a ridge regression model, testing random alpha values
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(X, y)
print(rsearch)
# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)
Results:
RandomizedSearchCV(cv=None, error_score='raise',
          estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
          fit_params={}, iid=True, n_iter=100, n_jobs=1,
          param_distributions={'alpha': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000000008739c18>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          scoring=None, verbose=0)
0.282118896925
0.997818886895
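Once the search has finished, the chosen alpha can be plugged back into a Ridge model. With refit=True (the default) the search object already exposes a refitted estimator as best_estimator_, so the explicit refit below is only to make the step visible; best_alpha and final_model are illustrative names:

# Refit a ridge regression model with the alpha found by the randomized search
best_alpha = rsearch.best_estimator_.alpha
final_model = Ridge(alpha=best_alpha)
final_model.fit(X, y)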
Main modules and basic use of Scikit-learn