Original: http://blog.csdn.net/zouxy09/article/details/48903179
I. Overview
Machine learning algorithms have become household names in recent years, fueled by the big data craze; even people who know nothing about the underlying theory can name one or two famous algorithms off the top of their heads. Of course, although the forest of algorithms is vast, the useful ones are limited in number: the algorithms that adapt to many situations and achieve good results stand out, while the mediocre performers are forgotten by history. With the development of the machine learning community and validation in practice, this group of winners has gradually been recognized and favored, gaining more community support, improvement, and promotion.
Taking the most widely used task, classification, as an example, algorithms can be divided into two major camps: linear and nonlinear. Famous linear algorithms include logistic regression, naive Bayes, and maximum entropy; nonlinear algorithms include random forests, decision trees, neural networks, and kernel machines. The banner of the linear camp is high efficiency in both training and prediction, but the final result depends heavily on the features: the data must be linearly separable in the feature space. Using a linear algorithm therefore requires a lot of feature engineering work, selecting, transforming, or combining features so that the classes become distinguishable. Nonlinear algorithms have the advantage that they can model complex classification surfaces and thus fit the data better.
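To make the feature-engineering point concrete, here is a minimal sketch (my own illustration, not from the original post): XOR-style data is not linearly separable in its raw features, but adding the interaction feature x1*x2 lets a plain logistic regression fit it.

# Illustration (assumed example, not from the original post): a linear model fails
# on raw XOR features but succeeds once an interaction feature is engineered.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 25, dtype=float)
y = np.logical_xor(X[:, 0] > 0.5, X[:, 1] > 0.5).astype(int)

raw = LogisticRegression().fit(X, y)
print('raw features accuracy: %.2f' % raw.score(X, y))        # ~0.5: no linear boundary exists

X_eng = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])          # add interaction feature x1*x2
eng = LogisticRegression().fit(X_eng, y)
print('engineered features accuracy: %.2f' % eng.score(X_eng, y))  # should reach 1.0: now separable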
Which machine learning algorithm gives the best results on the features we have chosen? Nobody knows in advance; practice is the only way to find out. Is it hard to write code for five or six machine learning algorithms? No; the machine learning community is strong, and the consensus among programmers is not to reinvent the wheel! For the more mature algorithms there are always excellent libraries that can be used directly, saving most of the development time.
Python is very widely used at the moment, and the best-known machine learning library in the Python world is scikit-learn. This library has many advantages: it is easy to use, its interface abstraction is very good, and its documentation is genuinely impressive. In this article we wrap a number of its machine learning algorithms so they can all be tested in one pass, which makes analysis and comparison convenient. Of course, for any specific algorithm, hyperparameter tuning is also very important.
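As a taste of that interface abstraction, here is a minimal sketch (my own illustration, not from the original post): every scikit-learn estimator is driven through the same fit/predict/score calls, shown here on the small digits dataset that ships with the library.

# Sketch of scikit-learn's uniform estimator interface: every classifier is
# constructed, trained with fit(X, y), and evaluated with score(X, y).
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()                      # small built-in digit dataset
X, y = digits.data, digits.target

for model in [KNeighborsClassifier(), DecisionTreeClassifier()]:
    model.fit(X[:1000], y[:1000])           # the same fit() call for every estimator
    print('%s: %.2f' % (model.__class__.__name__, model.score(X[1000:], y[1000:])))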
II. Scikit-learn Python Practice
2.1 Preparing the Python Environment
One of Python's most popular traits is its strong community support: there are very many excellent libraries and modules. But dependencies sometimes exist between libraries, so installing them can be a tedious process. Some people cannot stand this tedium and have developed many automated tools to save everyone's time. In my personal summary, there are three ways to install a Python library:
1) Anaconda
This is a very complete Python distribution; the latest version offers up to 195 popular Python packages, including the scientific computing packages we commonly use such as NumPy and SciPy. With it, you never have to worry about installing dependent packages one after another. With Anaconda in hand, everything is easy! Download it here: http://www.continuum.io/downloads
2) Pip
Anyone who has used Ubuntu knows the love people have for apt-get. The Python equivalent for downloading and installing libraries is the pip tool: whatever library you need, it downloads and installs it as one-stop service. Download and install pip from https://pypi.python.org/pypi/pip. After that, whenever you need a library, just run pip install xx.
3) Source Package
If the two methods above cannot find your library, download the library's source code directly and unzip it; the directory will contain a setup.py file. Run python setup.py install to install the library into Python's default library directory.
2.2 Testing Scikit-learn
Scikit-learn is already included in Anaconda; it can also be installed from the source package on the official site. The code below wraps the following machine learning algorithms; modify the data loading function and you can test them all with one click (a sketch of a replacement loader follows the full listing):
classifiers = {'NB': naive_bayes_classifier,
               'KNN': knn_classifier,
               'LR': logistic_regression_classifier,
               'RF': random_forest_classifier,
               'DT': decision_tree_classifier,
               'SVM': svm_classifier,
               'SVMCV': svm_cross_validation,
               'GBDT': gradient_boosting_classifier}
train_test.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import os
import time
from sklearn import metrics
import numpy as np
import cPickle as pickle

reload(sys)
sys.setdefaultencoding('utf8')


# Multinomial Naive Bayes Classifier
def naive_bayes_classifier(train_x, train_y):
    from sklearn.naive_bayes import MultinomialNB
    model = MultinomialNB(alpha=0.01)
    model.fit(train_x, train_y)
    return model


# KNN Classifier
def knn_classifier(train_x, train_y):
    from sklearn.neighbors import KNeighborsClassifier
    model = KNeighborsClassifier()
    model.fit(train_x, train_y)
    return model


# Logistic Regression Classifier
def logistic_regression_classifier(train_x, train_y):
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression(penalty='l2')
    model.fit(train_x, train_y)
    return model


# Random Forest Classifier
def random_forest_classifier(train_x, train_y):
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier(n_estimators=8)
    model.fit(train_x, train_y)
    return model


# Decision Tree Classifier
def decision_tree_classifier(train_x, train_y):
    from sklearn import tree
    model = tree.DecisionTreeClassifier()
    model.fit(train_x, train_y)
    return model


# GBDT (Gradient Boosting Decision Tree) Classifier
def gradient_boosting_classifier(train_x, train_y):
    from sklearn.ensemble import GradientBoostingClassifier
    model = GradientBoostingClassifier(n_estimators=200)
    model.fit(train_x, train_y)
    return model


# SVM Classifier
def svm_classifier(train_x, train_y):
    from sklearn.svm import SVC
    model = SVC(kernel='rbf', probability=True)
    model.fit(train_x, train_y)
    return model


# SVM Classifier using cross validation
def svm_cross_validation(train_x, train_y):
    from sklearn.grid_search import GridSearchCV
    from sklearn.svm import SVC
    model = SVC(kernel='rbf', probability=True)
    param_grid = {'C': [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000], 'gamma': [0.001, 0.0001]}
    grid_search = GridSearchCV(model, param_grid, n_jobs=1, verbose=1)
    grid_search.fit(train_x, train_y)
    best_parameters = grid_search.best_estimator_.get_params()
    for para, val in best_parameters.items():
        print para, val
    model = SVC(kernel='rbf', C=best_parameters['C'], gamma=best_parameters['gamma'], probability=True)
    model.fit(train_x, train_y)
    return model


def read_data(data_file):
    import gzip
    f = gzip.open(data_file, "rb")
    train, val, test = pickle.load(f)
    f.close()
    train_x = train[0]
    train_y = train[1]
    test_x = test[0]
    test_y = test[1]
    return train_x, train_y, test_x, test_y


if __name__ == '__main__':
    data_file = "mnist.pkl.gz"
    thresh = 0.5
    model_save_file = None
    model_save = {}

    test_classifiers = ['NB', 'KNN', 'LR', 'RF', 'DT', 'SVM', 'GBDT']
    classifiers = {'NB': naive_bayes_classifier,
                   'KNN': knn_classifier,
                   'LR': logistic_regression_classifier,
                   'RF': random_forest_classifier,
                   'DT': decision_tree_classifier,
                   'SVM': svm_classifier,
                   'SVMCV': svm_cross_validation,
                   'GBDT': gradient_boosting_classifier}

    print 'reading training and testing data...'
    train_x, train_y, test_x, test_y = read_data(data_file)
    num_train, num_feat = train_x.shape
    num_test, num_feat = test_x.shape
    is_binary_class = (len(np.unique(train_y)) == 2)
    print '******************** Data Info *********************'
    print '#training data: %d, #testing_data: %d, dimension: %d' % (num_train, num_test, num_feat)

    for classifier in test_classifiers:
        print '******************* %s ********************' % classifier
        start_time = time.time()
        model = classifiers[classifier](train_x, train_y)
        print 'training took %fs!' % (time.time() - start_time)
        predict = model.predict(test_x)
        if model_save_file != None:
            model_save[classifier] = model
        if is_binary_class:
            precision = metrics.precision_score(test_y, predict)
            recall = metrics.recall_score(test_y, predict)
            print 'precision: %.2f%%, recall: %.2f%%' % (100 * precision, 100 * recall)
        accuracy = metrics.accuracy_score(test_y, predict)
        print 'accuracy: %.2f%%' % (100 * accuracy)

    if model_save_file != None:
        pickle.dump(model_save, open(model_save_file, 'wb'))
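As mentioned in section 2.2, adapting the script to your own data only requires swapping out read_data(). Below is a hypothetical sketch (not part of the original post) for a plain CSV file; the column layout (label first, features after) and the 80/20 split are assumptions made for illustration.

# Hypothetical drop-in replacement for read_data() above: loads a CSV whose first
# column is the label and remaining columns are features, then splits 80/20 into
# train and test. Adjust the delimiter and column layout to match your own file.
import numpy as np

def read_data(data_file):
    data = np.loadtxt(data_file, delimiter=',')
    x, y = data[:, 1:], data[:, 0]
    n_train = int(0.8 * len(data))
    return x[:n_train], y[:n_train], x[n_train:], y[n_train:]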
IV. Test Results
This experiment uses the MNIST handwritten digit dataset: http://deeplearning.net/data/mnist/mnist.pkl.gz. It contains 50,000 training samples and 10,000 test samples.
Running the code produces the following output:
reading training and testing data...
******************** Data Info *********************
#training data: 50000, #testing_data: 10000, dimension: 784
******************* NB ********************
training took 0.287000s!
accuracy: 83.69%
******************* KNN ********************
training took 31.991000s!
accuracy: 96.64%
******************* LR ********************
training took 101.282000s!
accuracy: 91.99%
******************* RF ********************
training took 5.442000s!
accuracy: 93.78%
******************* DT ********************
training took 28.326000s!
accuracy: 87.23%
******************* SVM ********************
training took 3152.369000s!
accuracy: 94.35%
******************* GBDT ********************
training took 7623.761000s!
accuracy: 96.18%
On this dataset the clusters are well separated (if you know this dataset, you can see it from its t-SNE map). The task is simple; it has long been considered a toy dataset in the deep learning community, which is why KNN works so well here. GBDT is an excellent algorithm; in Kaggle and other big data competitions it can often be seen among the top finishers. "Three cobblers combined make a Zhuge Liang" (many weak learners can rival a strong one) is once again proven reasonable, especially when the weak learners complement each other!
There is another very effective method in practice: fuse these classifiers and let them make decisions jointly, for example by a simple vote. The effect is very good. I suggest everyone try it in practice.
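As a minimal sketch of such fusion (my own illustration, not from the original post), the helper below takes several models trained by the script above and fuses their predictions by majority vote:

# Majority-vote fusion sketch: stack each model's predictions and, for every test
# sample, pick the label predicted most often. Assumes train_x/train_y/test_x come
# from read_data() and the classifier functions defined earlier.
import numpy as np

def vote_predict(models, test_x):
    # Shape (n_models, n_samples): one row of predictions per model
    all_preds = np.array([m.predict(test_x) for m in models])
    fused = []
    for col in all_preds.T:                          # one column per test sample
        labels, counts = np.unique(col, return_counts=True)
        fused.append(labels[np.argmax(counts)])      # most frequent label wins
    return np.array(fused)

# Usage sketch:
# models = [knn_classifier(train_x, train_y),
#           random_forest_classifier(train_x, train_y),
#           gradient_boosting_classifier(train_x, train_y)]
# predict = vote_predict(models, test_x)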