Scikit-learn is a very powerful Python machine learning toolkit:
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
S1. Import data
Most data is formatted as M n-dimensional vectors, divided into a training set and a test set, so knowing how to import vector (matrix) data is the most critical point. We use NumPy to help. Suppose the data format is:
Stock Price   Indicator1   Indicator2
2.0           123          1252
1.0           ..           ..
..            ..           ..
Import code reference:
import numpy as np
f = open("filename.txt")
f.readline()              # skip the header line
data = np.loadtxt(f)
X = data[:, 1:]           # select columns 1 through the end (the indicators)
y = data[:, 0]            # select column 0 (the stock price)
Data import in LIBSVM format:
>>> from sklearn.datasets import load_svmlight_file
>>> X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")
...
>>> X_train.todense()     # convert the sparse matrix to a dense feature matrix
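One practical note, as a minimal sketch (the test-file path here is hypothetical): when a separate test file is loaded the same way, passing n_features keeps the train and test matrices the same width even if the test file happens to omit the highest-numbered features.
>>> X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")
>>> X_test, y_test = load_svmlight_file("/path/to/test_dataset.txt",    # hypothetical path
...                                     n_features=X_train.shape[1])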
For more data import formats and dataset construction, see: http://scikit-learn.org/stable/datasets/index.html
S2. Several common supervised classification methods:
Logistic Regression
>>> from sklearn.linear_model import LogisticRegression
>>> clf2 = LogisticRegression().fit(X, y)
>>> clf2
LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True,
                   penalty='l2', tol=0.0001)
>>> clf2.predict_proba(x_new)
array([[9.07512928e-01, 9.24770379e-02, 1.00343962e-05]])
Linear SVM (linear kernel)
>>> from sklearn.svm import LinearSVC
>>> clf = LinearSVC()
>>> clf.fit(X, y)
>>> x_new = [[5.0, 3.6, 1.3, 0.25]]
>>> clf.predict(x_new)    # result[0] is the class label
array([0], dtype=int32)
SVM (RBF or other kernel)
>>> from sklearn import svm
>>> clf = svm.SVC()
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
    gamma=0.0, kernel='rbf', probability=False, shrinking=True,
    tol=0.001, verbose=False)
>>> clf.predict([[2., 2.]])
array([1.])
Naive Bayes (Gaussian likelihood)
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn import datasets
>>> gnb = GaussianNB()
>>> gnb = gnb.fit(X, y)
>>> gnb.predict(xx)       # result[0] is the most likely class label
Decision Tree (classification, not regression)
>>> from sklearn import tree
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, y)
>>> clf.predict([[2., 2.]])
array([1.])
Ensemble (random forest, classification, not regression)
>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = RandomForestClassifier(n_estimators=10)
>>> clf = clf.fit(X, y)
>>> clf.predict(x_test)
S3. Model Selection (cross-validation)
Splitting the data into training and test sets by hand works, of course, but the more convenient way is to do it automatically. Scikit-learn has the relevant functions; the cross-validation code is recorded here:
>>> from sklearn import cross_validation
>>> from sklearn import svm
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_validation.cross_val_score(clf, iris.data, iris.target, cv=5)   # 5-fold CV
# change the metric
>>> from sklearn import metrics
>>> cross_validation.cross_val_score(clf, iris.data, iris.target, cv=5,
...                                  score_func=metrics.f1_score)
# F1 score: http://en.wikipedia.org/wiki/F1_score
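For the hand-split approach mentioned above, scikit-learn also provides a helper. A minimal sketch using train_test_split from the same old-style cross_validation module (the 40% test fraction is an arbitrary choice for illustration):
>>> from sklearn.cross_validation import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(
...     iris.data, iris.target, test_size=0.4, random_state=0)   # hold out 40% for testing
>>> svm.SVC(kernel='linear', C=1).fit(X_train, y_train).score(X_test, y_test)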
More about cross-validation: http://scikit-learn.org/stable/modules/cross_validation.html
Note: if using LR, clf = LogisticRegression().
S4. Sign Prediction Experiment
The dataset, Epinions, contains trust and distrust relationships between users, as well as interaction information (votes on the usefulness of user reviews).
Features: network topology features following "Predicting positive and negative links in online social networks", plus user interaction features.
There are 3 categories of instances in total, with 3 training + test runs per category; the training data is 10 times the size of the test data, roughly 80,000 vectors of dimension 29/5/34. The conclusions:

On running time: GNB is fastest (every instance finishes in 2-3 seconds); DT is very fast (one category of instance takes only 1 second, the others up to 4 seconds); LR is fast (the three categories take 2 seconds, 5 seconds, and ~30 seconds); RF is not slow (one instance takes 9 seconds, another 26 seconds); linear-kernel SVM is several times slower than LR (every instance takes more than 30 seconds); and RBF-kernel SVM is 20+ times to a hundredfold slower than linear SVM (the first instance took 11 minutes, the second ran for nearly two hours).

On accuracy: RF > LR > DT > GNB > SVM (RBF kernel) > SVM (linear kernel). GNB, SVM (linear kernel), and SVM (RBF kernel) fall far behind on the second category of instance (by 10%-20%); LR and DT are similar; RF does reflect the power of ensemble methods, with a significant improvement over LR (nearly 2%-4%). (Note: by the time this article was posted, the RBF version of SVM had only finished two instances in a single test, so the above results are based only on that.) I also tried methods such as SGD; overall they were not particularly impressive, so I do not record them here.

On feature effectiveness: the user interaction features are more effective than the network topology features by five to ten percentage points.
S5. Common test source code
Here is Python code I wrote for automatic classification using many algorithms, including those above, with 10-fold cross-validation. As long as the input file keeps the format described at the beginning of this article (and contains no comment lines), you can use it to test the classification performance of a variety of different algorithms.
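A minimal sketch of such a harness (not the original script; it assumes a whitespace-separated file with the label in column 0, as in S1, and uses the same old-style cross_validation API as above):
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn import cross_validation

# load the data: whitespace-separated, label in column 0, no header/comment lines
data = np.loadtxt("filename.txt")
X, y = data[:, 1:], data[:, 0]

# the classifiers compared in S4
classifiers = [
    ("LR", LogisticRegression()),
    ("GNB", GaussianNB()),
    ("DT", DecisionTreeClassifier()),
    ("RF", RandomForestClassifier(n_estimators=10)),
    ("Linear SVM", LinearSVC()),
    ("RBF SVM", SVC(kernel='rbf')),
]

# run 10-fold cross-validation for each and report mean accuracy
for name, clf in classifiers:
    scores = cross_validation.cross_val_score(clf, X, y, cv=10)   # 10-fold CV
    print("%s: mean accuracy %.4f (+/- %.4f)" % (name, scores.mean(), scores.std()))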