Python Machine Learning Toolkit Scikit-learn

Source: Internet
Author: User
Tags svm stock prices rbf kernel

Scikit-learn is a very powerful Python machine learning toolkit.

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

S1. Import data

Most datasets take the form of M N-dimensional vectors, split into a training set and a test set, so knowing how to import vector (matrix) data is the most important first step. NumPy helps here. Suppose the data file looks like this:

Stock Prices  Indicator1  Indicator2
2.0           123         1252
1.0           ..          ..
..            ..          ..

Reference code for importing:

import numpy as np
f = open("filename.txt")
f.readline()  # skip the header
data = np.loadtxt(f)
X = data[:, 1:]  # select columns 1 through end
y = data[:, 0]  # select column 0, the stock price
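The same import can be tried without a file on disk. The following is a minimal sketch, assuming a whitespace-separated layout like the one above; the sample rows and the single-word header names are made up for illustration, and `skiprows=1` is used in place of the manual `readline()` call.

```python
import io

import numpy as np

# Hypothetical file contents: a header row, then one row per sample
# with the target value in column 0.
sample = io.StringIO(
    "StockPrice Indicator1 Indicator2\n"
    "2.0 123 1252\n"
    "1.0 118 1190\n"
    "3.0 131 1304\n"
)

# skiprows=1 drops the header line before parsing the numbers.
data = np.loadtxt(sample, skiprows=1)

X = data[:, 1:]  # feature matrix: every column except the first
y = data[:, 0]   # target vector: the first column

print(X.shape, y.shape)
```

Passing a real path such as "filename.txt" to `np.loadtxt` instead of the `StringIO` object should behave the same way.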

Importing data in LIBSVM format:

>>> from sklearn.datasets import load_svmlight_file
>>> X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")
...
>>> X_train.todense()  # convert the sparse matrix to a full (dense) feature matrix
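To see what the LIBSVM loader returns, a small round trip can help. This is only a sketch: the toy matrix and the temporary file name are assumptions made for illustration, and `dump_svmlight_file` is used solely to create a file to read back.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Made-up dataset: 3 samples, 3 features, mostly zeros.
X_dense = np.array([[1.0, 0.0, 2.0],
                    [0.0, 3.0, 0.0],
                    [4.0, 0.0, 5.0]])
y = np.array([1.0, 0.0, 1.0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_dataset.txt")  # hypothetical file name
    # Write the data in LIBSVM/svmlight format, then read it back.
    dump_svmlight_file(X_dense, y, path, zero_based=True)
    X_train, y_train = load_svmlight_file(path, n_features=3, zero_based=True)

# X_train is a SciPy sparse matrix; toarray() gives a dense NumPy array.
X_full = X_train.toarray()
print(X_full)
```

Passing `n_features` and `zero_based` explicitly avoids relying on the loader's guesses when a column happens to be all zeros.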

For importing data in other formats and generating datasets, see: http://scikit-learn.org/stable/datasets/index.html

S2. Several common supervised classification methods

Logistic Regression

>>> from sklearn.linear_model import LogisticRegression
>>> clf2 = LogisticRegression().fit(X, y)
>>> clf2
LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True,
                   penalty='l2', tol=0.0001)
>>> clf2.predict_proba(X_new)
array([[ 9.07512928e-01, 9.24770379e-02, 1.00343962e-05]])
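The three-column probability row above suggests a three-class problem. As a self-contained sketch, assuming the Iris dataset as a stand-in (the snippet above does not say which data it used) and a raised `max_iter` as a precaution for convergence:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumption: Iris stands in for the article's data (4 features, 3 classes).
iris = load_iris()
X, y = iris.data, iris.target

# max_iter is raised so the solver has room to converge on unscaled features.
clf2 = LogisticRegression(max_iter=1000).fit(X, y)

X_new = [[5.0, 3.6, 1.3, 0.25]]
proba = clf2.predict_proba(X_new)  # shape (1, 3): one probability per class
label = clf2.predict(X_new)        # the class with the highest probability

print(proba, label)
```

Each row of `predict_proba` sums to 1, and `predict` returns the class whose column holds the largest value.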

Linear SVM (Linear kernel)

>>> from sklearn.svm import LinearSVC
>>> clf = LinearSVC()
>>> clf.fit(X, y)
>>> X_new = [[5.0, 3.6, 1.3, 0.25]]
>>> clf.predict(X_new)  # result[0] is the class label
array([0], dtype=int32)

SVM (RBF or other kernel)

>>> from sklearn import svm
>>> clf = svm.SVC()
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
    gamma=0.0, kernel='rbf', probability=False, shrinking=True, tol=0.001,
    verbose=False)
>>> clf.predict([[2., 2.]])
array([ 1.])

Naive Bayes (Gaussian likelihood)

>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn import datasets
>>> gnb = GaussianNB()
>>> gnb = gnb.fit(X, y)
>>> gnb.predict(X_new)  # result[0] is the most likely class label

Decision Tree (classification, not regression)

>>> from sklearn import tree
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, y)
>>> clf.predict([[2., 2.]])
array([ 1.])

Ensemble (random forests; classification, not regression)

>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = RandomForestClassifier(n_estimators=10)
>>> clf = clf.fit(X, y)
>>> clf.predict(X_test)
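All of the classifiers above share the same `fit` / `predict` interface, so they can be compared in one loop. The following is a sketch under stated assumptions: Iris is used as stand-in data, the 70/30 split and the `random_state` values are arbitrary choices, and `max_iter` is raised only to help the linear models converge.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: Iris as example data, with 30% held out for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(max_iter=10000),
    "RBF SVM": SVC(kernel="rbf"),
    "GNB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=10, random_state=0),
}

accuracy = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                   # train on the training split
    accuracy[name] = clf.score(X_test, y_test)  # mean accuracy on held-out data
    print(f"{name}: {accuracy[name]:.3f}")
```

Because every estimator exposes the same methods, adding another algorithm is a one-line change to the dictionary.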

S3. Model Selection (cross-validation)

You can of course split the data into training and test sets by hand, but doing it automatically is more convenient. Scikit-learn provides functions for this; the cross-validation code is recorded here:

>>> from sklearn import cross_validation  # removed in scikit-learn 0.20; later releases use sklearn.model_selection
>>> from sklearn import datasets, svm
>>> iris = datasets.load_iris()
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_validation.cross_val_score(clf, iris.data, iris.target, cv=5)  # 5-fold CV

# change the metric
>>> from sklearn import metrics
>>> cross_validation.cross_val_score(clf, iris.data, iris.target, cv=5, score_func=metrics.f1_score)
# F1 score: http://en.wikipedia.org/wiki/f1_score

More about cross-validation: http://scikit-learn.org/stable/modules/cross_validation.html

Note: to use LR instead, set clf = LogisticRegression().
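In current scikit-learn releases, `cross_val_score` lives in `sklearn.model_selection` and the metric is chosen with the `scoring` argument. The sketch below is one way to write the same experiment there; the choice of `"f1_macro"` is an assumption made because Iris has three classes and a plain binary F1 would not apply.

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
clf = svm.SVC(kernel="linear", C=1)

# 5-fold CV with the default metric (accuracy for classifiers).
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

# Change the metric: macro-averaged F1 across the three classes.
f1_scores = cross_val_score(clf, iris.data, iris.target, cv=5,
                            scoring="f1_macro")

# Using LR instead only changes the estimator that is passed in.
lr_scores = cross_val_score(LogisticRegression(max_iter=1000),
                            iris.data, iris.target, cv=5)

print(scores.mean(), f1_scores.mean(), lr_scores.mean())
```

Each call returns one score per fold, so taking the mean (and optionally the standard deviation) summarizes the run.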

S4. Sign Prediction Experiment

The dataset is Epinions, which records trust and distrust relationships between users, along with user interactions (ratings of how useful a user's reviews are).

Features: network topology features, following "Predicting Positive and Negative Links in Online Social Networks", and user interaction features.

There are 3 classes of instances, each trained and tested 3 times, with 10 times as much training data as test data: roughly 80,000 vectors of 29, 5, or 34 dimensions. The conclusions are as follows.

Time: GNB is the fastest (every instance finishes in 2-3 seconds). DT is very fast (one class of instances takes only 1 second, the others 4 seconds). LR is also fast (the three classes take 2 seconds, 5 seconds, and about 30 seconds). RF is not slow (one instance takes 9 seconds, the others 26 seconds). The linear-kernel SVM is several times slower than LR (every instance takes more than 30 seconds), and the RBF-kernel SVM is 20 to 100 times slower than the linear SVM (the first instance took 11 minutes, the second nearly two hours).

Accuracy: RF > LR > DT > GNB > SVM (RBF kernel) > SVM (linear kernel). GNB and the two SVMs show a large gap on the second class of instances (10%-20%). LR and DT are close to each other. RF really does show the power of ensemble methods, improving noticeably on LR (by nearly 2%-4%). (Note: at the time of posting, the RBF-kernel SVM had finished only a single test on two of the instance classes, and the results above are based on that alone.) I also tried other methods such as SGD, but the results were not particularly good overall, so they are not recorded here.

Feature effectiveness: user interaction features outperform network topology features by 5% to 10%.
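A timing comparison of this kind can be reproduced in outline as follows. This is a sketch only: the data is synthetic (generated with `make_classification`, not Epinions), binary labels are assumed, the sample count is scaled down so it runs quickly, and only the 29-feature width and the 10:1 train/test ratio are borrowed from the description above. The timings it prints will not match the figures reported here.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT Epinions): 2,200 samples, 29 features.
X, y = make_classification(n_samples=2200, n_features=29,
                           n_informative=10, random_state=0)

# 2,000 training rows and 200 test rows give the 10:1 ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=200, random_state=0)

models = {
    "GNB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=10, random_state=0),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)        # training time is included
    acc = model.score(X_test, y_test)  # so is prediction on the test set
    elapsed = time.perf_counter() - start
    results[name] = (elapsed, acc)
    print(f"{name}: {elapsed:.3f} s, accuracy {acc:.3f}")
```

Adding `SVC(kernel="linear")` and `SVC(kernel="rbf")` to the dictionary would extend the comparison to the SVM variants, at the cost of a longer run on larger data.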

S5. General-Purpose Test Source Code

This is Python code I wrote that automatically runs classification with many algorithms, including those above, under 10-fold cross-validation. As long as the input file keeps the format described at the beginning of this article (and contains no comment lines), you can use it to test the classification performance of a variety of algorithms.
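The script itself is not reproduced on this page. As a rough sketch of what such a tool might look like (not the author's code), the hypothetical `evaluate` function below reads a file in the S1 layout and runs 10-fold cross-validation for several classifiers. It assumes that column 0 holds integer class labels; the demo data is synthetic and written to an in-memory buffer rather than a real file.

```python
import io

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def evaluate(source, classifiers, folds=10):
    """Return the mean k-fold CV accuracy of each classifier.

    `source` is a path or file-like object with one header row,
    the class label in column 0, and features in the other columns.
    """
    data = np.loadtxt(source, skiprows=1)
    X = data[:, 1:]
    y = data[:, 0].astype(int)  # assumption: labels are integer-valued
    return {name: cross_val_score(clf, X, y, cv=folds).mean()
            for name, clf in classifiers.items()}


# Demo on synthetic data written in the expected text layout.
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     random_state=0)
buf = io.StringIO()
np.savetxt(buf, np.column_stack([y_demo, X_demo]),
           header="Label F1 F2 F3 F4 F5", comments="")
buf.seek(0)

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "GNB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=10, random_state=0),
}
results = evaluate(buf, classifiers)
for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

Calling `evaluate("filename.txt", classifiers)` on a real file in the same layout should work the same way.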
