MLlib supports a variety of methods for binary classification, multiclass classification, and regression analysis, as summarized below:
| Problem category | Supported methods |
| --- | --- |
| Binary classification | Linear support vector machines, logistic regression, decision trees, naive Bayes |
| Multiclass classification | Decision trees, naive Bayes |
| Regression | Linear least squares, Lasso, ridge regression, decision trees |
- Linear models
- Binary classification (support vector machines, logistic regression)
- Linear regression (least squares, Lasso, ridge)
- Decision trees
- Naive Bayes
Linear models
- Mathematical formulation
- Loss functions
- Regularization
- Optimization
- Binary classification
- Linear support vector machines (SVMs)
- Logistic regression
- Evaluation metrics
- Examples
- Linear least squares, Lasso, and ridge regression
- Streaming linear regression
- Implementation (developer)
Mathematical formulation
Loss functions
Regularization
Optimization
Binary classification
The goal of a binary classification problem is to divide items into two categories: positive and negative examples. MLlib supports two binary classification methods: linear support vector machines (SVMs) and logistic regression. Both methods support L1 and L2 regularization. The training data is represented in MLlib's LabeledPoint format. Note: in the mathematical formulas, the training label \(y\) is written as +1 (positive) or -1 (negative) for notational convenience; however, in MLlib, for compatibility with the multiclass case, the negative label is represented by 0 instead of -1.
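The label conventions above can be made concrete with a small sketch: the mathematics uses \(y \in \{+1, -1\}\), while MLlib stores the negative class as 0. This is plain Scala for illustration only; the helper name `toPlusMinus` is not part of MLlib.

```scala
// Sketch of the two label encodings: MLlib's {1, 0} vs. the formulas' {+1, -1}.
object LabelEncoding {
  // Convert an MLlib-style label (1.0 positive, 0.0 negative)
  // to the +1/-1 convention used in the loss-function formulas.
  def toPlusMinus(label: Double): Double =
    if (label == 1.0) 1.0 else -1.0

  def main(args: Array[String]): Unit = {
    println(toPlusMinus(1.0))  // +1.0: positive example
    println(toPlusMinus(0.0))  // -1.0: negative example, stored as 0 in MLlib
  }
}
```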
Linear support vector machines (SVMs)
The linear support vector machine (SVM) is a standard method for large-scale classification problems. Its hinge loss function is:
\(L(w; x, y) := \max\{0, 1 - y w^{T} x\}\)
L2 regularization is used by default; L1 regularization is optional, in which case the problem becomes a linear program.
The linear SVM algorithm outputs an SVM model. Given a new data point, denoted by \(x\), the model makes predictions based on the value of \(w^{T} x\). By default, if \(w^{T} x \geq 0\) the output is positive, and negative otherwise.
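The hinge loss and the default decision rule above can be sketched in plain Scala (no Spark dependency); the object and function names here are illustrative, not MLlib API:

```scala
// Minimal sketch of the hinge loss L(w; x, y) = max(0, 1 - y * w^T x)
// and the default SVM decision rule (positive if w^T x >= 0).
object HingeLossDemo {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // y is +1 for a positive example, -1 for a negative one
  def hingeLoss(w: Array[Double], x: Array[Double], y: Double): Double =
    math.max(0.0, 1.0 - y * dot(w, x))

  // Returns 1.0 (positive) or 0.0 (negative, following MLlib's 0 encoding)
  def predict(w: Array[Double], x: Array[Double]): Double =
    if (dot(w, x) >= 0.0) 1.0 else 0.0

  def main(args: Array[String]): Unit = {
    val w = Array(0.5, -1.0)
    // A correctly classified point beyond the margin incurs zero loss
    println(hingeLoss(w, Array(4.0, 0.0), 1.0))  // 0.0
    println(predict(w, Array(4.0, 0.0)))         // 1.0
  }
}
```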
Logistic regression
Logistic regression is widely used for binary classification problems. It uses the logistic loss function:
\(L(w; x, y) := \log(1 + \exp(-y w^{T} x))\)
The logistic regression algorithm outputs a logistic regression model. Given a new data point, denoted by \(x\), the model makes predictions using the logistic function:
\(f(z) = \frac{1}{1 + e^{-z}}\)
where \(z = w^{T} x\). By default, if \(f(w^{T} x) > 0.5\) the output is positive, and negative otherwise. Unlike the linear SVM, the raw output of the logistic regression model, \(f(z)\), has a probabilistic interpretation (namely, the probability that \(x\) is positive).
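The logistic function and its 0.5-threshold decision rule can be sketched in plain Scala; the names `LogisticDemo` and `predictProb` are illustrative, not MLlib API:

```scala
// Minimal sketch of f(z) = 1 / (1 + e^{-z}) and the default 0.5 threshold.
object LogisticDemo {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // f(w^T x), interpretable as the probability that x is a positive example
  def predictProb(w: Array[Double], x: Array[Double]): Double =
    sigmoid(w.zip(x).map { case (u, v) => u * v }.sum)

  def main(args: Array[String]): Unit = {
    println(sigmoid(0.0))  // 0.5: exactly on the decision boundary
    val w = Array(2.0, -1.0)
    val p = predictProb(w, Array(1.0, 1.0))  // sigmoid(1.0), roughly 0.731
    println(if (p > 0.5) "positive" else "negative")
  }
}
```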
Evaluation metrics
MLlib supports the commonly used binary classification evaluation metrics (not supported in PySpark), including precision, recall, F-measure, the receiver operating characteristic (ROC), the precision-recall curve, and the area under the curves (AUC). AUC is commonly used to compare the performance of different models, while precision/recall/F-measure help users select a prediction threshold (http://www.douban.com/note/284051363/?type=like).
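To make the threshold-dependent metrics concrete, here is a small plain-Scala sketch computing precision, recall, and F1 from assumed confusion-matrix counts; the object name and the counts are illustrative only:

```scala
// Sketch of precision, recall, and F1 from true/false positive/negative counts.
object MetricsDemo {
  // precision = TP / (TP + FP): of the points predicted positive, how many are
  def precision(tp: Double, fp: Double): Double = tp / (tp + fp)
  // recall = TP / (TP + FN): of the actual positives, how many were found
  def recall(tp: Double, fn: Double): Double = tp / (tp + fn)
  // F1 is the harmonic mean of precision and recall
  def f1(p: Double, r: Double): Double = 2 * p * r / (p + r)

  def main(args: Array[String]): Unit = {
    val (tp, fp, fn) = (8.0, 2.0, 4.0)  // assumed counts, for illustration
    val p = precision(tp, fp)           // 0.8
    val r = recall(tp, fn)              // 2/3
    println(f"precision=$p%.3f recall=$r%.3f F1=${f1(p, r)}%.3f")
  }
}
```

Sweeping the decision threshold changes these counts, which is exactly what the ROC and precision-recall curves trace out.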
Example
The following code shows how to load a sample data set, train a model using a static method on the algorithm object, and use the resulting model to make predictions and evaluate its performance.
```scala
import org.apache.spark._
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

/** Created by ******qin on 2015-1-13. */
object ClassifySVM {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("ClassifySVM")
    val sc = new SparkContext(conf)

    // Load training data in LIBSVM format
    val data = MLUtils.loadLibSVMFile(sc, args(0))
    println(data.count())

    // Split data into training and test sets (6:4)
    val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1)

    // Run the training algorithm to build the model
    val numIterations = 100
    val model = SVMWithSGD.train(training, numIterations)

    // Clear the default threshold so the model outputs raw scores
    model.clearThreshold()

    // Compute raw scores on the test set
    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }

    // Get evaluation metrics
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    val auROC = metrics.areaUnderROC()
    println("Area under ROC = " + auROC)
  }
}
```
MLlib - Classification and Regression