MLlib: Classification and Regression

Source: Internet
Author: User
Tags svm

MLlib supports a variety of methods for binary classification, multiclass classification, and regression analysis, as follows:

Problem category            Supported methods
Binary classification       Linear support vector machines, logistic regression, decision trees, naive Bayes
Multiclass classification   Decision trees, naive Bayes
Regression                  Linear least squares, Lasso, ridge regression, decision trees

    • Linear models
      • Binary classification (support vector machines, logistic regression)
      • Linear regression (least squares, Lasso, ridge)
    • Decision trees
    • Naive Bayes

Linear models
    • Mathematical formulation
      • Loss functions
      • Regularization
      • Optimization
    • Binary classification
      • Linear support vector machines
      • Logistic regression
      • Evaluation metrics
      • Example
    • Linear least squares, Lasso, and ridge regression
      • Example
    • Streaming linear regression
      • Example
    • Implementation (developer)

Mathematical formulation

Loss functions

Regularization

Optimization

Binary classification

Binary classification aims to divide items into two categories: positive and negative. MLlib supports two linear methods for binary classification: linear support vector machines and logistic regression. Both methods support the L1 and L2 regularized variants. The training data is represented as an RDD of LabeledPoint, MLlib's labeled-example format. Note: in the mathematical formulas, the training label y is written as +1 (positive) or -1 (negative), which keeps the formulas convenient. In MLlib, however, the negative label is represented by 0 instead of -1, to be consistent with multiclass labeling.
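The two label conventions described above differ only by an affine map. The following is a minimal sketch in plain Scala (no Spark required); the object and function names `LabelEncoding`, `toSigned`, and `toZeroOne` are hypothetical and are not part of MLlib.

```scala
object LabelEncoding {
  // MLlib-style label in {0.0, 1.0} -> formula-style label in {-1.0, +1.0}
  def toSigned(label: Double): Double = 2.0 * label - 1.0

  // Formula-style label in {-1.0, +1.0} -> MLlib-style label in {0.0, 1.0}
  def toZeroOne(y: Double): Double = (y + 1.0) / 2.0

  def main(args: Array[String]): Unit = {
    val mllibLabels = Seq(0.0, 1.0, 1.0, 0.0)
    val signed = mllibLabels.map(toSigned)
    println(signed.mkString(", "))
    // Converting back recovers the original 0/1 labels
    println(signed.map(toZeroOne).mkString(", "))
  }
}
```

When plugging labels into the loss formulas of this section, use the signed form; when building training data for MLlib, use the 0/1 form.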

Linear support vector machines (SVMs)

The linear support vector machine (SVM) is a standard method for large-scale classification problems. It is a linear method whose loss function is the hinge loss:

\(L(w;x,y) := \max\{0, 1 - y w^{T} x\}\)

By default, linear SVMs are trained with L2 regularization. L1 regularization is supported as an alternative; in that case, the problem becomes a linear program.

The linear SVM algorithm outputs an SVM model. Given a new data point, denoted by \(x\), the model makes its prediction based on the value of \(w^{T}x\). By default, if \(w^{T}x \geq 0\) the outcome is positive, and negative otherwise.
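For reference, the regularized training objective can be written out as follows. This is a sketch of the standard formulation; the symbols \(\lambda\) (regularization parameter), \(n\) (number of training examples), and \(R\) (regularizer) are assumptions introduced here and do not appear elsewhere in this text.

```latex
\[
f(w) := \lambda\, R(w) + \frac{1}{n} \sum_{i=1}^{n} L(w; x_i, y_i),
\qquad
R_{L2}(w) = \tfrac{1}{2}\lVert w \rVert_2^2,
\qquad
R_{L1}(w) = \lVert w \rVert_1
\]
```

With the hinge loss and \(R_{L1}\), every term of \(f\) is piecewise linear in \(w\), which is why the L1 variant can be posed as a linear program.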
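The hinge loss and the default decision rule can be sketched in a few lines of plain Scala. This is an illustration of the formulas only, not MLlib's implementation; the names `LinearSvmSketch`, `dot`, `hingeLoss`, and `predict` are hypothetical, and dense arrays stand in for MLlib vectors.

```scala
object LinearSvmSketch {
  // Dot product w^T x for dense vectors of equal length
  def dot(w: Array[Double], x: Array[Double]): Double = {
    require(w.length == x.length, "w and x must have the same length")
    w.zip(x).map { case (wi, xi) => wi * xi }.sum
  }

  // Hinge loss L(w; x, y) = max{0, 1 - y * w^T x}, with y in {-1, +1}
  def hingeLoss(w: Array[Double], x: Array[Double], y: Double): Double =
    math.max(0.0, 1.0 - y * dot(w, x))

  // Decision rule: positive (1.0) if w^T x >= threshold, negative (0.0) otherwise
  def predict(w: Array[Double], x: Array[Double], threshold: Double = 0.0): Double =
    if (dot(w, x) >= threshold) 1.0 else 0.0

  def main(args: Array[String]): Unit = {
    val w = Array(0.5, -1.0)
    val x = Array(2.0, 0.5) // w^T x = 1.0 - 0.5 = 0.5
    println(hingeLoss(w, x, 1.0))  // correct side, but inside the margin
    println(hingeLoss(w, x, -1.0)) // wrong side of the boundary
    println(predict(w, x))
  }
}
```

Note that a correctly classified point still incurs a loss when \(y w^{T} x < 1\); the loss reaches zero only once the point clears the margin.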

Logistic regression

Logistic regression is widely used for binary classification problems. It is a linear method whose loss function is the logistic loss:

\(L(w;x,y) := \log(1 + \exp(-y w^{T} x))\)

The logistic regression algorithm outputs a logistic regression model. Given a new data point, denoted by \(x\), the model makes its prediction by applying the logistic function

\(f(z) = \frac{1}{1 + e^{-z}}\)

where \(z = w^{T}x\). By default, if \(f(w^{T}x) > 0.5\) the outcome is positive, and negative otherwise. Unlike the linear SVM, the raw output of the logistic regression model, \(f(z)\), has a probabilistic interpretation (that is, the probability that \(x\) is positive).
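The logistic loss and the prediction rule can be sketched in plain Scala as follows. This only illustrates the formulas and is not MLlib's implementation; the names `LogisticSketch`, `sigmoid`, `logisticLoss`, and `predict` are hypothetical.

```scala
object LogisticSketch {
  // Dot product w^T x for dense vectors of equal length
  def dot(w: Array[Double], x: Array[Double]): Double = {
    require(w.length == x.length, "w and x must have the same length")
    w.zip(x).map { case (wi, xi) => wi * xi }.sum
  }

  // Logistic function f(z) = 1 / (1 + e^{-z})
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Logistic loss L(w; x, y) = log(1 + exp(-y * w^T x)), with y in {-1, +1}
  def logisticLoss(w: Array[Double], x: Array[Double], y: Double): Double =
    math.log1p(math.exp(-y * dot(w, x)))

  // Positive (1.0) if the estimated probability exceeds the threshold, negative (0.0) otherwise
  def predict(w: Array[Double], x: Array[Double], threshold: Double = 0.5): Double =
    if (sigmoid(dot(w, x)) > threshold) 1.0 else 0.0

  def main(args: Array[String]): Unit = {
    val w = Array(0.5, -1.0)
    val x = Array(2.0, 0.5) // w^T x = 0.5
    println(sigmoid(dot(w, x)))     // estimated probability that x is positive
    println(logisticLoss(w, x, 1.0))
    println(predict(w, x))
  }
}
```

Because \(f(0) = 0.5\), thresholding the probability at 0.5 is the same as thresholding \(w^{T}x\) at 0, so the default decision boundary has the same form as for the linear SVM.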

Evaluation metrics

MLlib supports the commonly used evaluation metrics for binary classification (not available in PySpark). These include precision, recall, F-measure, the receiver operating characteristic (ROC) curve, the precision-recall curve, and the area under the curves (AUC). AUC is commonly used to compare the performance of different models, while precision, recall, and F-measure (http://www.douban.com/note/284051363/?type=like) help users choose a suitable prediction threshold.
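To make the threshold-dependent metrics concrete, the sketch below computes precision, recall, and the F1 score from (score, label) pairs at a given threshold. It is written in plain Scala and is not MLlib's BinaryClassificationMetrics; the names `BinaryMetricsSketch`, `Counts`, and `confusion` are hypothetical, and returning 1.0 for precision when nothing is predicted positive is a convention chosen here.

```scala
object BinaryMetricsSketch {
  // Confusion-matrix counts at a fixed threshold
  final case class Counts(tp: Int, fp: Int, fn: Int, tn: Int)

  // Labels are 1.0 (positive) or 0.0 (negative); predict positive when score >= threshold
  def confusion(scoreAndLabels: Seq[(Double, Double)], threshold: Double): Counts =
    scoreAndLabels.foldLeft(Counts(0, 0, 0, 0)) { case (c, (score, label)) =>
      (score >= threshold, label == 1.0) match {
        case (true, true)   => c.copy(tp = c.tp + 1)
        case (true, false)  => c.copy(fp = c.fp + 1)
        case (false, true)  => c.copy(fn = c.fn + 1)
        case (false, false) => c.copy(tn = c.tn + 1)
      }
    }

  def precision(c: Counts): Double =
    if (c.tp + c.fp == 0) 1.0 else c.tp.toDouble / (c.tp + c.fp)

  def recall(c: Counts): Double =
    if (c.tp + c.fn == 0) 0.0 else c.tp.toDouble / (c.tp + c.fn)

  // F1 is the harmonic mean of precision and recall
  def f1(c: Counts): Double = {
    val p = precision(c)
    val r = recall(c)
    if (p + r == 0.0) 0.0 else 2.0 * p * r / (p + r)
  }

  def main(args: Array[String]): Unit = {
    val data = Seq((0.9, 1.0), (0.8, 0.0), (0.6, 1.0), (0.3, 1.0), (0.1, 0.0))
    for (t <- Seq(0.2, 0.5, 0.85)) {
      val c = confusion(data, t)
      println(s"threshold=$t precision=${precision(c)} recall=${recall(c)} f1=${f1(c)}")
    }
  }
}
```

Sweeping the threshold as in `main` shows the trade-off these metrics expose: a lower threshold raises recall at the cost of precision, and the reverse for a higher one.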

Example

The following code shows how to load a sample data set, train a model on it using the static method on the algorithm object, and use the resulting model to make predictions on a held-out test set, which are then evaluated by the area under the ROC curve.

    import org.apache.spark._
    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.mllib.util.MLUtils

    /**
     * Created by ******qin on 2015-1-13.
     */
    object ClassifySVM {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("ClassifySVM")
        val sc = new SparkContext(conf)

        // Load training data in LIBSVM format; args(0) is the path to the input file
        val data = MLUtils.loadLibSVMFile(sc, args(0))
        println(data.count())

        // Split data into training (60%) and test (40%) sets
        val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
        val training = splits(0).cache()
        val test = splits(1)

        // Run the training algorithm to build the model
        val numIterations = 100
        val model = SVMWithSGD.train(training, numIterations)

        // Clear the default threshold so that predict returns raw scores
        model.clearThreshold()

        // Compute raw scores on the test set
        val scoreAndLabels = test.map { point =>
          val score = model.predict(point.features)
          (score, point.label)
        }

        // Get evaluation metrics
        val metrics = new BinaryClassificationMetrics(scoreAndLabels)
        val auROC = metrics.areaUnderROC()
        println("Area under ROC = " + auROC)
      }
    }
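To give an intuition for the number printed by the example, the sketch below computes the area under the ROC curve from (score, label) pairs using its pairwise-ranking interpretation: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. This is plain Scala and is an assumption-laden illustration, not how MLlib computes `areaUnderROC`; the name `AucSketch` is hypothetical, and the quadratic pairwise loop is only suitable for small inputs.

```scala
object AucSketch {
  // Labels are 1.0 (positive) or 0.0 (negative); scores are raw model outputs
  def areaUnderROC(scoreAndLabels: Seq[(Double, Double)]): Double = {
    val pos = scoreAndLabels.collect { case (s, l) if l == 1.0 => s }
    val neg = scoreAndLabels.collect { case (s, l) if l != 1.0 => s }
    require(pos.nonEmpty && neg.nonEmpty, "need at least one positive and one negative example")

    // 1 if the positive outranks the negative, 0.5 on a tie, 0 otherwise
    val wins = for (p <- pos; n <- neg) yield
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0

    wins.sum / (pos.size.toLong * neg.size)
  }

  def main(args: Array[String]): Unit = {
    val data = Seq((0.9, 1.0), (0.8, 0.0), (0.6, 1.0), (0.3, 1.0), (0.1, 0.0))
    println("Area under ROC = " + areaUnderROC(data))
  }
}
```

A value of 1.0 means every positive example is scored above every negative one, while 0.5 corresponds to a ranking no better than chance. This is why the example clears the threshold first: the metric needs raw scores, not 0/1 predictions.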
