MLlib supports a variety of methods for binary classification, multiclass classification, and regression analysis, as summarized below:
| Problem category | Supported methods |
| --- | --- |
| Binary classification | Linear support vector machines, logistic regression, decision trees, naive Bayes |
| Multiclass classification | Decision trees, naive Bayes |
| Regression | Linear least squares, Lasso, ridge regression, decision trees |
- Linear models
- Binary classification (support vector machines, logistic regression)
- Linear regression (least squares, Lasso, ridge)
- Decision trees
- Naive Bayes
Linear models
- Mathematical formulation
- Loss functions
- Regularization
- Optimization
- Binary classification
- Linear support vector machines (SVMs)
- Logistic regression
- Evaluation metrics
- Examples
- Linear least squares, Lasso, and ridge regression
- Streaming linear regression
- Implementation (developer)
Mathematical formulation
Loss functions
Regularization
Optimization
Binary classification
The goal of a binary classification problem is to divide items into two categories: positive and negative examples. MLlib supports two binary classification methods: linear support vector machines (SVMs) and logistic regression. Both methods support L1 and L2 regularization. The training data is represented in MLlib's LabeledPoint format. Note: in the mathematical formulas, the training label \(y\) is written as +1 (positive) or -1 (negative) for notational convenience; however, in MLlib, for compatibility with the multiclass case, the negative label is represented by 0 instead of -1.
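The label conventions above can be made concrete with a small sketch: the mathematics uses \(y \in \{+1, -1\}\), while MLlib stores the negative class as 0. This is plain Scala for illustration only; the helper name `toPlusMinus` is not part of MLlib.

```scala
// Sketch of the two label encodings: MLlib's {1, 0} vs. the formulas' {+1, -1}.
object LabelEncoding {
  // Convert an MLlib-style label (1.0 positive, 0.0 negative)
  // to the +1/-1 convention used in the loss-function formulas.
  def toPlusMinus(label: Double): Double =
    if (label == 1.0) 1.0 else -1.0

  def main(args: Array[String]): Unit = {
    println(toPlusMinus(1.0))  // +1.0: positive example
    println(toPlusMinus(0.0))  // -1.0: negative example, stored as 0 in MLlib
  }
}
```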
Linear support vector machines (SVMs)
The linear support vector machine (SVM) is a standard method for large-scale classification problems. Its hinge loss function is:
\(L(w; x, y) := \max\{0, 1 - y w^{T} x\}\)
L2 regularization is used by default; L1 regularization is optional, in which case the problem becomes a linear program.
The linear SVM algorithm outputs an SVM model. Given a new data point, denoted by \(x\), the model makes predictions based on the value of \(w^{T} x\). By default, if \(w^{T} x \geq 0\) the output is positive, and negative otherwise.
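The hinge loss and the default decision rule above can be sketched in plain Scala (no Spark dependency); the object and function names here are illustrative, not MLlib API:

```scala
// Minimal sketch of the hinge loss L(w; x, y) = max(0, 1 - y * w^T x)
// and the default SVM decision rule (positive if w^T x >= 0).
object HingeLossDemo {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // y is +1 for a positive example, -1 for a negative one
  def hingeLoss(w: Array[Double], x: Array[Double], y: Double): Double =
    math.max(0.0, 1.0 - y * dot(w, x))

  // Returns 1.0 (positive) or 0.0 (negative, following MLlib's 0 encoding)
  def predict(w: Array[Double], x: Array[Double]): Double =
    if (dot(w, x) >= 0.0) 1.0 else 0.0

  def main(args: Array[String]): Unit = {
    val w = Array(0.5, -1.0)
    // A correctly classified point beyond the margin incurs zero loss
    println(hingeLoss(w, Array(4.0, 0.0), 1.0))  // 0.0
    println(predict(w, Array(4.0, 0.0)))         // 1.0
  }
}
```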
Logistic regression
Logistic regression is widely used for binary classification problems. It uses the logistic loss function:
\(L(w; x, y) := \log(1 + \exp(-y w^{T} x))\)
The logistic regression algorithm outputs a logistic regression model. Given a new data point, denoted by \(x\), the model makes predictions using the logistic function:
\(f(z) = \frac{1}{1 + e^{-z}}\)
where \(z = w^{T} x\). By default, if \(f(w^{T} x) > 0.5\) the output is positive, and negative otherwise. Unlike the linear SVM, the raw output of the logistic regression model, \(f(z)\), has a probabilistic interpretation (namely, the probability that \(x\) is positive).
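The logistic function and its 0.5-threshold decision rule can be sketched in plain Scala; the names `LogisticDemo` and `predictProb` are illustrative, not MLlib API:

```scala
// Minimal sketch of f(z) = 1 / (1 + e^{-z}) and the default 0.5 threshold.
object LogisticDemo {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // f(w^T x), interpretable as the probability that x is a positive example
  def predictProb(w: Array[Double], x: Array[Double]): Double =
    sigmoid(w.zip(x).map { case (u, v) => u * v }.sum)

  def main(args: Array[String]): Unit = {
    println(sigmoid(0.0))  // 0.5: exactly on the decision boundary
    val w = Array(2.0, -1.0)
    val p = predictProb(w, Array(1.0, 1.0))  // sigmoid(1.0), roughly 0.731
    println(if (p > 0.5) "positive" else "negative")
  }
}
```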
Evaluation metrics
MLlib supports the commonly used binary classification evaluation metrics (not supported in PySpark), including precision, recall, F-measure, the receiver operating characteristic (ROC), the precision-recall curve, and the area under the curves (AUC). AUC is commonly used to compare the performance of different models, while precision/recall/F-measure help users select a prediction threshold (http://www.douban.com/note/284051363/?type=like).
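To make the threshold-dependent metrics concrete, here is a small plain-Scala sketch computing precision, recall, and F1 from assumed confusion-matrix counts; the object name and the counts are illustrative only:

```scala
// Sketch of precision, recall, and F1 from true/false positive/negative counts.
object MetricsDemo {
  // precision = TP / (TP + FP): of the points predicted positive, how many are
  def precision(tp: Double, fp: Double): Double = tp / (tp + fp)
  // recall = TP / (TP + FN): of the actual positives, how many were found
  def recall(tp: Double, fn: Double): Double = tp / (tp + fn)
  // F1 is the harmonic mean of precision and recall
  def f1(p: Double, r: Double): Double = 2 * p * r / (p + r)

  def main(args: Array[String]): Unit = {
    val (tp, fp, fn) = (8.0, 2.0, 4.0)  // assumed counts, for illustration
    val p = precision(tp, fp)           // 0.8
    val r = recall(tp, fn)              // 2/3
    println(f"precision=$p%.3f recall=$r%.3f F1=${f1(p, r)}%.3f")
  }
}
```

Sweeping the decision threshold changes these counts, which is exactly what the ROC and precision-recall curves trace out.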
Example
The following code shows how to load a sample data set, train a model using a static method on the algorithm object, and use the resulting model to make predictions and evaluate its performance.
```scala
import org.apache.spark._
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

/** Created by ******qin on 2015-1-13. */
object ClassifySVM {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("ClassifySVM")
    val sc = new SparkContext(conf)

    // Load training data in LIBSVM format
    val data = MLUtils.loadLibSVMFile(sc, args(0))
    println(data.count())

    // Split data into training and test sets (6:4)
    val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1)

    // Run the training algorithm to build the model
    val numIterations = 100
    val model = SVMWithSGD.train(training, numIterations)

    // Clear the default threshold so the model outputs raw scores
    model.clearThreshold()

    // Compute raw scores on the test set
    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }

    // Get evaluation metrics
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    val auROC = metrics.areaUnderROC()
    println("Area under ROC = " + auROC)
  }
}
```
MLlib - Classification and Regression