Spark Training Classification Model exercises (1)

Source: Internet
Author: User
Tags: svm

This article contains study notes for Chapter 5 of Spark Machine Learning.
The dataset used is the experimental data set train.tsv.

The data meanings of each column are:
"url" "urlid" "boilerplate" "alchemy_category" "alchemy_category_score" "avglinksize" "commonlinkratio_1" "commonlinkratio_2" "commonlinkratio_3" "commonlinkratio_4" "compression_ratio" "embed_ratio" "framebased" "frametagratio" "hasDomainLink" "html_ratio" "image_ratio" "is_news" "lengthyLinkDomain" "linkwordscore" "news_front_page" "non_markup_alphanum_characters" "numberOfLinks" "numwords_in_url" "parametrizedLinkRatio" "spelling_errors_ratio" "label"

The first four columns are: link address, page ID, page content, and page category.
The next 22 columns are numeric or categorical features.
The last column is the target value: 1 means evergreen (long-lasting) content; 0 means ephemeral content.

Use sed on the Linux command line to remove the header row:

$ sed 1d train.tsv > train_noheader.tsv

Start spark-shell:

val rawData = sc.textFile("file:///home/hadoop/train_noheader.tsv")
val records = rawData.map(line => line.split("\t"))
records.first()

The output is:
Array(http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html, "4042", "{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Center campus in San Jose California Photographer Tony Avelar Bloomberg Buildings stand at the International Business Machines Corp IBM Almaden Center campus in the Santa Teresa Hills of San Jose California Photographer Tony Avelar Bloomberg by your mobile phone would project a 3 D image of anyone who calls and your laptop would be powered by kinetic energy at least that s what International Business Machines Corp sees ...

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Remove the superfluous quotation marks, replace missing values ("?") with 0.0,
// and generate the LabeledPoint training data
val data = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val features = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
  LabeledPoint(label, Vectors.dense(features))
}

// Convert negative feature values to 0, as required for naive Bayes training
val nbData = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val features = trimmed.slice(4, r.size - 1)
    .map(d => if (d == "?") 0.0 else d.toDouble)
    .map(d => if (d < 0) 0.0 else d)
  LabeledPoint(label, Vectors.dense(features))
}

// Train the classification models
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD // logistic regression
import org.apache.spark.mllib.classification.SVMWithSGD                // SVM
import org.apache.spark.mllib.classification.NaiveBayes                // naive Bayes
import org.apache.spark.mllib.tree.DecisionTree                        // decision tree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.Entropy                    // entropy impurity

val numIterations = 10
val maxTreeDepth = 5

// Train each model
val lrModel = LogisticRegressionWithSGD.train(data, numIterations)
val svmModel = SVMWithSGD.train(data, numIterations)
val nbModel = NaiveBayes.train(nbData)
val dtModel = DecisionTree.train(data, Algo.Classification, Entropy, maxTreeDepth)

It is simple to use a trained model to predict unseen data; take logistic regression as an example:

val dataPoint = data.first
val prediction = lrModel.predict(dataPoint.features) // predict from the feature vector
// Compare against the true label: dataPoint.label
// Inspect the features: dataPoint.features

Output: prediction: Double = 1.0

2 Classification performance evaluation

2.1 Accuracy and error rate

Accuracy: the number of correctly classified samples divided by the total number of samples (positive + negative) in the training set.
Error rate: the number of misclassified samples divided by the total number of samples (positive + negative) in the training set.
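As a plain-Scala illustration of these two definitions (no Spark required; the labels and predictions below are made up):

```scala
// Made-up true labels and model predictions for five samples
val actual    = Seq(1, 0, 1, 1, 0)
val predicted = Seq(1, 0, 0, 1, 1)

// Accuracy = number of correct predictions / total number of samples
val numCorrect = actual.zip(predicted).count { case (a, p) => a == p }
val accuracy   = numCorrect.toDouble / actual.size // 3 correct out of 5 = 0.6

// Error rate = number of wrong predictions / total = 1 - accuracy
val errorRate = 1.0 - accuracy // 0.4
```

The Spark code below computes the same quantity, but distributed over an RDD of LabeledPoints.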

// Compute the accuracy of each model
val numData = data.count

// lr
val lrTotalCorrect = data.map { point =>
  if (lrModel.predict(point.features) == point.label) 1 else 0
}.sum
val lrAccuracy = lrTotalCorrect / numData

// svm
val svmTotalCorrect = data.map { point =>
  if (svmModel.predict(point.features) == point.label) 1 else 0
}.sum

// nb
val nbTotalCorrect = nbData.map { point =>
  if (nbModel.predict(point.features) == point.label) 1 else 0
}.sum

// dt: the decision tree outputs a score, so a threshold is needed
val dtTotalCorrect = data.map { point =>
  val score = dtModel.predict(point.features)
  val predicted = if (score > 0.5) 1 else 0
  if (predicted == point.label) 1 else 0
}.sum

// Compute the accuracies
val svmAccuracy = svmTotalCorrect / numData
val nbAccuracy = nbTotalCorrect / numData
val dtAccuracy = dtTotalCorrect / numData

Output Result:
2.2 Precision, recall, and the PR curve

Definitions, in binary classification:

Precision: the number of true positives divided by the sum of true positives and false positives. (A true positive is a sample of class 1 correctly predicted as 1; a false positive is a sample of class 0 incorrectly predicted as 1.)
Meaning: the proportion of returned positives that are actually positive (evaluates the quality of the results).
Recall: the number of true positives divided by the sum of true positives and false negatives, where a false negative is a sample of class 1 incorrectly predicted as 0.
Meaning: a recall of 100% means every positive sample was detected (evaluates the completeness of the results).
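A minimal plain-Scala sketch of these definitions, using made-up labels and predictions:

```scala
// Made-up true labels and predictions (class 1 = positive)
val actual    = Seq(1, 1, 1, 0)
val predicted = Seq(1, 0, 0, 1)

val tp = actual.zip(predicted).count { case (a, p) => a == 1 && p == 1 } // true positives:  1
val fp = actual.zip(predicted).count { case (a, p) => a == 0 && p == 1 } // false positives: 1
val fn = actual.zip(predicted).count { case (a, p) => a == 1 && p == 0 } // false negatives: 2

val precision = tp.toDouble / (tp + fp) // 1 / 2 = 0.5
val recall    = tp.toDouble / (tp + fn) // 1 / 3
```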
The PR curve plots recall on the horizontal axis against precision on the vertical axis.

2.3 ROC curve and AUC

The ROC curve is a graphical plot of the true positive rate against the false positive rate.

True positive rate (TPR): the number of true positives divided by the sum of true positives and false negatives.
False positive rate (FPR): the number of false positives divided by the sum of false positives and true negatives.

Ideally the area under the ROC curve (AUC) is 1; the closer to 1, the better the classifier.
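The TPR and FPR definitions can be sketched in plain Scala with a made-up set of (score, label) pairs and a fixed threshold of 0.5:

```scala
// Made-up model scores with true labels (1 = positive, 0 = negative)
val scoreAndLabel = Seq((0.9, 1), (0.8, 0), (0.6, 1), (0.3, 1), (0.2, 0))

// Apply a 0.5 threshold to turn scores into predicted classes
val results = scoreAndLabel.map { case (score, label) =>
  (if (score > 0.5) 1 else 0, label)
}

val tp = results.count { case (p, l) => p == 1 && l == 1 } // 2
val fn = results.count { case (p, l) => p == 0 && l == 1 } // 1
val fp = results.count { case (p, l) => p == 1 && l == 0 } // 1
val tn = results.count { case (p, l) => p == 0 && l == 0 } // 1

val tpr = tp.toDouble / (tp + fn) // 2/3: one point on the ROC curve
val fpr = fp.toDouble / (fp + tn) // 1/2
```

Sweeping the threshold from 0 to 1 traces out the full ROC curve; Spark's BinaryClassificationMetrics, used below, does this sweep internally when computing the area under the curve.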

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// LR and SVM
val metrics = Seq(lrModel, svmModel).map { model =>
  val scoreAndLabels = data.map { point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR, metrics.areaUnderROC)
}

// NB: threshold the score at 0.5
val nbMetrics = Seq(nbModel).map { model =>
  val scoreAndLabels = nbData.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR, metrics.areaUnderROC)
}

// DT decision tree: also threshold the score at 0.5
val dtMetrics = Seq(dtModel).map { model =>
  val scoreAndLabels = data.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR, metrics.areaUnderROC)
}

// Print all results
val allMetrics = metrics ++ nbMetrics ++ dtMetrics
allMetrics.foreach { case (m, pr, roc) =>
  println(f"$m, Area under PR: ${pr * 100.0}%2.4f%%, Area under ROC: ${roc * 100.0}%2.4f%%")
}

Result output:

None of these models performs particularly well yet; methods for parameter tuning are discussed in the next section.
