PySpark Machine Learning (2): GBDT

Source: Internet
Author: User
Tags: pyspark


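This article trains and evaluates a gradient-boosted decision tree (GBDT) classifier in PySpark, using GBTClassifier from pyspark.ml. The listing assumes a notebook environment that already provides a `spark` session (the first line, %pyspark, is a notebook interpreter directive, not Python). The code is as follows: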


%pyspark
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import StringIndexer
from numpy import allclose
from pyspark.sql import Row  # Row is used below to build the labeled rows
from pyspark.sql.types import *
#1. Read data
data = spark.sql("""select * from XXX""")
#2. Construct training data set
dataSet = data.rdd.map(list)
(trainData, testData) = dataSet.randomSplit([0.75, 0.25])
# The last column is the label; all preceding columns are numeric features
trainingSet = trainData.map(lambda x: Row(label=x[-1], features=Vectors.dense(x[:-1]))).toDF()
train_num = trainingSet.count()
print("number of training samples: {}".format(train_num))
#print(trainingSet.show())
#3. Use gbdt for training
stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(trainingSet)
tf = si_model.transform(trainingSet)
gbdt = GBTClassifier(maxIter=50, maxDepth=6, labelCol="indexed", seed=42)
gbdtModel = gbdt.fit(tf)
print(gbdtModel.featureImportances)
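#4. test
# NOTE: testData from the random split above is never used; this step runs a
# separate query instead. If XXX is the same table as in step 1, the test set
# overlaps the training data, and testData should be used here instead.
data = spark.sql("""select * from XXX""")
#Construct test data set
testSet = data.rdd.map(list).map(lambda x:Row(label=x[-1], features=Vectors.dense(x[:-1]))).toDF()
print("number of test samples: {}".format(testSet.count()))
#print(testSet.show())
# Reuse the StringIndexer model fitted on the training set so that labels map to
# the same indices as in training (refitting on the test set can reorder them)
test_tf = si_model.transform(testSet)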
result = gbdtModel.transform(test_tf)
#result.show()
#5. Classification effect evaluation
total_amount=result.count()
correct_amount = result.filter(result.indexed==result.prediction).count()
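# Overall accuracy: share of rows whose prediction equals the indexed label
# (float() keeps this a true division under Python 2 as well)
precision_rate = correct_amount / float(total_amount)
print("the prediction accuracy is: {}".format(precision_rate))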
positive_precision_amount = result.filter(result.indexed == 0).filter(result.prediction == 0).count()
negative_precision_amount = result.filter(result.indexed == 1).filter(result.prediction == 1).count()
positive_false_amount = result.filter(result.indexed == 0).filter(result.prediction == 1).count()
negative_false_amount = result.filter(result.indexed == 1).filter(result.prediction == 0).count()
print("correctly predicted positive samples: {}, correctly predicted negative samples: {}".format(positive_precision_amount, negative_precision_amount))
positive_amount = result.filter(result.indexed == 0).count()
negative_amount = result.filter(result.indexed == 1).count()
print("positive sample number: {}, negative sample number: {}".format(positive_amount, negative_amount))
print("number of positive sample prediction errors: {}, number of negative sample prediction errors: {}".format(positive_false_amount, negative_false_amount))
# Per-class recall: correctly predicted rows of a class / all rows of that class
recall_rate1 = positive_precision_amount / float(positive_amount)
recall_rate2 = negative_precision_amount / float(negative_amount)
print("positive sample recall rate is: {}, negative sample recall rate is: {}".format(recall_rate1, recall_rate2))
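
The evaluation in step 5 mixes a few different metrics, so a small worked example may help. The value stored in precision_rate is the overall accuracy, and recall_rate1 and recall_rate2 are per-class recall. Also note that StringIndexer, with its default ordering, assigns index 0 to the most frequent label, so the "positive" class (index 0) is whichever label occurs most often in the training data. The following is a minimal plain-Python sketch of the same arithmetic, not part of the original listing. The function name binary_metrics and the sample pairs are hypothetical; in the notebook, the pairs could presumably be obtained from result.select("indexed", "prediction").collect().

```python
def binary_metrics(pairs):
    """Compute accuracy plus per-class recall and precision.

    `pairs` is an iterable of (actual_index, predicted_index) values,
    a hypothetical stand-in for rows collected from the `result` DataFrame.
    """
    # Confusion counts keyed by (actual, predicted), all starting at zero.
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for actual, predicted in pairs:
        counts[(int(actual), int(predicted))] += 1

    total = sum(counts.values())
    correct = counts[(0, 0)] + counts[(1, 1)]

    def ratio(num, den):
        # Return 0.0 for an empty denominator instead of raising an error.
        return num / float(den) if den else 0.0

    return {
        "accuracy": ratio(correct, total),
        # Recall: correct predictions of a class / all actual rows of that class.
        "recall_0": ratio(counts[(0, 0)], counts[(0, 0)] + counts[(0, 1)]),
        "recall_1": ratio(counts[(1, 1)], counts[(1, 1)] + counts[(1, 0)]),
        # Precision: correct predictions of a class / all rows predicted as it.
        "precision_0": ratio(counts[(0, 0)], counts[(0, 0)] + counts[(1, 0)]),
        "precision_1": ratio(counts[(1, 1)], counts[(1, 1)] + counts[(0, 1)]),
    }


# Made-up sample: four rows of class 0 and two rows of class 1.
sample = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
metrics = binary_metrics(sample)
print(metrics)
```

Collecting rows to the driver is only reasonable for small result sets; for large ones, the filter and count approach in the listing keeps the work on the cluster.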



