A Simple Application of the Spark MLlib Random Forest Algorithm (with Code)

Tags: pyspark, spark, mllib
I previously applied a random forest algorithm to the Titanic survivors prediction dataset. In fact, there are many open-source implementations available to us: both the single-machine library scikit-learn and the distributed Spark MLlib are very good choices.
Spark is a popular distributed computing framework that supports both cluster mode and local single-machine mode. Because it is developed in Scala, Spark natively supports Scala; and because Python is widely used in fields such as scientific computing, Spark also provides a Python interface (PySpark).

Spark's common operations are described in the official documentation:
http://spark.apache.org/docs/latest/programming-guide.html


Type the following commands in a terminal to switch to the Spark directory and enter the interactive environment:

cd $SPARK_HOME

cd ./bin

./pyspark

You should see the Python version number and the Spark logo.




At this point you are in a read-eval-print loop: each statement you enter is run and its output printed immediately. Alternatively, you can edit a script in advance, save it as filename, and then run:

./spark-submit filename


The detailed code is given below:



import pandas as pd
import numpy as np
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

# Re-encode categorical variables with more than 2 levels and convert
# the dataset into LabeledPoint format
#df = pd.read_csv('/home/kim/t.txt', index_col=0)
#for col in ['Pclass', 'Embrk']:
#    values = df[col].drop_duplicates()
#    for v in values:
#        col_name = col + str(v)
#        df[col_name] = (df[col] == v)
#        df[col_name] = df[col_name].apply(lambda x: int(x))
#df = df.drop(['Pclass', 'Embrk'], axis=1)
#df.to_csv('train_data')

# Read the dataset into a resilient distributed dataset (RDD); since this
# is supervised learning, it must be converted into the model input
# format, LabeledPoint
sc = SparkContext()
rdd = sc.textFile('/home/kim/train')
train = rdd.map(lambda x: x.split(','))
train = train.map(lambda line: LabeledPoint(line[1], line[2:]))

# Model training
model = RandomForest.trainClassifier(
    train, numClasses=2, categoricalFeaturesInfo={},
    numTrees=1000, featureSubsetStrategy="auto",
    impurity='gini', maxDepth=4, maxBins=32)

# For an RDD of LabeledPoint objects, the features attribute returns the
# input variables and the label attribute returns the true class
data_p = train.map(lambda lp: lp.features)
v = train.map(lambda lp: lp.label)
prediction = model.predict(data_p)
vp = v.zip(prediction)

# Finally, print the model's error on the training set
mse = vp.map(lambda x: abs(x[0] - x[1])).sum() / vp.count()
print("MEAN SQUARE ERROR: " + str(mse))


You can experiment further afterwards, for example:

Use larger datasets;

Split the data into a training set and a test set, build the model on the training set, and evaluate its performance on the test set;

Use other algorithms in MLlib and compare their results.


Feel free to get in touch and discuss.
