Previously, we applied a random forest algorithm to the Titanic survivor prediction dataset. In fact, there are many open-source implementations available: both scikit-learn, a machine learning package for a single machine, and Spark MLlib, a distributed one, are very good choices.
Spark is a popular distributed computing framework that supports both cluster mode and local standalone mode. Because it is developed in Scala, it supports Scala natively; and because Python is widely used in fields such as scientific computing, Spark also provides a Python interface.
Spark's common operations are described in the official documentation:
http://spark.apache.org/docs/latest/programming-guide.html
Type the following commands in the terminal to switch to the Spark directory and enter the interactive environment:
cd $SPARK_HOME
cd ./bin
./pyspark
You should see the Python version number and the Spark logo.
At this point you are in an interactive shell: each statement you enter is executed immediately and its output printed. Alternatively, you can write the script in advance, save it as filename, and then run:
./spark-submit filename
The detailed code is given below:
import pandas as pd
import numpy as np
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

# Re-encode categorical variables that have more than two levels,
# so the dataset can be turned into LabeledPoint format
# df = pd.read_csv('/home/kim/t.txt', index_col=0)
# for col in ['Pclass', 'embrk']:
#     values = df[col].drop_duplicates()
#     for v in values:
#         col_name = col + str(v)
#         df[col_name] = (df[col] == v)
#         df[col_name] = df[col_name].apply(lambda x: int(x))
# df = df.drop(['Pclass', 'embrk'], axis=1)
# df.to_csv('train_data')

# Read the dataset into a resilient distributed dataset (RDD); since this is
# supervised learning, it must be converted into the model input
# format, LabeledPoint
sc = SparkContext()
rdd = sc.textFile('/home/kim/train')
train = rdd.map(lambda x: x.split(','))
# Column 0 is the row index, column 1 the label, the rest are features
train = train.map(lambda line: LabeledPoint(float(line[1]),
                                            [float(x) for x in line[2:]]))

# Model training
model = RandomForest.trainClassifier(
    train, numClasses=2, categoricalFeaturesInfo={}, numTrees=1000,
    featureSubsetStrategy="auto", impurity='gini', maxDepth=4, maxBins=32)

# For an RDD of LabeledPoint objects, the features attribute returns the
# input variables and the label attribute returns the true class
data_p = train.map(lambda lp: lp.features)
v = train.map(lambda lp: lp.label)
prediction = model.predict(data_p)
vp = v.zip(prediction)

# Finally, print the model's error on the training set
mse = vp.map(lambda x: abs(x[0] - x[1])).sum() / vp.count()
print("MEAN SQUARE ERROR: " + str(mse))
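As an aside, the manual re-encoding in the commented-out block above can also be done with pandas' built-in pd.get_dummies. The following is a minimal sketch of that idea; the column names Pclass and embrk come from the code above, while the small sample frame itself is made up purely for illustration:

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up)
df = pd.DataFrame({
    'Survived': [0, 1, 1],
    'Pclass': [3, 1, 2],
    'embrk': ['S', 'C', 'S'],
})

# One-hot encode the multi-level categorical columns and drop the
# originals; get_dummies does both steps at once
encoded = pd.get_dummies(df, columns=['Pclass', 'embrk'], dtype=int)

print(sorted(encoded.columns))
```

This produces one indicator column per level (Pclass_1, Pclass_2, Pclass_3, embrk_C, embrk_S), matching what the explicit loop in the comments builds by hand.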
You can experiment further later, for example:
use larger datasets;
split the dataset into a training set and a test set, fit the model on the training set, and evaluate its performance on the test set;
try the other algorithms in MLlib and compare their results.
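To illustrate the second suggestion, here is a minimal sketch of the train/test-split idea in plain Python, so it can be run without a Spark cluster; in Spark itself you would typically use the RDD's randomSplit method instead. The dataset and the trivial majority-class "model" are made up for illustration only:

```python
import random

# Toy labeled dataset: (features, label) pairs; values are made up
data = [([i, i % 3], i % 2) for i in range(100)]

# Shuffle with a fixed seed, then hold out 30% as the test set
random.seed(42)
random.shuffle(data)
split = int(len(data) * 0.7)
train, test = data[:split], data[split:]

# A trivial "model": always predict the most common training label
labels = [y for _, y in train]
majority = max(set(labels), key=labels.count)

# Evaluate on the held-out test set, never on the training data
accuracy = sum(1 for _, y in test if y == majority) / len(test)
print("test accuracy:", accuracy)
```

The key point is that the error measured on held-out data is an honest estimate of generalization, unlike the training-set error computed in the code above.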
Everyone is welcome to discuss this with me.