After not doing this for a while, I found I had forgotten some of the steps, so this article records them; I will add to it as things change.
1 Basic steps: environment deployment and data preparation
The data file is a .csv file; from Excel, simply save the spreadsheet as comma-delimited (CSV).
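For illustration only, such a file might look like the sketch below (the column names and values are made up; the code in step 2 skips the header line because it contains "Electricity" and parses every remaining field as a number):

    Electricity,Water,Gas
    0.21,0.34,0.55
    0.25,0.30,0.49
    8.10,7.90,8.30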
2 Edit the code in IDEA and build the jar package
Refer to the following links:
Deploying a Spark development environment with IntelliJ IDEA on Windows
Exploring Spark development with IDEA (I)
Exploring Spark development with IDEA (II)
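As a side note, if the project is managed with sbt, a minimal sketch of the build dependencies might look like the following (the Scala and Spark versions are assumptions; match them to your cluster):

    name := "kmeansbeijing"
    scalaVersion := "2.11.8"
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "2.2.0" % "provided",
      "org.apache.spark" %% "spark-mllib" % "2.2.0" % "provided"
    )

Marking the Spark artifacts as "provided" keeps them out of the jar, since the cluster supplies them at runtime.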
K-means clustering code reference:
package main.scala.yang.spark

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansBeijing {
  def main(args: Array[String]): Unit = {
    // Suppress unnecessary log output
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    // Set the run environment
    val conf = new SparkConf().setMaster("local").setAppName("KMeansBeijing")
    val sc = new SparkContext(conf)

    // Load the data set, skipping the column-name (header) line
    val data = sc.textFile("file:///home/hadoop/yang/USA/AUG_tag.csv", 1)
    val parsedData = data.filter(line => !isColumnNameLine(line))
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()

    // Train the model: cluster the data into 4 classes over 20 iterations
    val numClusters = 4
    val numIterations = 20
    val model = KMeans.train(parsedData, numClusters, numIterations)

    // Print the cluster centers of the model
    println("Cluster centers:")
    for (c <- model.clusterCenters) {
      println("  " + c.toString)
    }

    // Evaluate the model by the within-set sum of squared errors
    val cost = model.computeCost(parsedData)
    println("Within Set Sum of Squared Errors = " + cost)

    // Use the model to classify single data points
    // println("Vector 0.2 0.2 0.2 belongs to cluster: " + model.predict(Vectors.dense("0.2 0.2 0.2".split(' ').map(_.toDouble))))
    // println("Vector 0.25 0.25 0.25 belongs to cluster: " + model.predict(Vectors.dense("0.25 0.25 0.25".split(' ').map(_.toDouble))))
    // println("Vector 8 8 8 belongs to cluster: " + model.predict(Vectors.dense("8 8 8".split(' ').map(_.toDouble))))

    // Cross-evaluation 1: save only the cluster assignments
    val testData = data.filter(line => !isColumnNameLine(line))
      .map(s => Vectors.dense(s.split(',').map(_.toDouble)))
    val result1 = model.predict(testData)
    result1.saveAsTextFile("file:///home/hadoop/yang/USA/AUG/result1")

    // Cross-evaluation 2: save each data row together with its cluster assignment
    data.filter(line => !isColumnNameLine(line)).map { line =>
      val lineVector = Vectors.dense(line.split(',').map(_.toDouble))
      val prediction = model.predict(lineVector)
      line + " " + prediction
    }.saveAsTextFile("file:///home/hadoop/yang/USA/AUG/result2")

    sc.stop()
  }

  private def isColumnNameLine(line: String): Boolean =
    line != null && line.contains("Electricity")
}
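One caveat: setMaster("local") is hard-coded in the SparkConf above, and properties set directly on a SparkConf take precedence over flags passed to spark-submit, so this program will run in local mode even when submitted to the cluster in step 4. To let spark-submit's --master flag take effect, remove the setMaster call from the code.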
3 Upload the jar package from the local machine to the Hadoop platform server via WinSCP
Note: Simply drag and drop
4 Execute the relevant commands on the Hadoop platform via SecureCRT
4.1 Enter the Spark folder
4.2 Submit the task (jar package) to the cluster via the spark-submit command (see the example commands after the note below)
4.3 View the results with WinSCP
Note: steps 4.1 and 4.2 can be combined into one command:
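For example (a sketch only: the Spark installation directory /usr/local/spark, the upload directory /home/hadoop/yang, and the jar name kmeansbeijing.jar are assumptions; the class name comes from the code in step 2):

    # 4.1 and 4.2 as two steps
    cd /usr/local/spark
    bin/spark-submit --class main.scala.yang.spark.KMeansBeijing /home/hadoop/yang/kmeansbeijing.jar

    # or combined into a single command via the absolute path
    /usr/local/spark/bin/spark-submit --class main.scala.yang.spark.KMeansBeijing /home/hadoop/yang/kmeansbeijing.jar

When the job finishes, result1 and result2 under /home/hadoop/yang/USA/AUG are directories (written by saveAsTextFile) containing part-0000x files plus a _SUCCESS marker; these can be browsed and downloaded with WinSCP as in step 4.3.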