Spark (11) -- MLlib API Programming: Linear Regression, KMeans, and Collaborative Filtering Demos


The Spark version tested in this article is 1.3.1.

Before using Spark's machine-learning algorithm library, you need to understand several basic concepts in MLlib and the data types dedicated to machine learning.

Feature vector Vector:

The concept is the same as a vector in mathematics; in plain terms, it is an array of double values.
Vectors come in two kinds: dense and sparse.
Here's how to create them:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val dense: Vector = Vectors.dense(array)                    // create a dense vector
val sparse: Vector = Vectors.sparse(size, indices, values)  // create a sparse vector
```

Note: Scala imports scala.collection.immutable.Vector by default. To use MLlib vectors, you must explicitly import org.apache.spark.mllib.linalg.Vector.

Regarding dense and sparse vectors:
A dense vector stores its values as an ordinary double array.
A sparse vector consists of two parallel arrays: indices and values.
For example, the vector (1.0, 0.0, 3.0) is represented in dense format as [1.0, 0.0, 3.0] and in sparse format as (3, [0, 2], [1.0, 3.0]).
The leading 3 is the length of the vector, [0, 2] is the indices array, and [1.0, 3.0] is the values array,
meaning the value at position 0 is 1.0 and the value at position 2 is 3.0.
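To make the dense/sparse correspondence concrete, here is a minimal plain-Scala sketch (no Spark required; `sparseToDense` is a made-up helper for illustration) that rebuilds the dense array from a (size, indices, values) triple:

```scala
// Rebuild the dense representation of a vector from its sparse
// (size, indices, values) form: start from all zeros, then fill in
// the stored positions.
def sparseToDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
  val dense = Array.fill(size)(0.0)
  for (i <- indices.indices) dense(indices(i)) = values(i)
  dense
}
```

For the example above, `sparseToDense(3, Array(0, 2), Array(1.0, 3.0))` yields [1.0, 0.0, 3.0].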

Labeled point LabeledPoint:

A labeled point consists of a class label (a double) and a vector (dense or sparse).
In MLlib, supervised learning algorithms such as regression and classification use LabeledPoint.

A LabeledPoint is created through the case class LabeledPoint:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
```

Matrices Matrix:

Matrices come in two kinds: local matrices and distributed matrices.
A local matrix is created as follows:

```scala
import org.apache.spark.mllib.linalg.{Matrix, Matrices}

// create a 3x2 dense matrix
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
```

As you can see, it is stored as a matrix size (3, 2) plus a one-dimensional array [1.0, 3.0, 5.0, 2.0, 4.0, 6.0], laid out in column-major order.
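A quick way to see the column-major layout: with numRows rows, the element at (row, col) sits at index col * numRows + row of the backing array. A plain-Scala sketch (no Spark; `entry` is a hypothetical helper):

```scala
// Index into the one-dimensional, column-major backing array of a
// dense matrix with numRows rows.
def entry(data: Array[Double], numRows: Int, row: Int, col: Int): Double =
  data(col * numRows + row)
```

For the 3x2 matrix above, `entry(data, 3, 0, 1)` returns 2.0, the first element of the second column.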

Distributed matrix:
A distributed matrix has long row and column indices and double-typed values, stored distributively in one or more RDDs.

The most basic is RowMatrix: a row-oriented distributed matrix whose row indices have no specific meaning. It represents all rows through an RDD, where each row is a local vector.

RowMatrix creation:

```scala
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// A RowMatrix can be created from an RDD[Vector]
val mat: RowMatrix = new RowMatrix(rows)
// Get the size of the RowMatrix
val r = mat.numRows()
val c = mat.numCols()
```

Row index matrix IndexedRowMatrix: similar to RowMatrix, but its row indices are meaningful and can be used to retrieve rows.

How to create:

```scala
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// An IndexedRowMatrix can be created from an RDD[IndexedRow]. IndexedRow is just a
// wrapper around (Long, Vector), i.e. it needs one Long row index more than a RowMatrix
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
// Get the size of the IndexedRowMatrix
val r = mat.numRows()
val c = mat.numCols()
// Dropping the row indices of an IndexedRowMatrix turns it into a RowMatrix
val rowMatrix = mat.toRowMatrix()
```

Coordinate matrix CoordinateMatrix: its entries form an RDD, where each entry is an (i: Long, j: Long, value: Double) triple; i is the row index, j is the column index, and value is the corresponding data. It is generally used only when the matrix is large and sparse.

A CoordinateMatrix is created as follows:

```scala
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// A CoordinateMatrix can be created from an RDD[MatrixEntry]. MatrixEntry is just a
// wrapper around (Long, Long, Double)
val mat: CoordinateMatrix = new CoordinateMatrix(entries)
// Get the size of the CoordinateMatrix
val r = mat.numRows()
val c = mat.numCols()
// Convert it to an IndexedRowMatrix; the rows of this IndexedRowMatrix are sparse
val indexedRowMatrix = mat.toIndexedRowMatrix()
```

In fact, going from RowMatrix through IndexedRowMatrix to CoordinateMatrix is a step-by-step refinement. All three use an RDD to hold their entries; only the entry type differs:
Each entry of a RowMatrix is a local vector.
Each entry of an IndexedRowMatrix is a long row index plus a local vector.
Each entry of a CoordinateMatrix is two long indices (row and column) plus a double value.

All three are created in a similar way.
A RowMatrix is created from an RDD[Vector]; a Vector is essentially an array of doubles, and the vectors are turned into an RDD.

An IndexedRowMatrix is created from an RDD[IndexedRow]; IndexedRow is a wrapper around (Long, Vector), so if you can create a RowMatrix from vectors, creating an IndexedRowMatrix from IndexedRows is no harder.

A CoordinateMatrix is created from an RDD[MatrixEntry]; MatrixEntry is even simpler, just a wrapper around (Long, Long, Double), with no vector needed at all.

That is really all there is to the three; don't be intimidated by their names.
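As a plain-Scala illustration of the relationship (no Spark; `groupByRow` is a made-up helper), grouping CoordinateMatrix-style (row, col, value) triples by their row index is conceptually what `toIndexedRowMatrix` does:

```scala
// Group (row, col, value) entries by row index, yielding for each row
// the sparse (col, value) pairs that an indexed sparse row would hold.
def groupByRow(entries: Seq[(Long, Long, Double)]): Map[Long, Seq[(Long, Double)]] =
  entries.groupBy(_._1).map { case (row, es) => row -> es.map(e => (e._2, e._3)) }
```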

The above covers the basic concepts and data types in MLlib. For more related operations, such as summary statistics over matrices, correlation computation, stratified sampling, hypothesis testing, and random data generation, please refer to the official documentation (in practice, MLlib provides similar static utility classes whose methods you simply call).

The following examples demonstrate three MLlib algorithms: linear regression, KMeans, and collaborative filtering.

Linear regression:

The MLlib-specific data type used in this example is LabeledPoint.

The test data is as follows:

Test data

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object LinearRegression {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: <master> <input>")
      System.exit(1)
    }
    val conf = new SparkConf().setMaster(args(0)).setAppName("LinearRegression")
    val sc = new SparkContext(conf)
    // Read the test data from HDFS and convert each line to a LabeledPoint
    val data = sc.textFile(args(1)).map { line =>
      val parts = line.split(",")
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" ").map(_.toDouble)))
    }
    // Set the number of iterations for the algorithm
    val numIterations = 100
    // Use the train method of LinearRegressionWithSGD, passing in the
    // LabeledPoint data, to train and obtain a model
    val model = LinearRegressionWithSGD.train(data, numIterations)
    // Use the model's predict method, feeding in each LabeledPoint's features
    // (the value part) as prediction input, and return the prediction together
    // with the LabeledPoint's label (class label) as a tuple
    val result = data.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    // Output the tuples
    result.foreach(println)
    // Compute the model's evaluation metric, the mean squared error (MSE)
    val MSE = result.map { case (v, p) => math.pow(v - p, 2) }.mean()
    println("Train result MSE: " + MSE)
  }
}
```
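The MSE computed at the end is just the mean of the squared differences between labels and predictions. A minimal plain-Scala sketch of that metric (no Spark; `mse` is a hypothetical helper):

```scala
// Mean squared error over (label, prediction) pairs.
def mse(pairs: Seq[(Double, Double)]): Double =
  pairs.map { case (v, p) => math.pow(v - p, 2) }.sum / pairs.length
```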

Kmeans algorithm:

The MLlib-specific data type used in the KMeans algorithm is Vector.

Test data

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object Kmeans {
  def main(args: Array[String]) {
    if (args.length < 3) {
      System.err.println("Usage: <master> <input> <output>")
      System.exit(1)
    }
    val conf = new SparkConf().setMaster(args(0)).setAppName("Kmeans")
    val sc = new SparkContext(conf)
    // Read the data and convert each line to a dense vector
    val data = sc.textFile(args(1)).map { line =>
      Vectors.dense(line.split(" ").map(_.toDouble))
    }
    // Instantiate the KMeans class, used to configure and run the algorithm
    val km = new KMeans()
    // Set the number of cluster centers to 2 and the maximum number of
    // iterations, then call run with the training data
    val model = km.setK(2).setMaxIterations(20).run(data)
    // Print the cluster centers of the resulting model
    println("Cluster num: " + model.k)
    for (i <- model.clusterCenters) {
      println(i.toString)
    }
    println("----------------------------------------")
    // Test the model with custom vectors to see which cluster center each belongs to
    println("Vector 0.2 0.2 0.2 is closest to: " +
      model.predict(Vectors.dense("0.2 0.2 0.2".split(" ").map(_.toDouble))))
    println("Vector 0.25 0.25 0.25 is closest to: " +
      model.predict(Vectors.dense("0.25 0.25 0.25".split(" ").map(_.toDouble))))
    println("Vector 8 8 8 is closest to: " +
      model.predict(Vectors.dense("8 8 8".split(" ").map(_.toDouble))))
    println("----------------------------------------")
    // Pass the training data back into the model as prediction input
    model.predict(data).collect().foreach(println)
    println("----------------------------------------")
    // Save each point with its cluster assignment to HDFS (printing also works)
    data.map { line =>
      val res = model.predict(line)
      line + " clusteringCenter: " + res
    }.saveAsTextFile(args(2))
  }
}
```
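`model.predict` assigns each vector to the cluster center with the smallest squared Euclidean distance. A plain-Scala sketch of that assignment step (no Spark; `nearestCenter` is a made-up helper):

```scala
// Return the index of the center closest (in squared Euclidean
// distance) to the given point.
def nearestCenter(point: Array[Double], centers: Seq[Array[Double]]): Int = {
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  centers.indices.minBy(i => sqDist(point, centers(i)))
}
```

This is why (0.2, 0.2, 0.2) and (0.25, 0.25, 0.25) land in one cluster and (8, 8, 8) in the other.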

Collaborative filtering:

This algorithm uses a Rating data type dedicated to computing collaborative filtering.
Rating is defined as: Rating(user: Int, product: Int, rating: Double)
user: the user ID
product: the product ID (a movie, a commodity, etc.)
rating: the user's rating of the product
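With ALS, the model learns a factor vector per user and per product, and the predicted rating for a (user, product) pair is the dot product of those two vectors. A plain-Scala sketch (no Spark; `predictRating` is a hypothetical helper):

```scala
// Predicted rating as the dot product of user and product factor vectors.
def predictRating(userFactors: Array[Double], productFactors: Array[Double]): Double =
  userFactors.zip(productFactors).map { case (u, p) => u * p }.sum
```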

The test data is as follows:

Test data

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object CF {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: <master> <input>")
      System.exit(1)
    }
    val conf = new SparkConf().setMaster(args(0)).setAppName("Collaborative Filtering")
    val sc = new SparkContext(conf)
    // Read the file and convert each line to a Rating
    val ratings = sc.textFile(args(1)).map(_.split("::") match {
      case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
    })
    // Set the number of latent factors and the number of iterations
    val rank = 10
    val numIterations = 5
    // Call ALS.train, passing in the training data and parameters, to get a model
    val model = ALS.train(ratings, rank, numIterations, 0.01)
    // Convert the training data into (user, item) pairs to use as prediction
    // input (given a (user, item) pair, the model predicts its rating)
    val usersProducts = ratings.map { case Rating(user, item, rate) => (user, item) }
    // Call the model's predict with the (user, item) pairs to get
    // ((user, item), rating) results
    val predictions = model.predict(usersProducts).map {
      case Rating(user, item, rate) => ((user, item), rate)
    }
    // Join the predicted ratings with the original data to inspect
    // the accuracy of the predictions
    val result = ratings.map {
      case Rating(user, item, rate) => ((user, item), rate)
    }.join(predictions)
    result.collect().foreach(println)
  }
}
```

In fact, by adapting the examples in the official documentation, you can call the various algorithms in MLlib very conveniently and quickly. But that is only calling an algorithm library; the principles behind the machine-learning algorithms still need to be deeply understood and mastered. Otherwise, if using MLlib were all it took to apply machine learning, how would you stand out compared with everyone else?

