Introduction to MLbase, the Spark Distributed Machine Learning System: Implementing the K-means Clustering Algorithm with MLlib


1. What is MLbase
MLbase is the machine learning component of the Spark ecosystem. It consists of three parts: MLlib, MLI, and the ML Optimizer.

    • ML Optimizer: this layer aims to automate the task of ML pipeline construction. The optimizer solves a search problem over the feature extractors and ML algorithms included in MLI and MLlib. The ML Optimizer is currently under active development.
    • MLI: an experimental API for feature extraction and algorithm development that introduces high-level ML programming abstractions. A prototype of MLI has been implemented against Spark, and it serves as a testbed for MLlib.
    • MLlib: Apache Spark's distributed ML library. MLlib was initially developed as part of the MLbase project, and the library is currently supported by the Spark community. Many features in MLlib have been borrowed from the ML Optimizer and MLI, e.g., the model and algorithm APIs, multimodel training, sparse data support, and the design of local/distributed matrices.

2. The MLbase Machine Learning Process
Users can easily use MLbase to process their own data. Most machine learning algorithms consist of two parts: training a model, and using it to predict unknown samples. The machine learning packages in Spark follow the same pattern.

Spark divides a machine learning algorithm into two modules, as sketched below:

    • Training module: learns model parameters from the training samples.
    • Prediction module: is initialized with the model parameters, predicts on test samples, and outputs the predicted values.
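
Here is a minimal sketch of this two-module split, using MLlib's KMeans as the example (any MLlib algorithm follows the same shape; the helper names trainModel and predictSample are ours for illustration, not part of Spark's API):

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Training module: learn model parameters from the training samples.
def trainModel(samples: RDD[Vector]): KMeansModel =
  KMeans.train(samples, 2, 20)  // k = 2 clusters, 20 iterations

// Prediction module: initialize with the trained model, then predict a test sample.
def predictSample(model: KMeansModel, sample: Vector): Int =
  model.predict(sample)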

MLbase is used through Scala, a functional programming language, which makes it easy to implement common machine learning algorithms with MLlib.
For example, to run a classification, we only need to write the following code:

var X = load("some_data", 2 to 10)
var y = load("some_data", 1)
var (fn-model, summary) = doClassify(X, y)

Code interpretation: X is the data set to be classified, y is the column of category labels taken from the data set, and doClassify() performs the classification.

There are two main benefits to this approach:

    1. Each step of the data processing is very clear and can easily be visualized;
    2. For the user, processing with ML algorithms is transparent: there is no need to care about or choose the classification method, whether it is SVM or AdaBoost, whether the SVM kernel is linear or RBF, how the original and scaled parameters should be tuned, and so on.

ML Optimizer, one of the three major components of MLbase, selects the machine learning algorithms and related parameters (from those already implemented internally) that it deems most appropriate, uses them to process the data entered by the user, and returns the resulting model or other analysis output. The overall process flow is as follows:

    1. A user submits a task such as doClassify(X, y), collaborative filtering doCollabFilter(X, y), or a graph computation such as findTopKDegreeNodes(G, k = 1000). The task is first handled by the Parser and then handed to the LLP. The LLP is the Logical Learning Plan: the logical selection process that decides which algorithms to use, what feature extraction should be done, which parameters should be selected, and how the data set should be split.
    2. The LLP is then handed to the Optimizer. The Optimizer is the core of MLbase: it splits the data into several parts, computes results on each part with different algorithms and parameters, and sees which combination gives the best result (note that this optimum is preliminary). When finished, the Optimizer hands the result to the PLP.
    3. The PLP is the Physical Learning Plan, the physical (actual) execution program: MLbase's master assigns tasks to specific slaves for the final execution of the selected algorithm, the results are computed, and the learned model is returned.
    4. So the overall flow of a task is Parser → LLP → Optimizer → PLP → Result/Model: first plan logically, selecting a few existing algorithms suitable for the scenario; let the Optimizer try them out; then physically execute the solution considered optimal at the time and return the result. A sketch of these stages follows this list.
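
To make this flow concrete, the following hypothetical Scala sketch models the stages as plain types and functions. All names here are illustrative only and are not MLbase's actual API:

// Hypothetical types modeling the MLbase stages; illustrative only, not the real API.
case class Task(description: String)                 // e.g. "doClassify(X, y)"
case class LogicalPlan(candidates: Seq[String])      // candidate algorithms, features, parameters
case class PhysicalPlan(algorithm: String)           // the choice considered optimal so far
case class Model(info: String)

def parse(task: Task): LogicalPlan =                 // Parser -> LLP: pick plausible candidates
  LogicalPlan(Seq("SVM(linear)", "SVM(rbf)", "AdaBoost"))
def optimize(plan: LogicalPlan): PhysicalPlan =      // Optimizer: test candidates on data splits, keep the best
  PhysicalPlan(plan.candidates.head)
def execute(plan: PhysicalPlan): Model =             // PLP: the master schedules slaves and returns the model
  Model("trained with " + plan.algorithm)

val model = execute(optimize(parse(Task("doClassify(X, y)"))))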

MLbase does more than return results to the user: during the LLP and Optimizer stages it also stores intermediate results and features, keeps searching for and testing better algorithms and parameters, and notifies the user of improvements. The algorithms implemented within the LLP can also be extended.

In short, MLbase automatically finds a suitable algorithm, performs selection and optimization automatically, and is also extensible.

3. Scala Implementation of the K-means Algorithm

3.1 What is the K-means Algorithm?
K-means is a cluster analysis algorithm: it groups data by repeatedly assigning each object to its nearest seed point (cluster center) and recomputing each center as the mean of its assigned objects.
Specifically, given the number of clusters k and a database containing n data objects, it outputs k clusters that satisfy the minimum-variance criterion.
3.2 Basic Steps of the K-means Algorithm
(1) Arbitrarily select k objects from the n data objects as the initial cluster centers;
(2) Compute the distance between each object and each cluster center (the mean of that cluster's objects), and assign each object to the center with the minimum distance;
(3) Recalculate the mean (center object) of each (changed) cluster;
(4) Compute the standard measure function; if it satisfies the convergence condition, the algorithm terminates; otherwise go back to step (2).
The time complexity of the algorithm is bounded by O(n*k*t), where t is the number of iterations and the n data objects are divided into k clusters.
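
For intuition, here is a minimal single-machine Scala sketch of steps (1)-(4), using a fixed iteration count instead of a convergence test; the distributed MLlib version is shown in section 3.3:

object SimpleKMeans {
  type Point = Array[Double]

  // Euclidean distance between two points.
  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Component-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point =
    points.transpose.map(_.sum / points.length).toArray

  def kmeans(data: Seq[Point], k: Int, maxIter: Int): Seq[Point] = {
    var centers = data.take(k)                                           // (1) pick k initial centers
    for (_ <- 1 to maxIter) {                                            // t iterations gives the O(n*k*t) bound
      val clusters = data.groupBy(p => centers.minBy(c => dist(p, c)))   // (2) assign each object to its nearest center
      centers = centers.map(c => clusters.get(c).map(mean).getOrElse(c)) // (3) recompute each cluster mean
    }                                                                    // (4) convergence test omitted for brevity
    centers
  }
}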

3.3 Implementing K-means with MLlib

To implement the K-means algorithm with MLlib, we use MLlib's KMeans to train a model; new data can then be classified and predicted. See the code and output results below.

Scala code:

package com.hq

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkContext, SparkConf}

object KMeansTest {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println("Usage: <file>")
      System.exit(1)
    }

    val conf = new SparkConf()
    val sc = new SparkContext(conf)

    // Load and parse the data: one space-separated vector per line.
    val data = sc.textFile(args(0))
    val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))

    // Cluster the data into two classes using 20 iterations of KMeans.
    val numClusters = 2
    val numIterations = 20
    val clusters = KMeans.train(parsedData, numClusters, numIterations)

    println("------Predict the existing line in the analyzed data file: " + args(0))
    println("Vector 1.0 2.1 3.8 belongs to clustering " + clusters.predict(Vectors.dense("1.0 2.1 3.8".split(' ').map(_.toDouble))))
    println("Vector 5.6 7.6 8.9 belongs to clustering " + clusters.predict(Vectors.dense("5.6 7.6 8.9".split(' ').map(_.toDouble))))
    println("Vector 3.2 3.3 6.6 belongs to clustering " + clusters.predict(Vectors.dense("3.2 3.3 6.6".split(' ').map(_.toDouble))))
    println("Vector 8.1 9.2 9.3 belongs to clustering " + clusters.predict(Vectors.dense("8.1 9.2 9.3".split(' ').map(_.toDouble))))
    println("Vector 6.2 6.5 7.3 belongs to clustering " + clusters.predict(Vectors.dense("6.2 6.5 7.3".split(' ').map(_.toDouble))))

    println("-------Predict the non-existent line in the analyzed data file:----------------")
    println("Vector 1.1 2.2 3.9 belongs to clustering " + clusters.predict(Vectors.dense("1.1 2.2 3.9".split(' ').map(_.toDouble))))
    println("Vector 5.5 7.5 8.8 belongs to clustering " + clusters.predict(Vectors.dense("5.5 7.5 8.8".split(' ').map(_.toDouble))))

    // Evaluate the clustering by computing the Within Set Sum of Squared Errors.
    println("-------Evaluate clustering by computing Within Set Sum of Squared Errors:-----")
    val WSSSE = clusters.computeCost(parsedData)
    println("Within Set Sum of Squared Errors = " + WSSSE)

    sc.stop()
  }
}

3.4 Running in Spark Standalone Cluster Mode

① Package the project into a jar in IDEA (if you have forgotten how, see Spark: Implementing WordCount in Scala and Java) and upload it to the user directory /home/ebupt/test/kmeans.jar.

② Prepare the training sample data at hdfs://eb170:8020/user/ebupt/kmeansdata; the content is as follows:

[ebupt@eb170 ~]$ hadoop fs -cat ./kmeansdata

1.0 2.1 3.8
5.6 7.6 8.9
3.2 3.3 6.6
8.1 9.2 9.3
6.2 6.5 7.3
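
If the sample file only exists locally, it can first be uploaded to HDFS with something like the following (assuming the local file is also named kmeansdata):

[ebupt@eb170 ~]$ hadoop fs -put kmeansdata /user/ebupt/kmeansdata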

③ Submit and run with spark-submit

[ebupt@eb174 test]$ spark-submit --master spark://eb174:7077 --name KMeansWithMLlib --class com.hq.KMeansTest --executor-memory 2G --total-executor-cores 4 ~/test/kmeans.jar hdfs://eb170:8020/user/ebupt/kmeansdata

Summary of output results:

------Predict the existing line in the analyzed data file: hdfs://eb170:8020/user/ebupt/kmeansdata
Vector 1.0 2.1 3.8 belongs to clustering 0
Vector 5.6 7.6 8.9 belongs to clustering 1
Vector 3.2 3.3 6.6 belongs to clustering 0
Vector 8.1 9.2 9.3 belongs to clustering 1
Vector 6.2 6.5 7.3 belongs to clustering 1
-------Predict the non-existent line in the analyzed data file:----------------
Vector 1.1 2.2 3.9 belongs to clustering 0
Vector 5.5 7.5 8.8 belongs to clustering 1
-------Evaluate clustering by computing Within Set Sum of Squared Errors:-----
Within Set Sum of Squared Errors = 16.393333333333388
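
A note on choosing k: the run above fixes numClusters = 2, but in practice k is often unknown in advance. One common heuristic (not part of the original run) is to train for several values of k and compare the WSSSE, looking for an "elbow" in the curve:

// Hypothetical sweep over k, reusing parsedData and numIterations from the code above.
for (k <- 1 to 6) {
  val model = KMeans.train(parsedData, k, numIterations)
  println("k = " + k + ", WSSSE = " + model.computeCost(parsedData))
}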

4. MLbase Summary

This article has introduced how to implement a machine learning algorithm with MLbase and briefly described MLbase's design ideas. In general, the core of MLbase is the ML Optimizer, which transforms declarative tasks into complex learning plans and outputs the optimal model and computed results.
MLbase differs from other machine learning systems such as Weka and Mahout:

    • MLbase is distributed, whereas Weka is a single-machine system.
    • MLbase is automated, whereas Weka and Mahout both require users to have machine learning skills and to choose the algorithms and parameters they want themselves.
    • MLbase provides interfaces at different levels of abstraction for extending the ML algorithms.

