Original address: https://www.ibm.com/developerworks/cn/opensource/os-cn-spark-practice4/
Introduction
Machine learning is a technical direction that excites many practitioners in the computing field. However, learning and applying machine learning algorithms to data is a complex task that requires a solid background in subjects such as probability theory, mathematical statistics, numerical approximation, and optimization theory. Machine learning aims to give computers the ability to learn and imitate the way humans do, and it is the core idea and method behind realizing artificial intelligence.

Traditional machine learning algorithms, limited by the technology and single-machine storage of their time, could only be applied to small amounts of data. With the advent of distributed file systems such as HDFS (Hadoop Distributed File System), storing massive data sets became possible. However, because of the limitations of MapReduce itself, implementing distributed machine learning algorithms on MapReduce is time-consuming and disk-intensive. Parameter learning in machine learning is an iterative computation: the result of one pass becomes the input of the next iteration. With MapReduce, intermediate results can only be written to disk and read back for the next computation, which is a fatal performance bottleneck for algorithms that iterate frequently. Spark, being based on in-memory computing, is naturally suited to iterative computation; readers of the earlier articles in this series should already have a good understanding of this.

Even so, implementing a distributed machine learning algorithm remains a challenging task for the average developer. MLlib is designed to make machine learning on massive data simpler: it provides distributed implementations of commonly used machine learning algorithms, so a developer who knows the Spark basics, understands the principle of an algorithm, and knows the meaning of the relevant method parameters can easily implement a machine learning process over massive data by calling the appropriate API. Of course, ETL of the raw data, feature extraction, parameter tuning, and optimization of the learning process still require sufficient domain knowledge and sensitivity to the data, which is usually where experience shows. The focus of this article is to introduce how to use the K-means algorithm provided by the MLlib machine learning library to do cluster analysis, a meaningful process that should be instructive to readers, especially beginners.
Introduction to the Spark Machine learning Library
The Spark machine learning library provides implementations of commonly used machine learning algorithms, covering clustering, classification, regression, collaborative filtering, and dimensionality reduction. Doing machine learning with the library can be very simple: usually you only need to preprocess the raw data and then call the appropriate API directly. But to choose the right algorithm and analyze the data efficiently and accurately, you may still need to dig into the principles of the algorithms and the meaning of the parameters exposed by the Spark MLlib API.
It should be mentioned that, starting with the 1.2 release, the Spark machine learning library is divided into two packages:
Spark MLlib has a long history and was already included before the 1.0 release. Its algorithm implementations are based on the original RDD API, which actually makes it easier to get started with: if you already have machine learning experience, you only need to familiarize yourself with the MLlib API to begin analyzing data. However, it is difficult to build a complete, complex machine learning pipeline from the tools this package provides alone.
Spark ML Pipeline, introduced in Spark 1.2, has now graduated from its alpha phase and become a usable, stable new machine learning library. ML Pipeline makes up for the shortcomings of the original MLlib library: it provides a DataFrame-based machine learning workflow API suite with which we can easily combine data processing, feature transformation, regularization, and multiple machine learning algorithms into a single, complete machine learning pipeline. Obviously, this new approach gives us more flexibility and better matches the character of a real machine learning process.
According to the official documentation, Spark ML Pipeline is the recommended way to do machine learning, but it will not replace the original MLlib library in the short term, since MLlib already contains a rich set of stable algorithm implementations and some ML Pipeline components are themselves built on MLlib. Besides, in my view not every machine learning task needs to be built into a pipeline: sometimes the raw data is tidy and complete and a single algorithm achieves the goal. We do not need to complicate things; the simplest, easiest-to-understand approach is often the right choice.
This article, based on Spark 1.5, shows the reader the process of clustering with the MLlib API. Readers will find that developing a machine learning application with the MLlib API is relatively simple, and I believe this article will help readers build confidence and master the basic method for their follow-up study and work.
Principle of K-means Clustering algorithm
Cluster analysis is an unsupervised learning process, generally used to group data objects according to their feature attributes, and it is often applied to customer segmentation, fraud detection, image analysis, and other fields. K-means is probably the best-known and most commonly used clustering algorithm: its principle is relatively easy to understand, its clustering results are good, and it is widely used.
Like many machine learning algorithms, the K-means algorithm is iterative; its main steps are as follows:
- The first step is to select K points as the initial cluster centers.
- The second step is to compute the distance from every remaining point to each cluster center and assign each point to the cluster whose center is nearest. Several distance measures can be used here; the most common is the Euclidean distance, whose formula is d(C, X) = sqrt((c1 − x1)² + (c2 − x2)² + … + (cn − xn)²), where C denotes a center point and X denotes any non-center point.
- The third step is to recompute the mean of the points in each cluster and use it as the new cluster center.
- Finally, repeat steps two and three until the cluster centers no longer change, the algorithm reaches the predetermined maximum number of iterations, or the change in the cluster centers falls below a preset threshold. (A minimal sketch of a single iteration is given right after this list.)
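To make steps two and three concrete, here is a minimal, self-contained Scala sketch of one K-means iteration. It is illustrative only and is not the MLlib implementation; the object name, type alias, and helper names are made up for this example.

object KMeansStepSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // One iteration of steps two and three: assign every point to its nearest center,
  // then recompute each center as the mean of the points assigned to it.
  def iterate(points: Seq[Point], centers: Seq[Point]): Seq[Point] = {
    val assignments = points.map(p => centers.indices.minBy(i => squaredDistance(p, centers(i))))
    centers.indices.map { i =>
      val members = points.zip(assignments).collect { case (p, a) if a == i => p }
      if (members.isEmpty) centers(i) // keep the old center if no point was assigned to this cluster
      else members.transpose.map(dim => dim.sum / members.size).toArray
    }
  }
}

In a real run, iterate would be called repeatedly until the centers stop moving (or a maximum number of iterations is reached), which is exactly the stopping rule described above.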
In practical application, the K-means algorithm has two problems that have to be confronted and overcome.
- The selection of the number of clusters K. Choosing K is a subtle step that requires some judgment; later in this article we describe how to use the tools offered by Spark to help choose K.
- The selection of the initial cluster centers. Choosing different initial centers may lead to different clustering results.
Spark MLlib's K-means implementation addresses the choice of initial cluster centers with an algorithm named k-means||, a parallel variant of the classic k-means++. The k-means++ algorithm follows a basic principle when selecting initial points: the initial cluster centers should be as far away from each other as possible. Its basic steps are as follows:
- The first step is to randomly select a point from the data set X as the first initial center.
- The second step is to compute, for every point in the data set, the distance D(x) to its nearest already-selected center.
- The third step is to select the next center point such that D(x) is as large as possible (in the standard k-means++ formulation, each candidate is chosen with probability proportional to D(x)²).
- The fourth step is to repeat steps two and three until all K initial centers have been selected. (A rough sketch of this selection procedure follows this list.)
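The following is a rough, self-contained Scala sketch of the k-means++ style selection described above. It is for illustration only; MLlib's actual k-means|| implementation samples several candidates per step in parallel, and all names here are made up for the example.

import scala.util.Random

object KMeansPlusPlusSketch {
  type Point = Array[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Choose k initial centers so that they tend to be far away from each other.
  def chooseInitialCenters(points: IndexedSeq[Point], k: Int, rng: Random): IndexedSeq[Point] = {
    // Step one: pick the first center uniformly at random.
    var centers = IndexedSeq(points(rng.nextInt(points.size)))
    while (centers.size < k) {
      // Step two: for every point, the squared distance to its nearest chosen center.
      val d = points.map(p => centers.map(c => squaredDistance(p, c)).min)
      // Step three: sample the next center with probability proportional to that squared distance.
      val threshold = rng.nextDouble() * d.sum
      val cumulative = d.scanLeft(0.0)(_ + _).tail
      val next = cumulative.indexWhere(_ >= threshold)
      centers = centers :+ points(if (next >= 0) next else points.size - 1)
    }
    centers
  }
}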
The MLlib implementation of K-means
The implementation class of the K-means algorithm in Spark MLlib (KMeans.scala) has the following parameters, described below.
Figure 1. MLlib K-means algorithm Implementation class preview
With the following default constructor, we can see that these tunable parameters have the following initial values.
Figure 2. MLlib K-means algorithm Parameter initial value
The meanings of the parameters are explained as follows:
- k indicates the number of clusters expected.
- maxIterations represents the maximum number of iterations for a single run of the algorithm.
- runs indicates the number of times the algorithm is run. The K-means algorithm does not guarantee a globally optimal clustering result, so running it several times on the target data set helps return the best result found.
- initializationMode indicates how the initial cluster centers are selected; currently random selection and k-means|| are supported, with k-means|| as the default.
- initializationSteps indicates the number of steps used in the k-means|| method.
- epsilon represents the convergence threshold for the K-means iterations.
- seed represents the random seed used at cluster initialization. (A configuration sketch using these parameters follows this list.)
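As a hedged illustration, and assuming the Spark 1.5 MLlib setter API, the parameters above can also be configured explicitly on a KMeans instance instead of being passed to KMeans.train; the parameter values below are placeholders.

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Sketch only: configure the tunable parameters on a KMeans instance and fit a model.
def buildModel(data: RDD[Vector]): KMeansModel = {
  new KMeans()
    .setK(8)                                         // expected number of clusters
    .setMaxIterations(30)                            // max iterations for a single run
    .setRuns(3)                                      // how many times to run the algorithm
    .setInitializationMode(KMeans.K_MEANS_PARALLEL)  // "k-means||" (the default) or KMeans.RANDOM
    .setSeed(1L)                                     // random seed used at initialization
    .run(data)                                       // fit the model and return a KMeansModel
}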
Typically, we first call the KMeans.train method to cluster the data set; this method returns a KMeansModel instance, and we can then use the KMeansModel.predict method to predict which cluster a new data point belongs to. This is a very practical feature.
The KMeans.train method has several overloads; here we show the one with the most complete set of parameters.
Figure 3. Kmeans.train Method Preview
The KMeansModel.predict method accepts different kinds of parameters, either a single vector or an RDD of vectors, and returns the index of the cluster to which the input belongs.
Figure 4. Kmeansmodel.predict Method Preview
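The short sketch below illustrates how these two methods fit together: train a model with KMeans.train, then call predict on a single vector or on a whole RDD. It is illustrative only; "sc" is an existing SparkContext, and the file path and numeric values are placeholders.

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Parse tab-separated lines into dense vectors and cache them for iterative training.
val data = sc.textFile("hdfs:///path/to/training_data.txt")
  .map(line => Vectors.dense(line.split("\t").map(_.trim).map(_.toDouble)))
  .cache()

// Train a model with k = 8 clusters and at most 30 iterations.
val model: KMeansModel = KMeans.train(data, 8, 30)

// Predict the cluster index of a single vector ...
val oneIndex: Int = model.predict(Vectors.dense(2.0, 3.0, 12669.0, 9656.0, 7561.0, 214.0, 2674.0, 1338.0))

// ... or of a whole RDD of vectors at once.
val allIndices = model.predict(data)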
Introduction to clustering test data sets
In this article we use the Wholesale customers data set from the UCI Machine Learning Repository as the target data set. UCI is a download site for machine learning test data and contains data sets for clustering, classification, regression, and other machine learning problems.
The Wholesale customers data set records the annual spending, across several product categories, of clients of a wholesale distributor. To make processing easier, the original CSV file has been converted into two text files: a training data set and a test data set.
Figure 5. Customer Consumption data Format preview
Readers can clearly see the meaning of each column from the header line, and can of course visit the UCI website to learn more about this data set. Although UCI's data can be downloaded and used freely, we note here that the data set remains the property of UCI and its original providing organization or company.
Case studies and coding implementations
In this example, we treat each column of the target customers' spending data as a feature and cluster the data set accordingly. The implementation steps are as follows.
Listing 1. Cluster analysis to implement class source code
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

object KMeansClustering {
  def main(args: Array[String]) {
    if (args.length < 5) {
      println("Usage: KMeansClustering trainingDataFilePath testDataFilePath numClusters numIterations runTimes")
      sys.exit(1)
    }

    val conf = new SparkConf().setAppName("Spark MLlib Exercise: K-Means Clustering")
    val sc = new SparkContext(conf)

    /**
     * Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
     * 2       3      12669 9656 7561    214    2674             1338
     * 2       3      7057  9810 9568    1762   3293             1776
     * 2       3      6353  8808 7684    2405   3516             7844
     */
    val rawTrainingData = sc.textFile(args(0))
    // Skip the header line and parse every tab-separated column into a Double
    val parsedTrainingData = rawTrainingData.filter(!isColumnNameLine(_)).map(line => {
      Vectors.dense(line.split("\t").map(_.trim).filter(!"".equals(_)).map(_.toDouble))
    }).cache()

    // Cluster the data into classes using KMeans
    val numClusters = args(2).toInt
    val numIterations = args(3).toInt
    val runTimes = args(4).toInt
    var clusterIndex: Int = 0
    val clusters: KMeansModel = KMeans.train(parsedTrainingData, numClusters, numIterations, runTimes)

    println("Cluster Number:" + clusters.clusterCenters.length)
    println("Cluster Centers Information Overview:")
    clusters.clusterCenters.foreach(x => {
      println("Center Point of Cluster " + clusterIndex + ":")
      println(x)
      clusterIndex += 1
    })

    // Check which cluster each test data point belongs to based on the clustering result
    val rawTestData = sc.textFile(args(1))
    val parsedTestData = rawTestData.map(line => {
      Vectors.dense(line.split("\t").map(_.trim).filter(!"".equals(_)).map(_.toDouble))
    })
    parsedTestData.collect().foreach(testDataLine => {
      val predictedClusterIndex: Int = clusters.predict(testDataLine)
      println("The data " + testDataLine.toString + " belongs to cluster " + predictedClusterIndex)
    })

    println("Spark MLlib K-means clustering test finished.")
  }

  private def isColumnNameLine(line: String): Boolean = {
    if (line != null && line.contains("Channel")) true
    else false
  }
}
The sample program accepts five command-line parameters:
- Training Data Set File path
- Test Data Set File path
- Number of clusters
- Number of iterations of the K-means algorithm
- Number of times the K-means algorithm is run
Run the sample program
As with other articles in this series, we still choose to use HDFS to store data files. Before running the program, we need to upload the training and test data set mentioned earlier to HDFS.
Figure 6. HDFS Directory for test data
Listing 2. Sample program Run command
./spark-submit --class com.ibm.spark.exercise.mllib.KMeansClustering \
 --master spark://<spark_master_node_ip>:7077 \
 --num-executors 6 --driver-memory 3g --executor-memory 512m --total-executor-cores 6 \
 /home/fams/spark_exercise-1.0.jar hdfs://
Figure 7. K-means clustering sample program run results
How to choose K
As mentioned earlier, the choice of K is key to the K-means algorithm. Spark MLlib provides a computeCost method in the KMeansModel class that evaluates clustering quality by computing the sum of squared distances from all data points to their nearest centers. In general, for the same number of iterations and the same number of runs, the smaller this value, the better the clustering. In practice, however, we also have to consider how interpretable the clustering result is, and cannot blindly choose the K with the smallest computeCost value.
Listing 3. K selection sample code snippet
// Try a range of candidate K values and print the cost returned by computeCost for each.
// The iteration count and run count passed to KMeans.train below are illustrative choices;
// parsedTrainingData is the RDD prepared in Listing 1.
val ks: Array[Int] = Array(3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
ks.foreach(cluster => {
  val model: KMeansModel = KMeans.train(parsedTrainingData, cluster, 30, 3)
  val ssd = model.computeCost(parsedTrainingData)
  println("Sum of squared distances of points to their nearest center when k=" + cluster + " -> " + ssd)
})
Figure 8. K selection sample program run results
From the run results we can see that at k=9 the cost value still fluctuates but keeps decreasing overall, so we choose 8, this critical point, as the number of clusters K. Of course, you can run the program a few more times to find a stable K value. Theoretically, the larger K is, the smaller the clustering cost; in the limiting case where every point is its own cluster the cost is 0, but that is obviously not a meaningful clustering result.
Conclusion
Through this article, the reader has gained a preliminary understanding of Spark's machine learning library, the basic principles of the K-means algorithm, and how to build a machine learning application on top of Spark MLlib. Building a machine learning application is a complex process: we usually need to preprocess the data, extract features, and clean the data before we can use an algorithm to analyze it. Spark MLlib differs from traditional machine learning tools in that it provides an easy-to-use API and, more importantly, benefits from Spark's efficiency at processing big data and its unique advantages in iterative computation. Although the test data set used in this article is small and does not reflect a real big data scenario, it is sufficient for mastering the basic principles. If readers have a larger data set, they can easily generalize this article's test program to a big data clustering scenario, because the Spark MLlib programming model stays the same; only the way data is read and processed differs slightly. I hope readers find something of interest in this article and that it helps their further study. If you encounter problems or find deficiencies while reading, please leave a comment at the end of the article so we can learn from each other. Thank you.