This article has two purposes: first, to give a quick performance comparison between the R language and Spark; second, to introduce Spark's machine learning library.
Background Introduction
Because the R language itself is single-threaded, comparing Spark and R purely on performance may not be entirely fair. Even so, some of the numbers below will be of interest to anyone who has run into these problems.
Have you ever thrown a machine-learning problem at R and then waited for hours, with no viable alternative but to wait patiently? If so, it is time to take a look at Spark's machine learning library, which covers most of the functionality of R and outperforms it in both data transformation and raw performance.
I used two different machine learning toolkits, the R language and Spark's machine learning library, to solve the same problem. To keep the comparison fair, both ran on the same hardware and operating system, and Spark ran in standalone mode without any cluster configuration.
Before discussing the specifics, a brief word about Revolution R. As an enterprise edition of the R language, Revolution R attempts to compensate for R's single-threaded limitation. But it only runs on proprietary software from Revolution Analytics, so it may not be the ideal long-term solution. Getting an extension of Microsoft's Revolution Analytics software can also complicate things, for example through licensing issues.
As a result, a community-supported open source tool such as Spark may be a better choice than an enterprise edition of R.
Dataset and Problem
The analysis uses the Digit Recognizer dataset from Kaggle [Translator's note: Kaggle is a data-analysis competition platform, https://www.kaggle.com/], which contains grayscale images of handwritten digits from 0 to 9.
Each image is 28 pixels high and 28 pixels wide, 784 pixels in total. Each pixel holds a darkness value: the higher the value, the darker the pixel. Pixel values are integers from 0 to 255 inclusive. Each record therefore has 785 columns; the first column, called the "label", is the digit the user wrote, and the remaining 784 columns are the pixel values.
The goal of the analysis is to get a model that can recognize numbers from pixel values.
One caveat about this choice of dataset is that, in terms of data volume, it is not really a big data problem.
The Comparison
For this problem, the following machine learning steps were carried out to arrive at a predictive model:
- Principal component analysis (PCA) and linear discriminant analysis (LDA) are performed on the dataset to extract the main features (the feature engineering step).
- Binary logistic regression is run on every pair of digits, classifying them based on their pixel values and the feature variables obtained from PCA and LDA.
- Multinomial logistic regression is run on the full dataset for multi-class classification, again using pixel values and the PCA/LDA feature variables. A naive Bayes classifier and a decision tree classifier are also trained to classify the digits.
Before these steps, I split the labeled data into a training set and a test set, so that the models could be trained on one and their accuracy verified on the other; a minimal sketch of how this can be done in Spark is shown below.
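As an illustration (not the article's own code), here is a minimal sketch of loading the Kaggle CSV into an RDD of LabeledPoint and splitting it, using Spark's RDD-based MLlib API in the spark-shell, where `sc` is predefined; the file path and split ratio are assumptions.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Assumed path to Kaggle's train.csv; the first column is the label, the rest are 784 pixel values.
val raw = sc.textFile("data/train.csv")
val header = raw.first()
val data = raw.filter(_ != header).map { line =>
  val cols = line.split(",").map(_.toDouble)
  LabeledPoint(cols.head, Vectors.dense(cols.tail))
}

// 80/20 train/test split (the exact ratio used in the article is not stated).
val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
training.cache()
```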
Most of these steps were run in both R and Spark. The detailed comparison below focuses on principal component analysis, the binary logistic regression model, and the naive Bayes classifier.
Principal Component Analysis
The main computational cost of principal component analysis lies in scoring the components. The logical steps are as follows:
- Traverse the data and compute the covariance matrix of the columns, from which an m x k weight matrix is obtained (k is the number of principal components, m is the number of feature variables in the dataset).
- Scoring the n data points then amounts to a matrix multiplication.
- Multiplying the n x m data matrix by the m x k weight matrix yields the n x k principal components: each of the n data points gets k principal component scores. (A sketch of this in Spark is shown after this list.)
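The following is a minimal sketch of how this scoring could be done with Spark MLlib's RowMatrix, assuming the `data` RDD of LabeledPoint from the earlier sketch; it computes the top 9 principal components and then projects the data onto them with a matrix multiplication.

```scala
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Build a distributed row matrix from the 784-dimensional feature vectors.
val mat = new RowMatrix(data.map(_.features))

// Weight matrix: 784 x 9, the top 9 principal components.
val pc = mat.computePrincipalComponents(9)

// Scoring: (42000 x 784) * (784 x 9) = 42000 x 9 principal component scores.
val projected = mat.multiply(pc)
println(s"projected rows: ${projected.numRows()}, cols: ${projected.numCols()}")
```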
In our example, the scoring step multiplies a 42,000 x 784 data matrix by a 784 x 9 weight matrix. Frankly, this computation ran in R for more than 4 hours, while the same operation took Spark only about 10 seconds.
The matrix multiplication involves close to 300 million multiply operations, plus a fair number of lookup operations, so it is impressive that Spark's parallel computing engine completes it in about 10 seconds.
I verified the accuracy of the resulting principal components by checking the variance of the first 9 components; it matched the variance of the first 9 principal components produced by R. This confirms that Spark does not sacrifice precision in exchange for its performance and data-transformation advantages.
Logistic Regression Model
Unlike principal component analysis, in a logistic regression model both training and scoring are computationally intensive operations. Common training schemes for this model involve transposing and inverting matrices over the entire dataset.
Due to this computational complexity, R took a long time to complete training and scoring, 7 hours to be exact, while Spark took only about 5 minutes.
Here I ran the binary logistic regression model on all 45 pairs of digits from 0 to 9, and scored/validated it on the corresponding 45 pairs of test data.
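As an illustration, here is a hedged sketch of training a binary logistic regression model in Spark MLlib for one digit pair, say 0 versus 1, reusing the `training` and `test` RDDs from the earlier sketch; the filtering and solver choice are assumptions, not the article's exact code.

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// Keep only the two digits of interest (labels 0 and 1 are already valid binary labels).
val pairTrain = training.filter(p => p.label == 0.0 || p.label == 1.0)
val pairTest  = test.filter(p => p.label == 0.0 || p.label == 1.0)

// Train a binary logistic regression model with the LBFGS solver.
val lrModel = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(pairTrain)
```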
I also ran the multinomial logistic regression model in Spark as a multi-class classifier, which completed in about 3 minutes. I could not get this to work in R at all, so I have no numbers to compare against.
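The multinomial model over all ten digits can be expressed with the same API; this is a hedged sketch, not the article's exact configuration.

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// Multi-class (10 classes) logistic regression over the full training set.
val multiModel = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(training)
```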
For the models built on the principal components, I used the AUC value [Translator's note: AUC is the area under the ROC curve, a standard metric for classification models] to measure prediction performance on the 45 digit pairs; Spark and R produced the same AUC values for the resulting models.
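For reference, the AUC on a held-out pair can be computed with MLlib's BinaryClassificationMetrics, roughly as sketched below (names carried over from the earlier sketches).

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Clear the default threshold so predict() returns raw scores instead of 0/1 labels.
lrModel.clearThreshold()

// Pair each test point's score with its true label and compute the area under the ROC curve.
val scoreAndLabels = pairTest.map(p => (lrModel.predict(p.features), p.label))
val auc = new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()
println(s"AUC = $auc")
```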
Naive Bayes Classifier
Unlike principal component analysis and logistic regression, the naive Bayes classifier is not computationally intensive. It computes the prior probabilities of the classes and then derives posterior probabilities from the observed data. [Translator's note: a prior probability is based on past experience and analysis, and often serves as the "cause" in "given the cause, find the effect" problems; a posterior probability is the probability re-estimated after observing the "result", the "effect" in "given the effect, find the cause" problems.]
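A minimal sketch of training and evaluating a naive Bayes model in Spark MLlib, under the same assumptions as before (the smoothing parameter is an assumption):

```scala
import org.apache.spark.mllib.classification.NaiveBayes

// Train a naive Bayes classifier with additive (Laplace) smoothing of 1.0.
val nbModel = NaiveBayes.train(training, 1.0)

// Accuracy on the held-out test set.
val nbAccuracy = test.map(p => if (nbModel.predict(p.features) == p.label) 1.0 else 0.0).mean()
println(s"naive Bayes accuracy = $nbAccuracy")
```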
As it turned out, R took a little more than 45 seconds to complete, while Spark took only 9 seconds. As before, the accuracy of the two matched.
I also tried running a decision tree model with Spark's machine learning library, which took about 20 seconds; I could not get this to work in R at all.
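For completeness, a hedged sketch of a decision tree classifier in Spark MLlib (the impurity measure, depth, and bin count below are assumptions):

```scala
import org.apache.spark.mllib.tree.DecisionTree

// Train a decision tree for 10-class digit classification; all features are continuous,
// so the categorical-features map is empty.
val treeModel = DecisionTree.trainClassifier(
  training,
  numClasses = 10,
  categoricalFeaturesInfo = Map[Int, Int](),
  impurity = "gini",
  maxDepth = 10,
  maxBins = 32)
```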
Getting Started with Spark Machine Learning
Enough of the comparison; everything above was achieved with Spark's machine learning library. The best place to start learning it is the official programming guide. However, if you want to try it out quickly and learn by doing, it may take a while just to get it running.
To understand the sample code and experiment on the dataset, you first need to understand Spark's RDD [Translator's note: RDD, Resilient Distributed Dataset] abstraction, its basic framework and the operations it supports. Then you have to work out the different machine learning algorithms in Spark and build your program on top of them. By the time your first Spark machine learning program runs, you may already be disheartened.
The following two resources can help you avoid these problems and organize your learning:
- The complete source code of the Spark machine learning experiments, which anyone can use to reproduce the comparison with R: https://github.com/vivekmurugesan/experiments/tree/master/spark-ml
- A Docker image with Spark and the above project pre-installed, for quick deployment: https://hub.docker.com/r/vivekmurugesan/spark-hadoop/. Apache Hadoop is already installed in the container and runs in pseudo-distributed mode. This lets you test Spark by putting large files into the distributed file system; loading records from the distributed file system makes it easy to create RDD instances, as sketched below.
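As an illustration of that last point, reading records out of HDFS into an RDD is essentially a one-liner; the host, port, and path below are placeholders for your own pseudo-distributed setup.

```scala
// Read the CSV that was copied into HDFS and build an RDD of lines.
val lines = sc.textFile("hdfs://localhost:9000/user/spark/digits/train.csv")
println(s"records: ${lines.count()}")
```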
Productivity and Accuracy
Different people will use different criteria to judge these tools. For me, accuracy and productivity are the deciding factors.
People often prefer R over Spark machine learning because of the learning curve they have already climbed. But in the end they can only work with a small sample of the data in R, because R takes far too long on large samples, which also hurts the performance of the system as a whole.
To me, working on a small sample does not solve the problem, because a small sample simply does not represent the whole (at least in most cases). So by using a small sample, you are choosing to compromise on accuracy.
Once you drop the small-sample approach, it comes down to productivity. Machine learning problems are inherently iterative: if each iteration takes a long time, the total completion time stretches out; if each iteration takes only a little time, you have more time left to write code.
Conclusion
The R language has its statistical computing libraries and visualization libraries such as ggplot2, so it cannot be discarded entirely; its ability to explore data and produce summary statistics is beyond doubt.
However, when it comes to building models on large datasets, we should reach for tools like Spark ML. Spark also provides an R package, SparkR, which lets you apply R to distributed datasets.
It is a good idea to keep more tools in your "data arsenal", because you never know what you will run into in the next "battle". So it is time to move from the R era of the past into the new era of Spark ML.