Introduction: It is well known that R is unmatched for statistical problems, but it slows down badly once data reaches roughly 2 GB. One proposed remedy is to run distributed algorithms by combining R with Hadoop. Are there teams that instead use solutions like Python + Hadoop? And given R's origins as a statistical computing package, is the R + Hadoop combination really workable?

From Wang Frank's answer on Zhihu:

Teams reach for R + Hadoop because they do not understand the application scenarios each tool was designed for; they are simply grasping at a free, open-source straw.

R:

R's strength is not "unparalleled statistical learning" but unparalleled output per line of code on structured data. Training a neural network or a decision tree on structured data takes one line; prediction takes one more. That is why commercial databases (Oracle, Netezza, Teradata, SAP HANA, and so on) provide R interfaces that statisticians can use efficiently. SAS and IBM SPSS can cover part of the same ground, but they lack R's uniquely large CRAN package ecosystem. That same ecosystem, however, spoils R's users: many treat R as merely a free SAS or SPSS, rather than reading the code to learn even a little of the core principles of machine learning. What R really lets you do is apply the most efficient, most up-to-date algorithms for structured data.

More importantly, data loaded into these databases is not only correct and structured in itself; the data model also satisfies second and third normal form (the first lesson of CA ERwin). For any analysis, a simple join in the database at hand is enough to produce the wide table you need. Think about why SQL offers constructs like SUM() OVER: why deliberately make data redundant? It must exist for the sake of BI and analytics.
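A minimal sketch of "a simple join produces the wide table you analyze", using the in-memory SQLite that ships with Python; the table and column names here are invented for illustration:

```python
# Join normalized tables into an analysis-ready "wide" table with the
# SQLite engine bundled in the Python standard library.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (cust_id INTEGER, region TEXT);
    CREATE TABLE orders    (cust_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'north'), (2, 'south');
    INSERT INTO orders    VALUES (1, 10.0), (1, 15.0), (2, 7.5);
""")

# One join plus an aggregate yields the wide table the analyst wants.
wide = con.execute("""
    SELECT c.cust_id, c.region, SUM(o.amount) AS total_spend
    FROM customers c JOIN orders o ON o.cust_id = c.cust_id
    GROUP BY c.cust_id, c.region
""").fetchall()
```

The normalized tables stay clean and non-redundant; the redundancy lives only in the derived wide table, which exists purely for analysis.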

Hadoop:

Hadoop's application scenario is not to provide strong support for statistical analysis software; it is a general-purpose, free framework for distributed data that efficiently stores raw, unstructured data as key-value pairs.

R + Hadoop, a marriage of structured analysis and an unstructured store, looks beautiful, but in practice it is hard to find. My view is that any firm that decides to take a serious, long-term position in data analysis (setting text mining aside for the moment) will, without exception, choose a traditional structured database for the structured analysis that follows, even a paid one. If your team is comfortable developing code, then Hadoop + Python for the initial data processing, followed by the Java-based Mahout, is the natural choice.

The illusion of R + Hadoop:

Whatever you combine with Hadoop, you start from key-value pairs, word count being the canonical example. R can do this, but calling R "unparalleled" here mistakes its strength. The beauty of R is its unmatched output per line of code on structured data. The moment you, an analyst focused on data rather than a seasoned code developer, find yourself manipulating lists and low-level data structures in R and rewriting mappers and reducers in R, you should ask yourself:
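The key-value starting point the paragraph describes can be sketched in a few lines. This is the canonical word count written as a mapper and a reducer, the shape you would port to Hadoop Streaming in any language:

```python
# Word count as an explicit mapper/reducer pair. The sorted() call
# stands in for Hadoop's shuffle, which delivers pairs grouped by key.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    for word in line.lower().split():
        yield (word, 1)

def reducer(pairs):
    # pairs must arrive sorted by key, as the shuffle guarantees.
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["to be or not to be"]
shuffled = sorted(kv for line in lines for kv in mapper(line))
counts = dict(reducer(shuffled))
```

Nothing here plays to R's strengths: it is pure key-value bookkeeping, which is exactly the point of the question that follows.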

Why not just learn Java or Python? To those who do not know those languages this kind of analysis looks "unorthodox", but refusing to learn them does not make the work go away.

Python is built around key-value storage, also delivers very high output per line of code, and has many scientific computing packages (see NumPy and SciPy). In that sense you can build your own white-box, single-machine Mahout with shuffling, well suited to learning on big data with incremental algorithms. And it is just as free.
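A sketch of the "incremental algorithm" idea mentioned here, assuming the simplest possible model: stochastic gradient descent for one-dimensional least squares, updating the coefficient one record at a time so the full data set never has to fit in memory (the function name `sgd_slope` is invented):

```python
# Incremental learning in miniature: update a single regression
# coefficient record by record instead of loading all data at once.
def sgd_slope(records, lr=0.01, epochs=200):
    w = 0.0
    for _ in range(epochs):
        for x, y in records:          # could equally be a stream from disk
            w -= lr * 2 * (w * x - y) * x  # gradient of (w*x - y)**2
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # lies exactly on y = 2x
w = sgd_slope(data)
```

Because each update touches one record, the same loop works whether `records` is a list, a file iterator, or a stream out of Hadoop.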

The illusion of data mining:

What is data mining, and is it hard?

Data mining in the broad sense, including data analysis and machine learning, rests on only a handful of core mathematical ideas, each statable in a few words; and R's conciseness lets each be done in a few lines:

0 Data cleaning and standardization. This step and steps 1-4 are complementary ways of understanding the real world.

1 The first mathematical technique to learn is matrix (spatial) decomposition: Cholesky (LL'), PCA, SVD, regression, and L2/L0 penalties.
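The computational core of step 1's PCA can be sketched without any library: power iteration recovers the leading eigenvector and eigenvalue of a (tiny, invented) covariance matrix.

```python
# Power iteration: the heart of PCA, on a 2x2 toy covariance matrix.
def power_iteration(A, steps=100):
    v = [1.0] * len(A)
    for _ in range(steps):
        w = [sum(a * x for a, x in zip(row, v)) for row in A]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient gives the corresponding eigenvalue.
    eigval = sum(vi * sum(a * x for a, x in zip(row, v))
                 for vi, row in zip(v, A))
    return eigval, v

cov = [[2.0, 1.0], [1.0, 2.0]]   # toy covariance matrix
lam, pc1 = power_iteration(cov)  # first principal component direction
```

Real PCA on large data delegates this to optimized SVD routines, but the iteration above is what those routines are accelerating.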

2 Learn optimization algorithms: L1-penalized regression and SVM (this is where Newton-Raphson, Gauss-Newton, and Levenberg-Marquardt are used!); Markov chain Monte Carlo.
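Of the optimizers step 2 names, Newton-Raphson is the easiest to show whole. A minimal root-finding sketch (the classic square-root example, not any statistical fit):

```python
# Newton-Raphson: follow the tangent line to a root of f.
def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Solve x**2 - 2 = 0, i.e. compute sqrt(2).
root = newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
```

Gauss-Newton and Levenberg-Marquardt are the same idea lifted to least-squares objectives, with the Hessian approximated or damped.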

3 Data structures: decision trees (list-like structures), word-frequency statistics (key-value pairs, i.e. dictionaries), FP-growth (a tree plus linked lists). Once you have learned these, you see that so-called "naive Bayes" barely qualifies as an algorithm; it is only a guiding principle.
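Step 3's first two structures fit in a few lines each: word frequency as a dictionary, and a toy decision "tree" (here a single stump, with invented fields) as a nested list.

```python
# Word frequency: a dictionary of key-value pairs.
from collections import Counter

freq = Counter("the cat sat on the mat".split())

# A toy decision stump as a list: [feature, threshold, left, right].
stump = ["age", 30, "reject", "accept"]

def classify(record, node):
    feature, threshold, left, right = node
    return left if record[feature] < threshold else right

label = classify({"age": 42}, stump)
```

A full tree is just stumps whose `left`/`right` slots hold further lists; that list-of-lists shape is exactly what the paragraph means.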

4 Model ensembling: AdaBoost, neural networks, bootstrap. Methods and model parameters alike can be combined (a hodgepodge!).
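The bootstrap in step 4 is the simplest ensembling idea to show end to end: resample the data with replacement and look at the spread of a statistic (the data values here are invented).

```python
# Bootstrap: resample with replacement, recompute the statistic.
import random

random.seed(0)                     # for a reproducible sketch
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

def bootstrap_means(data, n_resamples=1000):
    means = []
    for _ in range(n_resamples):
        sample = [random.choice(data) for _ in data]
        means.append(sum(sample) / len(sample))
    return means

means = bootstrap_means(data)      # the spread estimates sampling error
```

Bagging is this loop with a model fit inside it instead of a mean; AdaBoost replaces the uniform resampling with weights that concentrate on past mistakes.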

Any impressive-sounding algorithm, once taken apart, is fated to resolve into some combination of these four.

Can you see where the bottleneck of big-data analysis lies?

Step 0: discuss it with the big boss. A traditional data-warehouse implementation can serve an industry for at least ten years, and the two abstractions "entity-relationship" and "key-value" will last at least thirty. Data organization, filtering, and metadata maintenance are the only road by which data generates value. The work is boring but fundamental, and big data needs it just as much as traditional data does;

Step 1 is the most basic and most important analysis step, and under big data it is also the most likely to produce a sparse matrix with billions of entries that no single machine can analyze. Example 1: users' purchase records by commodity SKU. Example 2: a specific user taking a specific action at a specific latitude, longitude, and time. Both are typical cases where "summarizing loses more than keeping the detail", so you must have distributed sparse-matrix processing technology;
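The sparse representation step 1 calls for can be sketched with a plain dictionary, using Example 1's user x SKU purchase matrix (IDs and the helper name `record_purchase` are invented):

```python
# Sparse matrix as a dictionary of (row, col) -> value: only the
# nonzero cells of the user x SKU matrix cost any memory.
purchases = {}  # (user_id, sku_id) -> quantity

def record_purchase(user_id, sku_id, qty=1):
    key = (user_id, sku_id)
    purchases[key] = purchases.get(key, 0) + qty

record_purchase("u1", "sku9")
record_purchase("u1", "sku9")
record_purchase("u2", "sku3")

# A dense matrix for even 3 users x 1,000,000 SKUs would hold 3 million
# cells; the dictionary stores only the 2 that are nonzero.
nonzeros = len(purchases)
```

Distributed sparse-matrix systems shard exactly this kind of key space across machines; the data structure is the same, only the partitioning is new.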

Step 2: inherently serial MCMC can be approximated by parallel ensemble methods, but convergence is still slow and has to be brute-forced with parallel FLOPS. For SVM and the Lasso, by contrast, incremental and distributed algorithm schemes exist. Their core idea is that "the real world, and therefore the model, is sparse": lock only a small amount of state while updating the model coefficients or gradients in a distributed way. Even after the theory is settled, these algorithms depend on an analytic database or big-data platform for flexible concurrent scheduling and flexible row/column mixed storage, which single machines, small clusters, and traditional databases find hard to match;
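The "update a coefficient, drive small ones to zero" mechanism behind the Lasso schemes mentioned here can be shown at its smallest: the soft-thresholding operator that a coordinate-descent step applies, which is what produces the sparsity the paragraph describes.

```python
# Soft-thresholding: the per-coefficient update at the heart of
# coordinate descent for the Lasso. Coefficients whose (unpenalized)
# value falls inside [-penalty, penalty] become exactly zero.
def soft_threshold(z, penalty):
    if z > penalty:
        return z - penalty
    if z < -penalty:
        return z + penalty
    return 0.0

updated = [soft_threshold(z, 1.0) for z in (-3.0, -0.5, 0.2, 2.5)]
```

Because each update touches one coefficient and a small slice of data, coordinates can be distributed across workers; that is the locking scheme the text alludes to.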

Steps 3 and 4: the examples here are very simple, but these are the steps where the mathematical model and the data model put the most pressure on development, and where the skills of experienced programmers matter most. In NLP, for example, you may still need PCA (or other large-matrix processing) on term statistics; but if you instead bring in an HMM and an underlying dictionary trie, the learning cost is little more than Bayes' theorem, and the NLP problem can still be solved effectively in parallel. The Viterbi and CRF algorithms are interesting references here.
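The Viterbi algorithm referenced above fits in a short sketch. The states, observations, and probabilities below are the familiar invented rainy/sunny toy HMM, not anything from the text:

```python
# Viterbi: dynamic programming over HMM state paths.
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = (best probability of reaching s at time t, best path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        V.append({})
        for s in states:
            prob, path = max(
                (V[-2][prev][0] * trans_p[prev][s] * emit_p[s][o],
                 V[-2][prev][1] + [s])
                for prev in states)
            V[-1][s] = (prob, path)
    return max(V[-1].values())[1]

states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

best_path = viterbi(("walk", "shop", "clean"),
                    states, start_p, trans_p, emit_p)
```

Note the data structures involved: dictionaries of key-value pairs and lists, exactly the step-3 toolkit, which is why this family of models parallelizes so naturally.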

The illusion of big data: the conflict between storage and computation

How big does big data have to be? As I said, in steps 3 and 4 the raw data are huge but the processed summaries are tiny, or the processing is highly independent per record. When distributed storage does not affect the analysis, that kind of "big data" is in practice no different from small data.

Clustering, regression, SVD, PCA, QR, LU: matrix decompositions that require exchanging intermediate results at every step, and that demand both computational efficiency and efficient access to the matrix, are the real challenge of big data.

As for those supervised classification trees that cut the data set into 1,000 pieces, farm them out redundantly (3-5 copies each) across 500 machines, and finally ensemble the classification results, I can hardly call that "big-data computing technology". In essence it is the same as a mining rig doing countless highly homogeneous hash computations per second: no resource exchange, no heavy communication, just spinning around small data.

In-memory analysis, data exploration, and presentation (single node):

Rough row-count ceilings: millions for R; tens of millions to 100 million for SAS; around 100 million for Python. My experience with a 400 MB data set: Python loads it into 500 MB of memory, R into 2 GB, and SAS into 600 MB (150 MB after table-level compression). Subsequent processing of raw data, especially the string manipulation that data cleaning requires, is almost impossible in R; data should be nearly analysis-ready before it enters R. If you do not believe this, try reading a file in R with readLines plus strsplit and see how efficient the cleaning is, then compare with read.delim, SAS's proc import, and Python's `with ... as` file handling.
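For the Python side of that loading comparison, a minimal sketch of the `with ... as` idiom plus the csv module, cleaning a tab-delimited input while streaming it (the field names and the in-memory stand-in for a file are invented):

```python
# Stream-and-clean a delimited file with the stdlib csv module.
import csv
import io

raw = "name\tscore\n alice \t 10\nbob\t7\n"  # stand-in for a disk file

rows = []
with io.StringIO(raw) as fh:                 # same shape as open(path) as fh
    for rec in csv.DictReader(fh, delimiter="\t"):
        rows.append({"name": rec["name"].strip(),   # clean as you read
                     "score": int(rec["score"])})
```

Because records are cleaned one at a time as they stream past, memory stays near the size of one row rather than the whole file, which is the contrast with R's load-everything-first habit.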

On the other hand, as long as the data stay below the limits just mentioned, R offers the best presentation, because "presentation in a program should be specific rather than generic". The famous ggplot2; recharts (by taiyun on GitHub), built on Baidu's ECharts; and Yihui Xie's knitr (Markdown-driven dynamic reports that weave mining results, images, and video into HTML) are friendlier, easier to operate, and better suited to quickly presenting small data sets than the existing Python visualization packages (even the GUI ones). If you happen to be a SAS user, getting to know your data this way will be better still.

My understanding is that R's output is analogous to HTML + JS + CSS: suited to lightweight analysis and lightweight display, and best for individual users.

Unstructured big data processing:

By the time your algorithm reaches the stage of "everything is prepared, just run the full data", the data are already well understood. Wikipedia's introduction to Revolution Analytics notes that R did not natively handle datasets larger than main memory. An unstructured big-data application for R can therefore only be this scenario:

- You already know the data distribution in detail (perhaps from long experience on other projects, or because this project's sample data have been partially validated against the big data set);

- You know which algorithm fits, know an incremental version exists, or believe brute-force parallelism is fine;

- You think passing a Mahout-style computation to R through a code wrapper is acceptable;

- You do not care about interactive exploration at all.

Is that really the R application scenario you need? Put another way, what advantage does R have in that scenario? Remember the efficiency ranking: R < Java < C++. If the algorithm will take months to reach production, I would rather stand back and watch.

Finally, my own humble team's experience (a data-mining department that is not actually specialized in data mining):

We talked about R + Hadoop for a long time, never touched Mahout, casually played with R's snow package, and are now ready to buy SAS.

Because I know SAS (a small amount of macro; no matrix language needed) and R (no learning cost for me), I use Python's pp package for parallelism and am considering Mahout.

UPDATE: Big-data platform users are no longer content with storage, simple processing, and off-the-shelf algorithms; they are starting to care about the efficiency of minimal queries and interactive exploration, for example Spark's in-memory solution.

Conclusion:

One reminder, in passing, for data analysts and their leaders: if A has none of B's development ability and depends entirely on B to turn the mathematics done in R into something that runs, then what is the point of A? The slight edge that comes from emphasizing mathematical theory evaporates too. Machine-learning algorithms call for different tools at different stages of research and use; failing to pick the right tool for the environment, or even to understand it, is, for an Internet practitioner, simply embarrassing.

In the United States, elite researchers do their own development; equally, elite developers do their own research. No model is perfect, and the existing models may well not meet your analytical needs. So keep an open mind toward new technologies, push data-mining research deeper, and move from code tweaking (knock-offs) to original technology.