How do I combine Hadoop with the R language for big data analysis?

Source: Internet
Author: User

Why combine Hadoop with the R language?

R and Hadoop are each powerful in their own fields. Many developers, thinking from a computing perspective, ask the following two questions.

Question 1: The Hadoop family is so powerful, so why combine it with the R language?

Question 2: Mahout can also do data mining and machine learning, so how does it differ from the R language?

Here are my answers.

Question 1: The Hadoop family is so powerful, so why combine it with the R language?

A. The strength of the Hadoop family lies in processing big data, which makes the previously impossible (TB- and PB-scale data) possible.

B. The strength of the R language lies in statistical analysis. Before Hadoop, handling large data meant taking samples, testing hypotheses, and fitting regressions, and R has long been the tool of choice for statisticians.

C. From points A and B we can see that Hadoop focuses on analyzing the full data set, while R focuses on analyzing sample data. Combined, the two technologies make up for each other's weaknesses.

D. Simulation scenario: analyze 1 PB of news website access logs to predict future traffic changes.

D1: In R, analyze a small amount of data, build a regression model for the business goal, and define the indicators.

D2: Use Hadoop to extract the indicator data from the massive log data.

D3: Use the R model to test and tune against the indicator data.

D4: Rewrite the R model as a distributed algorithm on Hadoop and deploy it online.

In this scenario, R and Hadoop both play a very important role. With a computer developer's mindset, everything is done with Hadoop; without data modeling and validation, the "predicted results" are bound to be problematic. With a statistician's mindset, everything is done with R; relying on sampling alone, the "predicted results" are bound to be problematic as well. The combination of the two is therefore the inevitable direction of the industry. It is also a meeting point of industry and academia, and it gives interdisciplinary talent a great deal of room to work in.

Question 2: Mahout can also do data mining and machine learning, so how does it differ from the R language?

A. Mahout is a Hadoop-based algorithm framework for data mining and machine learning. Its focus is solving the computation problems of big data.

B. The algorithms Mahout currently supports include collaborative filtering, recommendation algorithms, clustering, classification, LDA, naive Bayes, and random forests. Most of these algorithms are distance-based and can be decomposed into matrix operations, so they can make full use of the MapReduce parallel computing framework and complete computing tasks efficiently.

C. Mahout has gaps: many data mining algorithms are difficult to parallelize with MapReduce. Mahout's existing models are all generic ones, and when they are used directly in a project the results are often only slightly better than random. Secondary development on Mahout requires a solid foundation in Java and Hadoop, preferably along with basic knowledge of linear algebra, probability and statistics, and algorithms. So mastering Mahout is really not easy.

D. The R language also provides most of the algorithms Mahout supports (except proprietary ones), plus a large number of algorithms Mahout does not support, and its set of algorithms grows faster than Mahout's. Development is simple, parameter configuration is flexible, and it runs very fast on small data sets.
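Point B says that distance-based algorithms fit the MapReduce model well. As a rough illustration only, the sketch below expresses one k-means iteration as a map step (assign each point to its nearest centroid) and a reduce step (average the points in each group). It runs in a single Python process and is not Mahout code; the function names and the sample points are assumptions made up for this example.

```python
from collections import defaultdict
from math import dist


def map_assign(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_mean(points):
    """Reduce step: average all points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """One k-means iteration written as map, shuffle, reduce."""
    groups = defaultdict(list)  # shuffle: group mapped values by key
    for point in points:
        key, value = map_assign(point, centroids)
        groups[key].append(value)
    # A centroid with no assigned points keeps its previous position.
    return [
        reduce_mean(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]


# Hypothetical 2-D data with two obvious clusters.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_centroids = kmeans_iteration(points, centroids)
print(new_centroids)
```

Because each point is mapped independently and each group is reduced independently, both steps could in principle be spread across many machines, which is the property point B relies on.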

Although Mahout can also do data mining and machine learning, its area of expertise does not overlap with that of the R language. Drawing on the strengths of each and choosing the right technology for the right area is how to build software that delivers on both quality and quantity.
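The division of labor from the traffic scenario under question 1 (model on a small sample, then check the model against indicators extracted from the full data) can be sketched in a few lines. This is a minimal illustration in plain Python, not R and not the author's code: the indicator table is synthetic, and the names `fit_line` and `daily_views` are assumptions introduced here.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic indicator table of (day number, page views), standing in for
# what a full-data extraction job might produce. All values are made up.
random.seed(0)
daily_views = [(day, 1000 + 50 * day + random.randint(-20, 20)) for day in range(1, 31)]

# Model on a small sample (the role the article gives to R).
sample = random.sample(daily_views, 10)
intercept, slope = fit_line([d for d, _ in sample], [v for _, v in sample])

# Check the sampled model against the full indicator table.
errors = [abs(intercept + slope * d - v) for d, v in daily_views]
mean_abs_error = sum(errors) / len(errors)

# Forecast traffic for a future day.
forecast_day_31 = intercept + slope * 31
print(round(slope, 2), round(mean_abs_error, 2), round(forecast_day_31))
```

If the error on the full table is much worse than on the sample, that is the signal to revisit the model before rewriting it for distributed execution.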

How do I combine Hadoop with the R language?

From the previous section we saw that Hadoop and R can complement each other, but in the scenarios described, Hadoop and R each processed their own data separately. Once the market demands it, vendors will naturally fill this gap.

1). RHadoop

RHadoop is a product that combines Hadoop and the R language. It was developed by RevolutionAnalytics, and its source code is open on GitHub. RHadoop contains three R packages (rmr, rhdfs, rhbase), which correspond to the MapReduce, HDFS, and HBase parts of the Hadoop architecture.
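For readers unfamiliar with the MapReduce part that rmr targets, the sketch below shows the programming model itself: the user supplies a map function and a reduce function, and the framework groups the mapped values by key in between. It is a single-process Python illustration that counts hits per day in a few made-up access-log lines. It does not use or reproduce the rmr API, and the log format and function names are assumptions.

```python
from collections import defaultdict


def map_line(line):
    """Map: emit (date, 1) for one log line of the assumed form 'date path'."""
    date, _path = line.split(maxsplit=1)
    yield date, 1


def reduce_counts(date, counts):
    """Reduce: sum all counts emitted for one date."""
    return date, sum(counts)


def run_mapreduce(lines, mapper, reducer):
    """Run map, shuffle (group by key), and reduce in one process."""
    shuffled = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            shuffled[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(shuffled.items()))


# Hypothetical access-log lines.
log_lines = [
    "2024-01-01 /news/1",
    "2024-01-01 /news/2",
    "2024-01-02 /news/1",
]
hits_per_day = run_mapreduce(log_lines, map_line, reduce_counts)
print(hits_per_day)
```

On a real cluster the map and reduce calls would run on many nodes over data stored in HDFS; only the two user-supplied functions would look similar.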

2). RHive

RHive is a toolkit for accessing Hive directly from the R language, developed by NexR, a Korean company.

3). Rewriting Mahout

Rewriting Mahout in the R language is another way to combine the two, and I have made some attempts at it.

4). Hadoop calls R

Everything above is about how R calls Hadoop. Of course, we can also go in the opposite direction: open a connection channel between Java and R and let Hadoop call R functions. However, no vendor has yet turned this approach into a finished product.

5). R and Hadoop in practice

The technical threshold for combining R and Hadoop is still fairly high. One person needs to master Linux, Java, Hadoop, and R, and also needs a foundation in software development, algorithms, probability and statistics, linear algebra, data visualization, and the relevant industry. Deploying such an environment in a company also requires coordination across several departments and many kinds of specialists: Hadoop operations, Hadoop algorithm development, R modeling, MapReduce development in R, software development, testing, and so on. As a result, there are not many real-world cases yet.

Original link: http://www.36dsj.com/archives/6468
