The R Language Injects Statistical Blood into Hadoop

Source: Internet
Author: User

R is a GNU open-source tool with an S-language pedigree, strong in statistical computing and statistical graphics. RHadoop, an open-source project launched by Revolution Analytics, combines the R language with Hadoop and gives R a good place to apply its strengths. With a powerful tool like RHadoop, the many R enthusiasts can now work in the big data field, which is undoubtedly good news for R programmers. The author gives a detailed explanation of the R language and Hadoop from a programmer's point of view.

The following is the original text:

Preface

I have written several technical articles about RHadoop that described, from a statistical perspective, how to get the R language to handle big data using Hadoop. Today I take the opposite direction and describe, from a computer developer's perspective, how to get Hadoop to work with the R language for statistical analysis.

Contents

1. R Language Introduction
2. Hadoop Introduction
3. Why combine Hadoop with the R language?
4. How do I combine Hadoop with the R language?
5. R and Hadoop in practice

1. R Language Introduction

Origin

R is a free software programming language and environment, mainly used for statistical analysis, graphics, and data mining. R was originally developed by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand (the name R comes from their initials) and is now developed by the R Development Core Team. R is a GNU project based on the S language, so it can be regarded as an implementation of S. R's syntax follows S, while its lexical scoping semantics are borrowed from Scheme.

Cross-platform, license

R's source code is free to download and use under the GNU General Public License, and R runs on a variety of platforms, including Unix, Linux, Windows, and Mac OS. R is operated mainly from the command line, and graphical user interfaces are also available.

R's numerical genes

R has many built-in statistical and numerical analysis functions. Because of its S heritage, R has stronger object-oriented features than other statistical or mathematical programming languages.

Another strength of R is its graphics: it produces publication-quality plots and can include mathematical symbols in them.

Although R is primarily used for statistical analysis or for developing statistics-related software, it is also used for matrix computation. Its speed in such work is comparable to GNU Octave and even to the commercial software MATLAB.

Code base

CRAN is the abbreviation for Comprehensive R Archive Network. In addition to R's executable downloads, source code, and documentation, it also hosts a wide variety of user-written packages. There are more than 100 CRAN mirror sites worldwide, with tens of thousands of third-party packages.

R Industry Applications

Statistical analysis, applied mathematics, econometrics, financial analysis, humanities, data mining, artificial intelligence, bioinformatics, biopharmaceuticals, geoscience, data visualization.

Commercial competitors

SAS (Statistical Analysis System) is a large-scale, integrated, modular software system from SAS Institute for data analysis and decision support.

SPSS (Statistical Product and Service Solutions) is a family of software products and related services from IBM for statistical analysis, data mining, predictive analysis, and decision support.

MATLAB (MATrix LABoratory) is commercial mathematics software produced by MathWorks. It is a high-level technical computing language and interactive environment for algorithm development, data visualization, data analysis, and numerical computation.

2. Hadoop Introduction

Hadoop is a familiar technology to people who work in computing.

Hadoop is a distributed system infrastructure developed by the Apache Software Foundation. Users can develop distributed programs without understanding the low-level details of distribution, and take full advantage of a cluster for high-speed computation and storage. Hadoop implements a distributed file system, the Hadoop Distributed File System, or HDFS for short. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and suits applications with large data sets. HDFS relaxes some POSIX requirements so that file system data can be accessed in a streaming fashion.

Members of the Hadoop family: Hive, HBase, ZooKeeper, Avro, Pig, Ambari, Sqoop, Mahout, Chukwa

Hive: A data warehouse tool based on Hadoop. It maps structured data files to database tables and runs simple MapReduce statistics through SQL-like statements, with no need to develop dedicated MapReduce applications, which makes it well suited to statistical analysis in a data warehouse.

Pig: A large-scale data analysis tool based on Hadoop. It provides a SQL-like language called Pig Latin, whose compiler converts data analysis requests into a series of optimized MapReduce operations.

HBase: A highly reliable, high-performance, column-oriented, scalable distributed storage system. HBase can be used to build large structured storage clusters on inexpensive PC servers.

Sqoop: A tool for transferring data between Hadoop and relational databases. It can import data from a relational database (MySQL, Oracle, Postgres, etc.) into HDFS, and export HDFS data into a relational database.

ZooKeeper: A distributed, open-source coordination service designed for distributed applications. It is mainly used to solve data management problems that distributed applications often run into, simplifying their coordination and management and providing high-performance distributed services.

Mahout: A distributed framework for machine learning and data mining based on Hadoop. Mahout implements some data mining algorithms with MapReduce, which solves the problem of parallel mining.

Avro: A data serialization system designed to support data-intensive applications that exchange large volumes of data. Avro is a new data serialization format and transport tool that is intended to gradually replace Hadoop's existing IPC mechanism.

Ambari: A web-based tool that supports provisioning, managing, and monitoring Hadoop clusters.

Chukwa: An open-source data collection system for monitoring large distributed systems. It collects various types of data into files suitable for Hadoop processing and stores them in HDFS for various MapReduce operations.
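As a rough illustration of what a tool like Hive automates, the sketch below expresses a query of the form `SELECT page, COUNT(*) ... GROUP BY page` as explicit map, shuffle, and reduce steps. It is plain Python running in a single process, not Hadoop or Hive code; the records and the function names are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical access-log records: (page, HTTP status code).
RECORDS = [
    ("/home", 200),
    ("/news", 200),
    ("/home", 404),
    ("/news", 200),
    ("/home", 200),
]


def map_phase(record):
    """Emit one (key, 1) pair per record, as a mapper would."""
    page, _status = record
    yield page, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the values collected for one key, as a reducer would."""
    return key, sum(values)


def count_by_page(records):
    """Run the three phases in sequence and return {page: count}."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(count_by_page(RECORDS))
```

On a real cluster the map and reduce steps would run on many machines at once; the structure of the computation stays the same.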

Hadoop began its independent development in 2006 with MapReduce and HDFS, and by 2013 the Hadoop family had hatched several Apache top-level projects. In the last one or two years in particular, development has sped up and many new technologies have been integrated (YARN, HCatalog, Oozie, Cassandra), to the point that it is hard for all of us to keep up.

3. Why combine Hadoop with the R language?

The previous two chapters, the R language introduction and the Hadoop introduction, show that each technology is powerful in its own field. From a computing perspective, many developers ask the following two questions.

Question 1: The Hadoop family is so powerful, so why combine it with the R language?
Question 2: Mahout can also do data mining and machine learning, so how does it differ from the R language?

Below I try to answer them:

Question 1: The Hadoop family is so powerful, so why combine it with the R language?

A. The power of the Hadoop family lies in processing big data, which makes possible what used to be impossible (TB- and PB-scale data).

B. The power of the R language lies in statistical analysis. Before Hadoop, handling large data meant taking samples, testing hypotheses, and fitting regressions, and R has long been the statistician's tool of choice.

C. From points A and B we can see that Hadoop focuses on analyzing the full data set, while the R language focuses on analyzing sample data. Combined, the two technologies make up for each other's weaknesses.

D. Simulated scenario: analyze 1 PB of access logs from a news website to predict future traffic changes

D1: In R, analyze a small amount of data, build a regression model for the business objective, and define the metrics

D2: Use Hadoop to extract the metric data from the massive log data

D3: Use the R model to test and tune against the metric data

D4: Rewrite the R model as a distributed Hadoop algorithm and deploy it online
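Steps D1 and D4 of the scenario can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the author's implementation: it uses Python rather than R or Hadoop, it assumes the model is a simple linear regression of daily page views on the day index, and the data and function names are made up for illustration. What it shows is that least squares only needs a handful of sums, so each data partition can be summarized independently (the map step) and the summaries added together (the reduce step).

```python
from functools import reduce

# Hypothetical (day index, page views) pairs, split into two partitions
# as they might be spread across cluster nodes.
PARTITIONS = [
    [(0, 100), (1, 105)],
    [(2, 110), (3, 115), (4, 120)],
]
SAMPLE = PARTITIONS[0]  # D1: a small sample used to build the model


def partial_stats(partition):
    """Map step: summarize one partition as (n, sum x, sum y, sum xy, sum x^2)."""
    n = len(partition)
    sx = sum(x for x, _ in partition)
    sy = sum(y for _, y in partition)
    sxy = sum(x * y for x, y in partition)
    sxx = sum(x * x for x, _ in partition)
    return (n, sx, sy, sxy, sxx)


def combine(a, b):
    """Reduce step: the summaries of two partitions add element-wise."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(stats):
    """Least-squares fit of y = intercept + slope * x from the summed statistics."""
    n, sx, sy, sxy, sxx = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


def distributed_fit(partitions):
    """D4: fit the same model on all data by combining per-partition summaries."""
    return fit_line(reduce(combine, map(partial_stats, partitions)))


if __name__ == "__main__":
    print("sample fit:", fit_line(partial_stats(SAMPLE)))
    print("full fit:  ", distributed_fit(PARTITIONS))
```

Because the summaries simply add up, the full-data fit gives the same coefficients however the data happens to be partitioned.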

In this scenario, both R and Hadoop play very important roles. If, thinking like a computer developer, we do everything with Hadoop, without data modeling and validation, the "predicted results" are bound to be problematic. If, thinking like a statistician, we do everything with R, relying on sampling alone, the "predicted results" are also bound to be problematic.

Therefore, combining the two is the inevitable direction for the industry and a meeting point of industry and academia, and it offers interdisciplinary talent unlimited room for imagination.

Question 2: Mahout can also do data mining and machine learning, so how does it differ from the R language?

A. Mahout is a Hadoop-based algorithm framework for data mining and machine learning; its focus is solving computation on big data.

B. The algorithms Mahout currently supports include collaborative filtering, recommendation algorithms, clustering algorithms, classification algorithms, LDA, naive Bayes, and random forests. Most of these are distance-based algorithms that can be decomposed through matrix operations, so they make full use of the MapReduce parallel computing framework and complete computing tasks efficiently.

C. Mahout has gaps: many data mining algorithms are hard to parallelize with MapReduce. Mahout's existing models are all general-purpose models; used directly in a project, their results are only a little better than random. Secondary development on Mahout requires a solid grounding in Java and Hadoop, preferably along with basics such as linear algebra, probability and statistics, and introductory algorithms. So mastering Mahout is really not easy.

D. The R language also provides most of the algorithms Mahout supports (except proprietary ones), supports a large number of algorithms that Mahout does not, and its set of algorithms grows faster than Mahout's. It is also simple to develop in, flexible in parameter configuration, and very fast on small data sets.
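Point B says that distance-based algorithms fit MapReduce well. A minimal sketch of why, using one iteration of k-means clustering as the example: assigning each point to its nearest centroid is independent per point, so it can run as a map step, and averaging the points of each cluster is a per-key reduce step. K-means is chosen here only for illustration; this is plain Python, not Mahout code, and the points, starting centroids, and function names are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical 2-D points and starting centroids.
POINTS = [(0, 0), (0, 2), (10, 10), (10, 12)]
CENTROIDS = [(1, 1), (9, 9)]


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def map_assign(points, centroids):
    """Map step: emit (centroid index, point) for every point."""
    for p in points:
        yield nearest(p, centroids), p


def reduce_mean(assigned):
    """Reduce step: average the points assigned to each centroid index."""
    groups = defaultdict(list)
    for idx, p in assigned:
        groups[idx].append(p)
    return {
        idx: tuple(sum(coord) / len(pts) for coord in zip(*pts))
        for idx, pts in groups.items()
    }


def kmeans_step(points, centroids):
    """One k-means iteration; a centroid with no points keeps its old position."""
    means = reduce_mean(map_assign(points, centroids))
    return [means.get(i, c) for i, c in enumerate(centroids)]


if __name__ == "__main__":
    print(kmeans_step(POINTS, CENTROIDS))
```

A full k-means run repeats this step until the centroids stop moving, which on Hadoop means one MapReduce job per iteration.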

Although Mahout can also do data mining and machine learning, its area of expertise does not overlap with that of the R language. Drawing on the strengths of each and choosing the right technology for the right area is how to really build software with both "quality and quantity".

4. How do I combine Hadoop with the R language?

From the previous section we saw that Hadoop and R can complement each other, but in the scenarios described, Hadoop and R each processed their own data separately.

Once the market has a demand, businesses will naturally fill the gap.

1). RHadoop

RHadoop is a product that combines Hadoop and the R language, developed by Revolution Analytics and open-sourced on GitHub. RHadoop contains three R packages (rmr, rhdfs, rhbase), which correspond to the MapReduce, HDFS, and HBase parts of the Hadoop architecture.

2). RHive

RHive is a toolkit for accessing Hive directly from the R language, developed by NexR, a Korean company.

3). Rewrite mahout

Rewriting Mahout in R is another way to combine the two; I have made some attempts at this.

4). Hadoop calls R

Everything above is about how R invokes Hadoop. Of course, we can also work in the reverse direction: open a connection channel between Java and R and let Hadoop call R functions. However, no company has yet turned this approach into a finished product.

5. R and Hadoop in practice

The technical threshold for combining R and Hadoop is still rather high. One person must not only master Linux, Java, Hadoop, and R, but also have basic competence in software development, algorithms, probability and statistics, linear algebra, data visualization, and the relevant industry background.

Deploying such an environment in a company also requires coordination among several departments and many kinds of talent: Hadoop operations, Hadoop algorithm development, R modeling, R MapReduce development, software development, testing, and so on.

So there are not many real cases.

Looking ahead

The combination of R and Hadoop is certain to see explosive growth in the coming years. But because its interdisciplinary nature creates a technical barrier, the supply of talent will not keep up with market demand.
