Run R Programs on a Hadoop Cluster: Install RHadoop

Source: Internet
Author: User

RHadoop is an open source project initiated by Revolution Analytics that combines the statistical language R with Hadoop. The project currently consists of three R packages: rmr, which supports writing MapReduce applications in R; rhdfs, which lets R access HDFS; and rhbase, which lets R access HBase. The packages can be downloaded from https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads.

Note: The following is a summary written after a successful installation. The descriptions of intermediate problems and their workarounds (marked in red in the original) may not be accurate and are for reference only. The server operating system is CentOS 5.6.

First, software versions

R 2.13.1, Hadoop cluster (CDH3), JDK 1.6.

Second, installation nodes

rhbase and rhdfs are installed on the NameNode of the Hadoop cluster, while rmr needs to be installed on every node of the cluster.

Third, installation

Because of network restrictions, the package sources had to be downloaded locally and then installed with the shell command R CMD INSTALL 'package_name'.

a) Install rhdfs first. It depends on the rJava package, so download the rJava source and install that first.

R CMD INSTALL 'rJava_0.9-3.tar.gz'

R CMD INSTALL 'rhdfs_1.0.1.tar.gz'

Installing rJava may fail with the error message "checking whether JNI programs can be compiled... configure: error: Cannot compile a simple JNI program. See config.log for details." This may be caused by the JDK version; installing JDK 1.6 is recommended.
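Before retrying the rJava install, it can help to confirm which JDK the shell actually picks up. The following is a minimal sketch, not part of the original procedure; the helper name jdk_version is hypothetical, and it only assumes that java -version prints a line containing version "x.y.z".

```shell
#!/bin/sh
# jdk_version (hypothetical helper): reads `java -version` style text on
# stdin and prints the quoted version string, e.g. 1.6.0_45.
jdk_version() {
    sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

if command -v java >/dev/null 2>&1; then
    # java -version writes to stderr, so merge it into stdout first.
    found="$(java -version 2>&1 | jdk_version)"
    echo "java on PATH reports: ${found:-unknown}"
    echo "JAVA_HOME=${JAVA_HOME:-<unset>}"
else
    echo "no java executable found on PATH"
fi
```

If the reported version is not the intended JDK, adjusting PATH and JAVA_HOME and then running R CMD javareconf (which re-detects the Java settings that R uses) may be worth trying before reinstalling rJava.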

b) Install rmr. It depends on the packages RJSONIO, itertools, and digest, and itertools in turn depends on iterators.

R CMD INSTALL 'iterators_1.0.5.tar.gz'

R CMD INSTALL 'itertools_0.1-1.tar.gz'

R CMD INSTALL 'RJSONIO_0.96-0.tar.gz'

R CMD INSTALL 'digest_0.5.1.tar.gz'

R CMD INSTALL 'rmr_1.1.tar.gz'
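Since rmr has to be installed on every node, the dependency order above can be captured in a small script and reused. This is only a sketch: the variable names PKG_DIR and DRY_RUN and the function name install_rmr_chain are hypothetical, and by default it only prints the commands instead of running them.

```shell
#!/bin/sh
# PKG_DIR: directory holding the downloaded tarballs (assumed: current dir).
# DRY_RUN: 1 = only print the commands, 0 = actually run them.
PKG_DIR="${PKG_DIR:-.}"
DRY_RUN="${DRY_RUN:-1}"

install_rmr_chain() {
    # iterators must precede itertools; rmr goes last because it needs the rest.
    for pkg in iterators_1.0.5 itertools_0.1-1 RJSONIO_0.96-0 digest_0.5.1 rmr_1.1; do
        tarball="$PKG_DIR/$pkg.tar.gz"
        if [ "$DRY_RUN" = "1" ]; then
            echo "R CMD INSTALL $tarball"
        else
            # Stop at the first failure so later packages are not attempted.
            R CMD INSTALL "$tarball" || return 1
        fi
    done
}

install_rmr_chain
```

Running it with DRY_RUN=0 performs the real installation; stopping at the first failure keeps a broken dependency from producing a confusing error later in the chain.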

c) Install rhbase (see https://github.com/RevolutionAnalytics/RHadoop/wiki/rhbase). Before installing rhbase, you also need to install the Thrift library. Thrift 0.6.1 is recommended and can be downloaded from http://thrift.apache.org/. The detailed installation steps are as follows:

I. On CentOS, run the shell command sudo yum install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel to install the tools and libraries that Thrift needs. Because of network connection problems, not all of them were installed in this attempt. In my experience a complete install is not necessary; it is enough to have g++ 3.3.5 or later and Boost 1.33.1 or later.
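The two minimum versions mentioned above can be checked from the shell. The sketch below is an illustration rather than part of the original procedure: version_ge is a hypothetical helper, and the Boost header path /usr/include/boost/version.hpp is an assumption that may differ on other systems.

```shell
#!/bin/sh
# version_ge HAVE WANT (hypothetical helper): succeeds when dotted version
# HAVE is greater than or equal to WANT, comparing fields numerically.
version_ge() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        n = split(have, a, "."); m = split(want, b, ".")
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {
            x = (i <= n) ? a[i] + 0 : 0   # missing fields count as 0
            y = (i <= m) ? b[i] + 0 : 0
            if (x > y) exit 0
            if (x < y) exit 1
        }
        exit 0
    }'
}

if command -v g++ >/dev/null 2>&1; then
    gxx="$(g++ -dumpversion)"
    if version_ge "$gxx" 3.3.5; then echo "g++ $gxx: ok"; else echo "g++ $gxx: too old"; fi
fi

hdr=/usr/include/boost/version.hpp   # assumed location of the Boost headers
if [ -r "$hdr" ]; then
    # BOOST_LIB_VERSION looks like "1_33_1"; turn underscores into dots.
    boost="$(sed -n 's/^#define BOOST_LIB_VERSION "\(.*\)".*/\1/p' "$hdr" | tr '_' '.')"
    if [ -n "$boost" ] && version_ge "$boost" 1.33.1; then
        echo "boost $boost: ok"
    else
        echo "boost ${boost:-unknown}: too old or not detected"
    fi
fi
```

Comparing field by field as numbers avoids the mistake of treating 1.9 as newer than 1.33, which a plain string comparison would make.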

II. Unpack and install Thrift.

tar -zxvf thrift-0.6.1.tar.gz

cd thrift-0.6.1

./configure --with-boost=/usr/include/boost JAVAC=/usr/jdk1.6/bin/javac

make

make install

Adjust the values of --with-boost and JAVAC to match the actual setup of the server (it is not certain that the JAVAC setting is required).

III. Set the environment variable PKG_CONFIG_PATH.

Add the following line to /etc/profile:

export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig/ (then run source /etc/profile to make the environment variable take effect). Afterwards, run the shell command pkg-config --cflags thrift to verify that the pkg-config path is set correctly. If it returns -I/usr/local/include/thrift, the setting is successful.
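If /etc/profile is sourced more than once, a plain append repeats the same directory in PKG_CONFIG_PATH. A minimal sketch of an append that skips directories already present follows; the function name add_pkgconfig_dir is hypothetical, and the check at the end simply reports what pkg-config finds without assuming Thrift is installed.

```shell
#!/bin/sh
# add_pkgconfig_dir DIR (hypothetical helper): append DIR to PKG_CONFIG_PATH
# unless it is already listed there, then export the variable.
add_pkgconfig_dir() {
    dir="$1"
    case ":${PKG_CONFIG_PATH:-}:" in
        *":$dir:"*) ;;   # already present, nothing to do
        *) PKG_CONFIG_PATH="${PKG_CONFIG_PATH:+$PKG_CONFIG_PATH:}$dir" ;;
    esac
    export PKG_CONFIG_PATH
}

add_pkgconfig_dir /usr/local/lib/pkgconfig

# Report whether pkg-config can now locate Thrift.
if command -v pkg-config >/dev/null 2>&1 && pkg-config --exists thrift; then
    pkg-config --cflags thrift
else
    echo "thrift not found via PKG_CONFIG_PATH=$PKG_CONFIG_PATH"
fi
```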

IV. Copy the library file.

cp /usr/local/lib/libthrift.so.0 /usr/lib

V. Install rhbase.

R CMD INSTALL 'rhbase_1.0.1.tar.gz'

Fourth, verification and testing

If library(rmr), library(rhdfs), and library(rhbase) each load without errors on the R command line, the installation is successful.

Test case: use MapReduce to implement the same functionality as the function tapply.
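The same check can be scripted so that it is easy to repeat on each node. This is a sketch under the assumption that Rscript is on the PATH; the function name check_r_package is hypothetical. On nodes other than the NameNode, only rmr is expected to load, since rhdfs and rhbase are not installed there.

```shell
#!/bin/sh
# check_r_package NAME (hypothetical helper): succeeds if the package
# loads in a fresh, non-interactive R session.
check_r_package() {
    Rscript -e "library($1)" >/dev/null 2>&1
}

if command -v Rscript >/dev/null 2>&1; then
    for pkg in rmr rhdfs rhbase; do
        if check_r_package "$pkg"; then
            echo "$pkg: loads"
        else
            echo "$pkg: FAILED to load"
        fi
    done
else
    echo "Rscript not found on PATH; skipping check"
fi
```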

Normal R code:

groups = rbinom(32, n = 50, prob = 0.4)

tapply(groups, groups, length)

R code implemented using MapReduce:

groups = to.dfs(groups)

(The groups vector from the previous example is reused so that both versions run on the same data.)

from.dfs(mapreduce(input = groups, map = function(k, v) keyval(v, NULL), reduce = function(k, vv) keyval(k, length(vv))))

Fifth, further reading:

https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial

https://github.com/RevolutionAnalytics/RHadoop/wiki/Writing-composable-mapreduce-jobs

https://github.com/RevolutionAnalytics/RHadoop/wiki/Efficient-rmr-techniques
