RHadoop is an open source project initiated by Revolution Analytics that combines the statistical language R with Hadoop. The project currently consists of three R packages: rmr, which supports writing MapReduce applications in R; rhdfs, which gives R access to HDFS; and rhbase, which gives R access to HBase. Download URL: https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads.
Note: The following is a summary written after a successful installation. The descriptions of intermediate problems and their fixes may not be accurate and are for reference only. The server operating system is CentOS 5.6.
First, software versions
R 2.13.1, Hadoop cluster (CDH3), JDK 1.6.
Second, installation nodes
rhbase and rhdfs are installed on the NameNode of the Hadoop cluster, while rmr needs to be installed on every node of the cluster.
Third, installation
Due to network restrictions, the source packages had to be downloaded locally first and then installed with the shell command R CMD INSTALL 'package_name'.
a) Install rhdfs first. The package depends on rJava, so download the rJava source package and install it first.
R CMD INSTALL rJava_0.9-3.tar.gz
R CMD INSTALL rhdfs_1.0.1.tar.gz
Installing rJava may fail with the error message "checking whether JNI programs can be compiled... configure: error: Cannot compile a simple JNI program. See config.log for details." This may be caused by the JDK version; installing JDK 1.6 is recommended.
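Before rebuilding rJava it can help to confirm that the JDK R will pick up actually ships a compiler and the JNI headers. The following is only a minimal sketch and not part of the original procedure: the helper name check_jdk and the fallback path /usr/jdk1.6 are assumptions for illustration. R CMD javareconf is R's own command for re-detecting the Java settings and usually needs root.

```shell
#!/bin/sh
# check_jdk DIR: succeeds when DIR looks like a JDK that can build JNI code.
# (check_jdk is a hypothetical helper name, not part of R or rJava.)
check_jdk() {
    jdk_dir=$1
    if [ ! -x "$jdk_dir/bin/javac" ]; then
        echo "no javac under $jdk_dir/bin"
        return 1
    fi
    if [ ! -f "$jdk_dir/include/jni.h" ]; then
        echo "no jni.h under $jdk_dir/include"
        return 1
    fi
    echo "JDK looks usable: $jdk_dir"
}

# /usr/jdk1.6 is an assumed fallback location; adjust it to the server.
if check_jdk "${JAVA_HOME:-/usr/jdk1.6}"; then
    echo "next: R CMD javareconf, then R CMD INSTALL rJava_0.9-3.tar.gz"
fi
```

If the check fails, pointing JAVA_HOME at a full JDK 1.6 installation (not just a JRE) and rerunning it is a reasonable first step before retrying the rJava build.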
b) Install rmr. The package depends on RJSONIO, itertools and digest, and itertools in turn depends on iterators.
R CMD INSTALL iterators_1.0.5.tar.gz
R CMD INSTALL itertools_0.1-1.tar.gz
R CMD INSTALL RJSONIO_0.96-0.tar.gz
R CMD INSTALL digest_0.5.1.tar.gz
R CMD INSTALL rmr_1.1.tar.gz
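Because rmr has to be installed on every node, the same sequence ends up being repeated many times. One way to keep the dependency order fixed is a small wrapper like the sketch below. The function name install_packages, the DRY_RUN switch and the directory /tmp/rhadoop-src are assumptions made for this example; the tarball names should match the files actually downloaded.

```shell
#!/bin/sh
# Tarballs in dependency order: iterators before itertools, rmr last.
RMR_PACKAGES="iterators_1.0.5.tar.gz itertools_0.1-1.tar.gz RJSONIO_0.96-0.tar.gz digest_0.5.1.tar.gz rmr_1.1.tar.gz"

# install_packages DIR: runs R CMD INSTALL on each tarball found in DIR.
# With DRY_RUN=1 (a convention of this sketch only) the commands are
# printed instead of executed.
install_packages() {
    src_dir=$1
    for pkg in $RMR_PACKAGES; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "R CMD INSTALL $src_dir/$pkg"
        else
            # Stop at the first failure so later packages are not built
            # against a missing dependency.
            R CMD INSTALL "$src_dir/$pkg" || return 1
        fi
    done
}

# Preview the commands without touching the R library.
DRY_RUN=1
install_packages /tmp/rhadoop-src
```

Setting DRY_RUN=0 and running the script on each node (for example over ssh) would perform the actual installation.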
c) Install rhbase (see https://github.com/RevolutionAnalytics/RHadoop/wiki/rhbase). Before installing rhbase you also need to install the Thrift library. Thrift 0.6.1 is recommended and can be downloaded from http://thrift.apache.org/. The detailed installation steps are as follows:
I. On CentOS, run the shell command sudo yum install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel to install the tools and libraries that Thrift needs. Because of network connection problems, not all of these were installed in this attempt. In my experience a complete install is not necessary, as long as g++ is version 3.3.5 or later and Boost is version 1.33.1 or later.
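The two minimum versions mentioned in this step can be checked from the shell. The sketch below is an illustration under assumptions: version_ge is a hypothetical helper written for this example, and the Boost header location /usr/include/boost/version.hpp is assumed from the --with-boost path used later. g++ -dumpversion and the BOOST_LIB_VERSION macro are standard.

```shell
#!/bin/sh
# version_ge A B: succeeds when dotted version A >= B, comparing up to
# three numeric fields. (Hypothetical helper for this sketch.)
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        for (i = 1; i <= 3; i++) {
            if ((x[i] + 0) > (y[i] + 0)) exit 0
            if ((x[i] + 0) < (y[i] + 0)) exit 1
        }
        exit 0
    }'
}

if command -v g++ >/dev/null 2>&1; then
    gxx=$(g++ -dumpversion)
    if version_ge "$gxx" 3.3.5; then
        echo "g++ $gxx is new enough"
    else
        echo "g++ $gxx is older than 3.3.5"
    fi
fi

# Assumed header location; BOOST_LIB_VERSION looks like "1_33_1".
boost_hdr=/usr/include/boost/version.hpp
if [ -f "$boost_hdr" ]; then
    boost=$(sed -n 's/^#define BOOST_LIB_VERSION "\(.*\)".*/\1/p' "$boost_hdr" | tr _ .)
    if version_ge "$boost" 1.33.1; then
        echo "Boost $boost is new enough"
    else
        echo "Boost $boost is older than 1.33.1"
    fi
fi
```

The comparison is numeric per field, so 1.9.0 is treated as older than 1.33.1.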
II. Unpack and install Thrift.
tar -zxvf thrift-0.6.1.tar.gz
cd thrift-0.6.1
./configure --with-boost=/usr/include/boost JAVAC=/usr/jdk1.6/bin/javac
make
make install
Adjust the values of --with-boost and JAVAC to match the actual layout of the server (it is not clear whether the JAVAC setting is required).
III. Set the environment variable PKG_CONFIG_PATH.
Add the following line to /etc/profile:
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig/
Then run source /etc/profile so that the environment variable takes effect. Afterwards, run the shell command pkg-config --cflags thrift to verify that the pkg-config path is set correctly. The setup is successful if the command returns -I/usr/local/include/thrift.
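The verification in this step can also be scripted so that it gives a clear yes or no. This is a minimal sketch assuming the default /usr/local install prefix described above; has_thrift_include is a hypothetical helper name introduced only for this example.

```shell
#!/bin/sh
# has_thrift_include FLAGS: succeeds when FLAGS contains the Thrift
# include path expected for a /usr/local install.
has_thrift_include() {
    case "$1" in
        *-I/usr/local/include/thrift*) return 0 ;;
        *) return 1 ;;
    esac
}

# Append the pkgconfig directory without leaving a leading ':' when
# PKG_CONFIG_PATH was previously empty.
PKG_CONFIG_PATH="${PKG_CONFIG_PATH:+$PKG_CONFIG_PATH:}/usr/local/lib/pkgconfig"
export PKG_CONFIG_PATH

if command -v pkg-config >/dev/null 2>&1; then
    cflags=$(pkg-config --cflags thrift 2>/dev/null || true)
    if has_thrift_include "$cflags"; then
        echo "pkg-config finds thrift: $cflags"
    else
        echo "thrift not found via pkg-config; check PKG_CONFIG_PATH"
    fi
fi
```

If Thrift was installed under a different prefix, both the pkgconfig directory and the expected include path need to be changed accordingly.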
IV. Copy the library file.
cp /usr/local/lib/libthrift.so.0 /usr/lib
V. Install rhbase.
R CMD INSTALL rhbase_1.0.1.tar.gz
Fourth, verification and testing
If library(rmr), library(rhdfs) and library(rhbase) can be entered on the R command line without errors, the installation was successful.
Test case: use MapReduce to implement the same functionality as the function tapply.
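The same load check can be run non-interactively, which is convenient when rmr has to be verified on many nodes. The sketch below uses Rscript, which ships with R; the function name check_r_packages is an assumption made for this example.

```shell
#!/bin/sh
# check_r_packages PKG...: tries library(PKG) for each package, prints one
# "ok"/"FAILED" line per package and returns non-zero if any load failed.
# (Hypothetical helper for this sketch.)
check_r_packages() {
    status=0
    for pkg in "$@"; do
        if Rscript -e "library($pkg)" >/dev/null 2>&1; then
            echo "ok      $pkg"
        else
            echo "FAILED  $pkg"
            status=1
        fi
    done
    return $status
}

if command -v Rscript >/dev/null 2>&1; then
    check_r_packages rmr rhdfs rhbase || echo "at least one package did not load"
else
    echo "Rscript not found on PATH"
fi
```

On worker nodes, where only rmr is installed, the call would be reduced to check_r_packages rmr.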
Ordinary R code:
groups = rbinom(32, n = 50, prob = 0.4)
tapply(groups, groups, length)
R code implemented using MapReduce:
groups = to.dfs(groups) (the groups vector from above is reused so that both versions work on the same data)
from.dfs(mapreduce(input = groups, map = function(k, v) keyval(v, NULL), reduce = function(k, vv) keyval(k, length(vv))))
Fifth, further reference sites:
https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial
https://github.com/RevolutionAnalytics/RHadoop/wiki/Writing-composable-mapreduce-jobs
https://github.com/RevolutionAnalytics/RHadoop/wiki/Efficient-rmr-techniques
Running R programs on a Hadoop cluster: installing RHadoop