A preliminary attempt at Hadoop and RHadoop


http://chen.yi.bo.blog.163.com/blog/static/1506211092012720111910827/

Writing this diary after coming back from a run. I have been studying Hadoop recently. At first I only wrote some Hive statistics scripts, but later I felt that without understanding how Hadoop works I could not optimize those scripts, so I knuckled down and started learning it properly. As everyone knows, I am a computer novice, and working directly on the cluster I was always afraid of breaking a machine, so I took the small risk of wiping the Windows 7 system on my work computer and running pseudo-distributed Hadoop on Ubuntu instead. Over the last two days I have also conveniently tried out the rmr and rhdfs packages from the RHadoop project. As a beginner I ran into plenty of pitfalls along the way, and most of what I previously knew about R did not carry over to the Linux platform, so I took notes. Nothing exciting here, just tidying up a memo.
1. Operating system
I am using the Ubuntu 12.04 Desktop edition. I had long heard that this distribution is friendly for beginners, and I finally gave it a try. It is indeed quite handy: installation is very convenient, it can be installed straight from a USB stick, and reinstalling whenever needed is also easy. Some of the wallpapers are very pretty.
2. Java environment
At first I used JDK 1.7, but later the rJava package refused to load against it; apparently only version 1.6 works. It can be downloaded directly from the official site as jdk-6u34-linux-x64.bin and then extracted to the chosen path:

    sudo cp jdk-6u34-linux-x64.bin /usr/lib/jvm
    cd /usr/lib/jvm
    sudo chmod +x jdk-6u34-linux-x64.bin
    sudo ./jdk-6u34-linux-x64.bin

Next the environment variables need to be configured. There seem to be several ways to do this; after comparing them I chose to edit /etc/environment:

    sudo vim /etc/environment
    PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/lib/jvm/jdk1.6.0_34/bin"
    JAVA_HOME="/usr/lib/jvm/jdk1.6.0_34"
    JRE_HOME="$JAVA_HOME/jre"
    CLASSPATH=".:$JAVA_HOME/lib:$JRE_HOME/lib"

Then select which Java version to use (the trailing number in the --install lines is the alternative's priority, which update-alternatives requires):

    sudo update-alternatives --install /usr/bin/java java /usr/lib/jvm/jdk1.6.0_34/bin/java 300
    sudo update-alternatives --install /usr/bin/javac javac /usr/lib/jvm/jdk1.6.0_34/bin/javac 300
    sudo update-alternatives --config java
    sudo update-alternatives --config javac
    java -version

With that, Java is configured.
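Not part of the original walkthrough, but a quick sanity check at this point can save trouble later. The sketch below simply confirms that the variables and alternatives set above really point at the 1.6 JDK; the expected version string is an assumption based on the jdk1.6.0_34 path used here.

    # quick check that the JDK configured above is the one actually picked up
    source /etc/environment
    echo "$JAVA_HOME"                    # expect /usr/lib/jvm/jdk1.6.0_34
    update-alternatives --display java   # the jdk1.6.0_34 path should be the selected alternative
    java -version                        # expect something like "1.6.0_34"
    javac -version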
3. R
For this part I followed Yihui Xie's guide, which boils down to a few steps. Add the software source in Synaptic:

    deb http://mirrors.ustc.edu.cn/CRAN//bin/linux/ubuntu precise/

Import the signing key and install R:

    gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
    gpg -a --export E084DAB9 | sudo apt-key add -
    sudo apt-get update
    sudo apt-get install r-base-dev

Configure the environment variables:

    vim ~/.Renviron
    R_LIBS_USER="~/R-library"

And configure the startup file:

    vim ~/.Rprofile
    options(repos = c(CRAN = "http://mirrors.ustc.edu.cn/CRAN/",
                      CRANextra = "http://www.stats.ox.ac.uk/pub/RWin"))
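As a small aside of my own (not in the original notes), it is worth confirming that the user library and the CRAN mirror settings are actually picked up before installing any packages; ~/R-library matches the R_LIBS_USER value above and needs to exist first.

    mkdir -p ~/R-library                  # the directory named in R_LIBS_USER above
    Rscript -e '.libPaths()'              # ~/R-library should appear in the list
    Rscript -e 'getOption("repos")'       # should show the USTC CRAN mirror
    R --version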
4. Hadoop
I am using hadoop-1.0.3. After downloading, unpack it to the chosen path; I put it under /home/yibo/hadoop:

    tar -xzf hadoop-1.0.3-bin.tar.gz

Then configure the Hadoop environment variable, similar to the Java one above: in /etc/environment, append ":/home/yibo/hadoop/hadoop-1.0.3/bin" to PATH, so that it becomes "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/lib/jvm/jdk1.6.0_34/bin:/home/yibo/hadoop/hadoop-1.0.3/bin".

Next set up passwordless SSH to localhost; before this, confirm that ssh and rsync are installed:

    ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
    cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
    ssh localhost

Then configure Hadoop itself. First the Java path: in the hadoop-1.0.3/conf/hadoop-env.sh file set JAVA_HOME=/usr/lib/jvm/jdk1.6.0_34. Then modify several XML files under conf.

    vim core-site.xml

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/yibo/hadoop/hadoop-1.0.3/datatmp/hadoop-${user.name}</value>
      </property>
    </configuration>

    vim hdfs-site.xml

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

    vim mapred-site.xml

    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>

That completes the pseudo-distributed Hadoop configuration, and you can run a few test commands. View the version:

    hadoop version

Format a namenode (after this, do not format casually; otherwise the changed namenode ID will cause trouble):

    rm /home/yibo/hadoop/hadoop-1.0.3/datatmp/hadoop-yibo/dfs/data/current/VERSION   # a namenode change requires editing or deleting this VERSION file
    hadoop namenode -format

Then start the services:

    start-all.sh

Now the status can be viewed in a browser:

    NameNode   - http://localhost:50070/
    JobTracker - http://localhost:50030/

Wait a moment for the datanode to start, then copy files into the distributed file system:

    hadoop fs -put conf input

If this is not the namenode's first use it may enter safe mode, but that seems to be lifted after about 30 seconds. To release it manually:

    hadoop dfsadmin -safemode leave

The next step is to run a test job that greps the input for words beginning with "dfs":

    hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

View the test results; three records are shown:

    hadoop fs -cat output/*

Finally, stop the services:

    stop-all.sh
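One extra check I find useful (my own addition, not from the original post): after start-all.sh, the JDK's jps tool should list all five daemons of a pseudo-distributed Hadoop 1.x node, and dfsadmin should report one live datanode.

    jps                      # expect NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker
    hadoop dfsadmin -report  # should report one datanode available
    hadoop fs -ls /          # HDFS responds to file system commands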
5. RHadoop
RHadoop is a Revolution Analytics project containing three R packages (rmr, rhdfs, rhbase), corresponding to MapReduce, HDFS and HBase respectively, which let us use Hadoop for analysis from within the R environment. So far I have had no use for HBase, so I have not tried the rhbase package yet. The first step is to install the rJava package that rhdfs depends on. This process may run into problems with the Java configuration; the following command can help:

    sudo R CMD javareconf JAVA_HOME=/usr/lib/jvm/jdk1.6.0_34

The Hadoop-related environment variables also need to be configured; as before, edit /etc/environment and add two lines:

    HADOOP_CMD="/home/yibo/hadoop/hadoop-1.0.3/bin/hadoop"
    HADOOP_STREAMING="/home/yibo/hadoop/hadoop-1.0.3/contrib/streaming/hadoop-streaming-1.0.3.jar"

Then install the individual R packages in turn:

    sudo R CMD INSTALL -l /usr/local/lib/R/site-library rJava_0.9-3.tar.gz
    sudo R CMD INSTALL -l /usr/local/lib/R/site-library rhdfs_1.0.5.tar.gz
    sudo R CMD INSTALL -l /usr/local/lib/R/site-library Rcpp_0.9.13.tar.gz
    sudo R CMD INSTALL -l /usr/local/lib/R/site-library RJSONIO_0.98-1.tar.gz
    sudo R CMD INSTALL -l /usr/local/lib/R/site-library iterators_1.0.6.tar.gz
    sudo R CMD INSTALL -l /usr/local/lib/R/site-library itertools_0.1-1.tar.gz
    sudo R CMD INSTALL -l /usr/local/lib/R/site-library digest_0.5.2.tar.gz
    sudo R CMD INSTALL -l /usr/local/lib/R/site-library rmr_1.3.1.tar.gz
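Before moving on to the word count, a one-line smoke test is handy. This is my own sketch rather than part of the original steps, and it assumes HADOOP_CMD and HADOOP_STREAMING are set as above: it pushes a small vector into HDFS, runs a trivial rmr job over it, and reads the result back.

    # minimal rmr/rhdfs smoke test; if this finishes, streaming jobs work end to end
    Rscript -e 'library(rmr); library(rhdfs); hdfs.init();
      out <- mapreduce(input = to.dfs(1:10), map = function(k, v) keyval(v, v^2));
      print(from.dfs(out))'   # each value paired with its square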
6. A WordCount example
Finally, use the rmr and rhdfs packages for a word count experiment; the function is adapted from the RHadoop tutorial. First remember to start Hadoop:

    system('start-all.sh')
    system('hadoop dfsadmin -safemode leave')

Load the R packages; the rhdfs package asks for an initialization call:

    require(rmr)
    require(rhdfs)
    hdfs.init()

Write the wordcount function:

    wordcount <- function(input, output = NULL, split = '[[:punct:][:space:]]+') {
      mapreduce(input = input, output = output,
                map = function(k, v) {
                  v2 <- unlist(strsplit(x = v, split = split))
                  v3 <- v2[v2 != '']
                  lapply(v3, function(w) keyval(w, 1))
                },
                reduce = function(k, vv) {
                  keyval(k, sum(unlist(vv)))
                })
    }

Text can be written straight into an HDFS file and counted:

    lines <- c('this is just a joke,',
               'and that is just another joke,',
               'we all like them very much,',
               'because they are so funny.')
    lines_dfs <- to.dfs(lines)
    wc1 <- wordcount(input = lines_dfs)
    data.frame(do.call(rbind, from.dfs(wc1)))

           key val
    1        a   1
    2       is   2
    3       so   1
    4       we   1
    5      all   1
    6      and   1
    7      are   1
    8     joke   2
    9     just   2
    10    like   1
    11    much   1
    12    that   1
    13    them   1
    14    they   1
    15    this   1
    16    very   1
    17   funny   1
    18 another   1
    19 because   1

A local file can also be copied into HDFS and counted; the data file comes from the HadoopStreaming package examples:

    hdfs.put(src = '/home/yibo/anna.txt', dest = '/home/yibo/tmp/anna.txt')
    wc2 <- wordcount(input = '/home/yibo/tmp/anna.txt')
    wc22 <- data.frame(do.call(rbind, from.dfs(wc2)))
    wc22[100:120, ]

         key val
    100 fond   1
    101 from   2
    102 girl   1
    103 gold   1
    104 good   2
    105 gown   1
    106 hand   4
    107 have   1
    108 heat   1
    109 home   1
    110 hour   1
    111 huge   1
    112 hung   1
    113 hurt   1
    114 into   3
    115 just   1
    116 kept   1
    117 last   3
    118 long   1
    119 look   1
    120 maid   1
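As a final aside of my own (not from the original tutorial): the file uploaded with hdfs.put() is an ordinary HDFS file, so the plain Hadoop CLI sees it too, and a classic shell pipeline gives a rough independent check on the counts in wc2. Its tokenization only approximates the split pattern used in wordcount, so small differences are expected.

    hadoop fs -ls /home/yibo/tmp                    # the uploaded anna.txt should be listed
    hadoop fs -cat /home/yibo/tmp/anna.txt | head   # peek at the raw text
    # rough local word count for comparison with wc2
    hadoop fs -cat /home/yibo/tmp/anna.txt | tr -cs '[:alnum:]' '\n' | sort | uniq -c | sort -rn | head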
The RHadoop tutorial puts it well: "Our main goal is not simply to make it easy to run a mapreduce job but to make mapreduce jobs first-class citizens of the R environment and to make it easy to create abstractions based on them." Next, when I get the chance, I will try a truly distributed setup. That's it for this note.
