Hadoop environment installation and a simple MapReduce example

I. Reference book: Hadoop: The Definitive Guide, 2nd Edition (Chinese translation)

II. Hadoop environment installation

1. Install Sun JDK 1.6

1) For now I am only building a Hadoop environment on a single server (CentOS 5.5), so I first uninstall the Java that is already installed.

Uninstall command: yum -y remove java

2) Download Sun JDK 1.6 from: http://download.oracle.com/otn-pub/java/jdk/6u33-b04/jdk-6u33-linux-x64-rpm.bin

3) Install Java (change to the directory containing the JDK installation file)

Make the bin file executable: chmod a+x *

Install it: sudo ./jdk-6u33-linux-x64-rpm.bin

(If you are installing as a normal user, you need to add an entry to the /etc/sudoers file so that the current user can run commands as root. The specific steps are as follows:

A. su root

B. chmod u+w /etc/sudoers

C. vim /etc/sudoers

D. Below the line "root ALL=(ALL) ALL", add a line "username ALL=(ALL) ALL" (where username is the user you want to grant sudo rights). Save and exit.

E. chmod u-w /etc/sudoers

)

4) Set JAVA_HOME

Edit the .bashrc file in the user's home directory and set JAVA_HOME: export JAVA_HOME=/usr
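
After re-opening the shell (or running source ~/.bashrc), a quick way to confirm that the JDK and JAVA_HOME are visible is a tiny test class like the one below. This is just my own illustration (the name VerifyJava is hypothetical, not from the book or the JDK):

// VerifyJava.java -- prints the JVM version and the JAVA_HOME seen by new processes.
public class VerifyJava {
    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("JAVA_HOME    = " + System.getenv("JAVA_HOME"));
    }
}

Compile with javac VerifyJava.java and run with java VerifyJava; the reported version should correspond to the JDK 6 update 33 installed above.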

2. Install Hadoop

1) Download the corresponding Hadoop release from http://hadoop.apache.org/common/releases.html#download (I downloaded version 1.0.3)

2) Extract the archive

Command: tar -xzf hadoop-1.0.3.tar.gz

3) Test whether Hadoop is installed correctly (change to the Hadoop installation directory and run the following commands in order)

A. mkdir input

B. cp conf/*.xml input
C. bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
D. cat output/* (an output of "1 dfsadmin" indicates that Hadoop is installed correctly)

4) Set environment variables

export HADOOP_HOME=/home/username/hadoop/hadoop-1.0.3
export PATH=$PATH:$HADOOP_HOME/bin
export CLASSPATH=.:$HADOOP_HOME/hadoop-core-1.0.3.jar:$HADOOP_HOME/lib:$CLASSPATH

III. Simple MapReduce example

At first, I followed pages 20-23 of the book to run the simple MaxTemperature example (or see http://answers.oreilly.com/topic/455-get-started-analyzing-data-with-hadoop/), but it never worked. On the command line I entered:

% export HADOOP_CLASSPATH=build/classes
% hadoop MaxTemperature input/ncdc/sample.txt output

This produced what looked like a ClassNotFound error; after some modifications, an IOException was thrown instead. After searching online for a long time, I finally found a workable solution.

1. Reference

http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html

http://blog.endlesscode.com/2010/06/16/simple-demo-of-mapreduce-in-java/

2. Main Steps

mkdir maxtemperature
javac -d maxtemperature MaxTemperature.java
jar cvf maxtemperature.jar -C maxtemperature/ .
hadoop jar maxtemperature.jar MaxTemperature sample.txt output

Note:

Copy the code of the map and reduce classes into MaxTemperature.java as static nested classes, then run the javac command. If an Iterator-related error is reported, add the corresponding imports:

import java.util.Collection;
import java.util.HashSet;
import java.util.Iterator;
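
For reference, below is a minimal sketch of what MaxTemperature.java can look like with the mapper and reducer written as static nested classes against the new org.apache.hadoop.mapreduce API in Hadoop 1.0.3. It is not the book's verbatim listing; the NCDC field offsets (year at columns 15-19, temperature and quality code around columns 87-93) are assumptions based on the common form of this example. Note that with the new API the reducer receives an Iterable, so the java.util imports above are only needed if you use the old-API code from the book.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    // Declared static so Hadoop can instantiate it without an enclosing instance.
    static class MaxTemperatureMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final int MISSING = 9999;

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String year = line.substring(15, 19);            // assumed NCDC year field
            int airTemperature;
            if (line.charAt(87) == '+') {                    // assumed sign/temperature field
                airTemperature = Integer.parseInt(line.substring(88, 92));
            } else {
                airTemperature = Integer.parseInt(line.substring(87, 92));
            }
            String quality = line.substring(92, 93);         // assumed quality-code field
            if (airTemperature != MISSING && quality.matches("[01459]")) {
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }

    // Also static, for the same reason as the mapper.
    static class MaxTemperatureReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int maxValue = Integer.MIN_VALUE;
            for (IntWritable value : values) {
                maxValue = Math.max(maxValue, value.get());
            }
            context.write(key, new IntWritable(maxValue));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(MaxTemperature.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Compiled and packaged with the commands above, hadoop jar maxtemperature.jar MaxTemperature sample.txt output should then write one maximum temperature per year to the output directory (in standalone mode, sample.txt is read from the local file system).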

IV. Thoughts

Setting up the Hadoop environment for the first time today, the main difficulty was that I ran into errors even when following the book's instructions step by step. I am not sure whether the book is out of date or I made mistakes; on top of that, I am not familiar with Java, and this cost me several hours. In the end I found a correct approach and successfully ran a simple MapReduce example (standalone mode). Overall, I have taken the first step and feel a small sense of accomplishment. I hope to use this summer to study Hadoop in depth. Keep it up ~

V. Supplement

Page 25 of Hadoop: The Definitive Guide (2nd Chinese edition) notes that starting with release 0.20.0, Hadoop introduced a new API that is not compatible with the old one; existing applications must be rewritten to use the new API. This explains why code written against the old API can report strange errors such as the ClassNotFound-like one above.

Here are some notable differences between the new API and the old API (taken from the book):

1. The new API favors abstract classes over interfaces because they are easier to evolve: a method with a default implementation can be added to an abstract class without breaking existing implementations. In the new API, Mapper and Reducer are abstract classes.
2. The new API lives in the org.apache.hadoop.mapreduce package (and its sub-packages). The old API lives in org.apache.hadoop.mapred.
3. The new API makes extensive use of context objects, through which user code communicates with the MapReduce system. For example, MapContext essentially unifies the roles of JobConf, OutputCollector, and Reporter. A small comparison sketch follows this list.
4. The new API supports both "push" and "pull" styles of iteration. In both APIs, key/value record pairs are pushed to the mapper, but in addition the new API allows records to be pulled from within the map() method; the same applies to the reducer. A useful example of the pull style is processing records in batches rather than one at a time.
5. Configuration is unified in the new API. The old API has a special JobConf object for job configuration, an extension of Hadoop's general Configuration object (used for configuring daemons; see section 5.1). In the new API there is no such distinction, and job configuration is done through a Configuration.
6. Job control is handled by the Job class rather than JobClient, which no longer exists in the new API.
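
To make the difference concrete, here is a small illustration of my own (not from the book) showing the same trivial mapper written against both APIs; both versions compile against hadoop-core-1.0.3 and could be saved together in a file such as ApiComparison.java (a hypothetical name):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Old API: Mapper is an interface in org.apache.hadoop.mapred; output goes
// through an OutputCollector and progress/status through a Reporter.
class OldApiLineMapper extends org.apache.hadoop.mapred.MapReduceBase
        implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value,
            org.apache.hadoop.mapred.OutputCollector<Text, IntWritable> output,
            org.apache.hadoop.mapred.Reporter reporter) throws IOException {
        output.collect(new Text(value.toString()), new IntWritable(1));
    }
}

// New API: Mapper is an abstract class in org.apache.hadoop.mapreduce; a single
// Context object replaces OutputCollector and Reporter.
class NewApiLineMapper
        extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(value.toString()), new IntWritable(1));
    }
}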
