Compiling Hadoop 2.0 source code in Eclipse

Tags: hadoop, mapreduce

Hadoop is a distributed system infrastructure maintained by the Apache Software Foundation. Official website: http://hadoop.apache.org/

The Hadoop project mainly comprises the following four modules:

  • Hadoop Common: common utilities and infrastructure that support the other Hadoop modules.
  • Hadoop HDFS: a high-performance, high-throughput distributed file system.
  • Hadoop MapReduce: a YARN-based distributed computing framework for parallel processing of large data sets.
  • Hadoop YARN: a framework for job scheduling and cluster resource management, introduced as the new-generation MapReduce framework (MRv2). If you are interested, see: http://www.ibm.com/developerworks/cn/opensource/os-cn-hadoop-yarn/

Due to project requirements, I only need the first two submodules of Hadoop, namely Hadoop Common and Hadoop HDFS.

Before compiling the source code, let me introduce my development environment:

  • Ubuntu 12.04 LTS
  • Eclipse 4.3
  • JDK 1.6.0_45
  • Maven 3.0.4
  • SVN 1.6.17
  • Protocol Buffers (Ubuntu seems to ship it; if not, download and install it yourself)
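
Before going any further, it is worth confirming that these tools are installed and on the PATH (a quick sanity check; your versions need not match mine exactly):

java -version      # expect a 1.6.x JDK
mvn -version       # expect Maven 3.x
svn --version
protoc --version   # the Protocol Buffers compiler, needed by the Hadoop build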

The latest hadoop uses Maven as the project construction tool, so Maven must be installed in the system. Next we will officially start the compilation of hadoop source code.

First, use SVN to check out the latest version of Hadoop (hadoop 2.*):

svn checkout http://svn.apache.org/repos/asf/hadoop/common/trunk/ hadoop-dev
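
Trunk tracks current development, which is where the 3.0.0-SNAPSHOT version seen later comes from. If you would rather build from a stable 2.x line, a release branch can be checked out instead (branch-2 here is an assumption about the repository layout at the time; adjust the branch name as needed):

svn checkout http://svn.apache.org/repos/asf/hadoop/common/branches/branch-2/ hadoop-dev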

Open the hadoop-dev folder. (The original post shows its directory structure as a screenshot, which is not reproduced here.)

This is the Hadoop source tree. As an aside, it contains 1,231,074 lines of source code (including comments and blank lines). This article focuses on two subprojects: hadoop-common-project and hadoop-hdfs-project.
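
A line count of that sort can be reproduced with ordinary shell tools (a rough sketch; the author does not say exactly how the number was obtained, so the file types counted may differ):

find . -name '*.java' -print0 | xargs -0 cat | wc -l   # counts lines in Java sources only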

Next we need to build the Hadoop project so that it can be imported into Eclipse. Although we only care about the two sub-projects mentioned above, to avoid dependency problems later, run the following commands from the project root directory:

cd ~/hadoop-dev
mvn install -DskipTests
mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true

While the mvn (Maven) commands run, make sure the network connection stays up, because Maven may download jar packages to resolve dependencies. This can take a while. Once the commands finish, the projects are ready to be imported into Eclipse.

Before importing, there is one more task: install the Maven plug-in for Eclipse (m2e). The installation procedure is not covered here.

The next step is to open Eclipse and import the projects for compilation. The steps are as follows:

  • Menu File > Import...
  • Select "Existing Projects into Workspace"
  • Select the hadoop-common-project directory under the hadoop-dev directory as the root directory
  • Select the hadoop-annotations, hadoop-auth, hadoop-auth-examples, hadoop-nfs, and hadoop-common projects
  • Click "Finish"
  • Menu File > Import...
  • Select "Existing Projects into Workspace"
  • Select the hadoop-assemblies directory under the hadoop-dev directory as the root directory
  • Select the hadoop-assemblies project
  • Click "Finish"
  • Menu File > Import...
  • Select "Existing Projects into Workspace"
  • Select the hadoop-hdfs-project directory under the hadoop-dev directory as the root directory
  • Select the hadoop-hdfs project
  • Click "Finish"

Because my project uses only some of the Hadoop modules, only those are imported here. If you want to import other modules for further development, import the corresponding sub-projects in the same way as above.

Next, compile Hadoop from Eclipse. Click Run > Run Configurations... to open the run-configuration dialog, then double-click "Maven Build" in the list on the left to create a new configuration page.

Note that Base directory should be the root directory of the Hadoop project, i.e. ~/hadoop-dev. Click Run to start compiling the Hadoop project. This takes some time; keep the network connection up during this period, for the same reason as above.
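
The original post shows the configuration page as a screenshot. The fields are roughly as follows; the Goals value is my assumption, mirroring the command-line build shown next:

Name:           hadoop-build
Base directory: ~/hadoop-dev
Goals:          package -Pdist -DskipTests -Dtar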

In fact, the above can also be done from the command line; the Eclipse plug-in merely saves that step. The command-line build is as follows:

cd ~/hadoop-dev
mvn package -Pdist -DskipTests -Dtar

Back in Eclipse, when the compilation succeeds the Console window prints BUILD SUCCESS, indicating that the Hadoop project has been compiled successfully.

To debug Hadoop, the next step is to set up a runnable Hadoop environment from the build above.

The build output is stored in the target directory of each project. Taking hadoop-common as an example, the result is in ~/hadoop-dev/hadoop-common-project/hadoop-common/target/hadoop-common-3.0.0-SNAPSHOT. (The original post shows this directory's structure as a screenshot.)
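
Based on the description later in this post, the layout of that directory is roughly:

bin/      # hadoop command-line tools
etc/      # configuration files
sbin/     # scripts to start and stop daemons
share/    # jar packages required by hadoop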

Other modules such as hadoop-hdfs and hadoop-mapreduce are under their own target directories; the paths are similar to the above, and the directory structure is the same.

First, create a hadoop directory (mkdir ~/hadoop) and copy the contents of each build output directory into it, as sketched below. Since I only use Common and HDFS, I copy only those two sub-projects. (I have not yet found a good way to build everything into one place; for now I can only build and then copy from each sub-project. Please leave a message if you have a solution.)
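
A minimal sketch of that copy step (the paths follow the 3.0.0-SNAPSHOT build output above; cp -r merges the two trees into ~/hadoop):

mkdir -p ~/hadoop
cp -r ~/hadoop-dev/hadoop-common-project/hadoop-common/target/hadoop-common-3.0.0-SNAPSHOT/* ~/hadoop/
cp -r ~/hadoop-dev/hadoop-hdfs-project/hadoop-hdfs/target/hadoop-hdfs-3.0.0-SNAPSHOT/* ~/hadoop/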

Because the process above is tedious, I wrote a script for it, which is on GitHub: https://github.com/meibenjin/hadoop2.0-configuration. If you can't wait, just copy by hand first. After completing the preceding operations, ~/hadoop has the directory structure described below.

Now, a brief look at the directory structure of the new version of Hadoop. It resembles a Linux directory layout: the bin and sbin directories contain the Hadoop commands, the etc directory contains the configuration files, and the share directory contains the jar packages Hadoop needs.
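
To run those commands conveniently, you may also want to put ~/hadoop on your PATH; this is my own addition, not something the original post does:

# appended to ~/.bashrc, for example
export HADOOP_HOME=~/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH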

I will not cover the Hadoop configuration itself here (if necessary, I will write another post); the details are on the Hadoop website. You can also read this blog: http://www.cnblogs.com/scotoma/archive/2012/09/18/2689902.html. Note that the slaves and yarn-site.xml files it mentions live under hadoop-yarn-project. For convenient debugging, configure Hadoop in pseudo-distributed mode.
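
Purely as an illustration (the post itself defers the real configuration to the links above), a pseudo-distributed setup usually comes down to two properties; the values below are common defaults, not taken from the original post:

cat > ~/hadoop/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <!-- HDFS namenode on the local machine -->
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

cat > ~/hadoop/etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <!-- a single node can hold only one replica -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF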

After the configuration succeeds, start the Hadoop processes. The commands are as follows:

hadoop namenode -format
start-dfs.sh

Run the following command to check whether the processes started successfully:
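
The command itself appears only as a screenshot in the original. On a typical installation it would be jps, which lists running Java processes; the sample output below is illustrative, not the author's:

jps
# roughly, something like:
# 12345 NameNode
# 12346 DataNode
# 12347 SecondaryNameNode
# 12348 Jps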

Output like that indicates the Hadoop-related processes have started successfully.

To be continued...

This process has been verified: it works.
