Design of the MapReduce development environment based on Eclipse


By Vincentzh

Original link: http://www.cnblogs.com/vincentzh/p/6055850.html

I meant to write this up last weekend, but my environment was not set up yet and I ran into problems when running things, so it dragged on until Monday before everything was resolved. It also gave me a chance this week to review what I had read earlier; going over the code again while reviewing deepened my understanding considerably.

Catalogue
    1. Overview
    2. Environment Preparation
    3. Configure Plug-ins
    4. File System Connection Configuration
    5. Test Connection
    6. Code writing and execution
    7. Problem Troubleshooting
      7.1 No log output on the console
      7.2 Permissions Issues
1. Overview

Hadoop provides Java APIs for developing processing programs. By setting up a familiar Eclipse development environment locally, you can develop and debug MapReduce programs, run them, and view their output directly from Eclipse, without deploying the code to the cluster. This makes it convenient to debug and validate the data processing logic against the results on sampled data. Once the processing logic has been verified to be correct, the whole program can be packaged and uploaded to the cluster to process the complete data.

Before building the development environment, you need to deploy a Hadoop environment yourself. This is closer to the real thing, since the setup and configuration of such a cluster are almost no different from the configuration and maintenance of a large cluster (for building the Hadoop environment, see: Hadoop standalone/pseudo-distributed deployment, Hadoop cluster/distributed deployment).

Programs are usually developed and debugged in a standalone or pseudo-distributed environment. The standalone environment uses the local file system, so you can easily fetch and inspect the results of the code with ordinary Linux commands. In contrast, in pseudo-distributed and cluster environments the code reads and writes data directly on HDFS, which, compared with the local environment, requires putting/getting data between the local file system and HDFS and is much more cumbersome. Since development and debugging use sampled data (otherwise the code would take too long to run), the code is verified in the standalone or pseudo-distributed environment and only then deployed to the cluster to process the complete data. I deployed two environments in virtual machines, one pseudo-distributed and one small cluster. Of course a single deployment can be switched between standalone, pseudo-distributed, and cluster modes, but switching my own deployment back and forth is a hassle, so I simply set up both environments and connect to whichever one I need directly from the development environment.

2. Environment Preparation

1) Configure the cluster and start all daemons. For cluster setup, see: Hadoop standalone/pseudo-distributed deployment, Hadoop cluster/distributed deployment.

2) Install Eclipse, and install locally the same versions of the JDK and Hadoop as used on the cluster.

  

3. Configure Plug-ins

Download Hadoop2.x-eclipse-plugin.jar and put it into Eclipse's \plugins directory, then restart Eclipse. Under Window -> Show View -> Other you will now find the Map/Reduce view, and a DFS Locations entry, which looks like a folder, appears in the project space on the left.

4. File System Connection Configuration

Switch to the Map/Reduce view to do the configuration.

Pay attention to the configuration here: it must be consistent with the core-site.xml configuration file on your cluster (my configuration is attached). Some people write the hostname directly into the Host field, but the hostname-to-IP mapping exists only inside the Hadoop environment; the local environment cannot resolve a 'master' or 'hadoop' that you write there. The most straightforward approach is to configure with the IP address, and to use the IP address in the core-site.xml configuration file as well.
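For reference, a minimal sketch of the corresponding core-site.xml entry is shown below. The NameNode address 192.168.1.110:9000 is an assumption taken from the WordCount example later in this post; replace it with your own cluster's IP and port, and keep the plug-in's DFS Master Host/Port fields consistent with it.

<!-- core-site.xml on the cluster: address the NameNode by IP, not hostname -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://192.168.1.110:9000</value>
</property>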

  

5. Test Connection

Once the configuration is complete, test the connection. You need to confirm first that all daemons have started correctly.
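Besides the plug-in's own connection test, a small Java program can verify connectivity to HDFS from the local machine. The sketch below is not from the original post: the class name and listed path are illustrative, and the NameNode address is assumed to be the one used elsewhere in this post.

package com.cnblogs.vincentzh.hadooptest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lists a directory on HDFS to confirm the local machine can reach the NameNode.
public class HdfsConnectionTest
{
    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        // Assumed NameNode address; keep it consistent with core-site.xml on the cluster.
        conf.set("fs.defaultFS", "hdfs://192.168.1.110:9000");
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop")))
        {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}

If the connection works, the program prints the entries under /user/hadoop; otherwise it fails with a connection or permission exception.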

6. Code writing and execution

You can write the test code yourself, or, if you just want the satisfaction of seeing the environment work, take it straight from the official website. The link is here.

  If, when writing or copying the code, you run into the problem that the project workspace has none of the packages needed for MapReduce development and wonder where those development packages are supposed to come from, the answer is right here.

Before testing the code, note that a newly created project does not contain the jar packages needed for MapReduce development. This is why the same version of Hadoop has to be installed locally on Windows: the jars required to compile MapReduce programs are taken from its installation directory. Set the path of the local Windows Hadoop installation in Window -> Preferences -> Hadoop Map/Reduce (for example: E:\ProgramPrivate\hadoop-2.6.0). Once this is set, the Hadoop-related jar packages are imported automatically when a Map/Reduce project is created.

Here I simply use the basic implementation classes provided by the Hadoop API to implement WordCount; the specific parameter configuration can be used as a reference.

package com.cnblogs.vincentzh.hadooptest;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

// WordCount implemented with the basic implementation classes provided by the Hadoop API
public class WordCount2
{
    public static void main(String[] args)
    {
        // JobClient client = new JobClient();
        Configuration conf = new Configuration();
        JobConf jobConf = new JobConf(conf);

        jobConf.setJobName("WordCount2");
        Path in = new Path("hdfs://192.168.1.110:9000/user/hadoop/input");
        Path out = new Path("hdfs://192.168.1.110:9000/user/hadoop/output");
        FileInputFormat.addInputPath(jobConf, in);
        FileOutputFormat.setOutputPath(jobConf, out);
        jobConf.setMapperClass(TokenCountMapper.class);   // library mapper: emits (token, 1)
        jobConf.setCombinerClass(LongSumReducer.class);
        jobConf.setReducerClass(LongSumReducer.class);     // library reducer: sums the counts
        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(LongWritable.class);

        // client.setConf(jobConf);
        try
        {
            JobClient.runJob(jobConf);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}

When execution completes, the job's execution statistics are printed, and after refreshing the folder under DFS Locations on the left you will see the output files in the HDFS file system.

  

7. Problem Troubleshooting

Many problems can come up when running the program; only the problems I ran into and their solutions are listed here. Problems I did not encounter I naturally cannot share with passers-by.

7.1 No log output on the console

  When Eclipse executes the MapReduce program, the console shows no output from the program at all: no log information and no execution information, so there is no way to know the result of the run.

  Cause: The console has no log output because the log configuration file is missing from the project.

  Solution: Copy the log4j.properties file from the Hadoop configuration directory ($HADOOP_HOME/etc/hadoop/) directly into the project.
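The copied file is the authoritative one; if you prefer to create the file by hand instead, a minimal console-only log4j.properties might look like the sketch below (these entries are an assumption, not the contents of the cluster's file).

# Minimal console logging configuration for the project (illustrative).
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n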

  

7.2 Permissions Issues

  When the MapReduce program is executed in Eclipse, an error message similar to the following is reported: org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=john, access=WRITE, inode="input":hadoop:supergroup:rwxr-xr-x ...

  Cause: HDFS in the Hadoop deployment grants read and write access only to the user of the deployment environment (most people probably use 'hadoop'), whereas our development environment is built locally on Windows and submits and executes the job directly as the local Windows user. Hadoop authenticates and authorizes the submitting user when a job is submitted and executed, so the Windows user naturally has no permission to read and write HDFS files or to submit and execute jobs.

  Solution: Set the property dfs.permissions to false in the hdfs-site.xml configuration file on the cluster.
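As a sketch, assuming a Hadoop 2.x cluster where the property is named dfs.permissions.enabled (older releases use dfs.permissions), the entry would look like the following; disabling permission checking is only advisable in a development or test environment.

<!-- hdfs-site.xml on the cluster: turn off HDFS permission checking (development only) -->
<property>
  <name>dfs.permissions.enabled</name>
  <value>false</value>
</property>

Restart the NameNode after changing this setting so it takes effect. Alternatively, grant the Windows user write access to the relevant HDFS directories instead of disabling permission checks altogether.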
