Preface:
Two years after graduation, my previous work never touched big data, so Hadoop and related tools were unfamiliar to me, and I only recently started learning them. As a first-time learner, the process has been full of doubts and confusion, but my strategy is to first get the environment running, and then, building on that, think more about the why as I use it.
Over these three weeks (basically just Saturdays and Sundays; the other days were spent on overtime, sigh), what I have mainly completed is:
1. Deployed Hadoop in pseudo-distributed mode in a Linux environment (with passwordless SSH login) and ran the WordCount example successfully. http://www.cnblogs.com/PurpleDream/p/4009070.html
2. Built the Hadoop plugin for Eclipse myself. http://www.cnblogs.com/PurpleDream/p/4014751.html
3. Accessed Hadoop from Eclipse and ran WordCount successfully. http://www.cnblogs.com/PurpleDream/p/4021191.html
So I will record my process in three parts, mainly for my own convenience; if it also helps someone else, all the better!
=============================================================== Long split line ===============================================================
Body:
In the previous two articles, I described how, while first learning Hadoop, I deployed Hadoop in pseudo-distributed mode to a Linux environment and how I compiled the Hadoop Eclipse plugin myself. If you need them, you can follow the links to the first two articles listed in the preface.
Today, I'll explain how to run MapReduce from Eclipse. We'll start with how to configure a DFS Location in Eclipse, and then explain how to run the WordCount example against the configured location.
The first step is to configure the DFS Location:
1. After opening Eclipse, switch to the Map/Reduce perspective and click the "New Hadoop location" icon in the lower right corner. A dialog box pops up with two tabs that need to be configured: General and Advanced Parameters.
2. First we configure the General tab. Here we need to fill in the hosts for Map/Reduce and HDFS.
(1). When we installed Hadoop on Linux earlier, we modified two configuration files, mapred-site.xml and core-site.xml, and the host configured there was localhost plus a port number. Because we are now accessing Hadoop on the Linux server remotely from Eclipse, we need to change localhost in those files to the server's IP address while keeping the port numbers the same (alternatively, you can follow articles online and configure the hosts file instead).
(2). Then, in the General tab we just opened in Eclipse, set the Map/Reduce host to the IP address and port number from your mapred-site.xml, and set the HDFS host to the IP address and port number from your core-site.xml. Note that if you tick the "Use M/R Master Host" option under HDFS, the HDFS host defaults to the IP address configured for Map/Reduce, while the port number can still be set separately.
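To sanity-check that the host and port you entered really reach the remote HDFS, you can run a small program from the same Eclipse project. The following is only a minimal sketch, assuming the Hadoop 1.x API and a placeholder server address and port (192.168.1.100:9000) that you should replace with the values from your own core-site.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsConnectionCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same value as fs.default.name in core-site.xml, with localhost
        // replaced by the server's IP address (placeholder values here).
        conf.set("fs.default.name", "hdfs://192.168.1.100:9000");

        FileSystem fs = FileSystem.get(conf);
        // List the HDFS root directory; if this prints without an exception,
        // the host/port used for the DFS Location are reachable.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}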
3. Next we configure the Advanced Parameters tab. It contains many more options, but don't be nervous: on a first configuration we can simply keep the defaults.
(1). Opening this tab, look through its contents and pay particular attention to the options whose values contain a "/tmp/..." directory structure. If we customize the configuration later, these are in fact the items that need to be changed. By default "dfs.data.dir" lives under the "/tmp" directory on the Linux server, sometimes with your account name in the path; you can adjust this according to your needs.
(2). If you do not want to use the defaults above, add a "dfs.data.dir" entry to the hdfs-site.xml file of your Hadoop installation according to your own needs, and create the corresponding directory in advance. My configuration is:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/myself_setted/hadoop/hadoop-1.0.1/myself_data_dir/dfs_data_dir</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
(3). Taking my configuration as an example: having made this change on the Hadoop server side, I then need to modify the values of the following options in the Advanced Parameters tab. Note that there is a small trick here: find the "hadoop.tmp.dir" option and set it to your custom directory location, close the dialog, then select that location again and click the "Edit Hadoop location" icon in the lower right corner (next to "New Hadoop location") and switch back to the Advanced Parameters tab. You will see that the directory prefixes of several options have changed. Then check the remaining options to make sure every directory prefix has been updated, and you are done.
dfs.data.dir = /myself_setted/hadoop/hadoop-1.0.1/myself_data_dir/hadoop_tmp_dir/dfs/data
dfs.name.dir = /myself_setted/hadoop/hadoop-1.0.1/myself_data_dir/hadoop_tmp_dir/dfs/name
dfs.name.edits.dir = /myself_setted/hadoop/hadoop-1.0.1/myself_data_dir/hadoop_tmp_dir/dfs/name
fs.checkpoint.dir = /myself_setted/hadoop/hadoop-1.0.1/myself_data_dir/hadoop_tmp_dir/dfs/namesecondary
fs.checkpoint.edits.dir = /myself_setted/hadoop/hadoop-1.0.1/myself_data_dir/hadoop_tmp_dir/dfs/namesecondary
fs.s3.buffer.dir = /myself_setted/hadoop/hadoop-1.0.1/myself_data_dir/hadoop_tmp_dir/s3
hadoop.tmp.dir = /myself_setted/hadoop/hadoop-1.0.1/myself_data_dir/hadoop_tmp_dir
mapred.local.dir = /myself_setted/hadoop/hadoop-1.0.1/myself_data_dir/hadoop_tmp_dir/mapred/local
mapred.system.dir = /myself_setted/hadoop/hadoop-1.0.1/myself_data_dir/hadoop_tmp_dir/mapred/system
mapred.temp.dir = /myself_setted/hadoop/hadoop-1.0.1/myself_data_dir/hadoop_tmp_dir/mapred/temp
mapreduce.jobtracker.staging.root.dir = /myself_setted/hadoop/hadoop-1.0.1/myself_data_dir/hadoop_tmp_dir/mapred/staging
4. After the steps above, our own location is configured. If everything is fine, it will now appear under "DFS Locations" in the upper left corner of Eclipse. Right-click the location and choose "Refresh" or "Reconnect"; if the earlier configuration is correct, you will see the a.txt file we uploaded in the first article and the output folder we previously generated by running Hadoop on the Linux server side. If you have not uploaded any file, only the "dfs.data.dir" directory will be displayed.
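If you have not uploaded a test file yet, you can either run hadoop fs -put on the server or push one from Eclipse. The sketch below is only an illustration under the same assumptions as before (a placeholder server address and placeholder local/HDFS paths that you should replace with your own):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://192.168.1.100:9000"); // placeholder server address

        FileSystem fs = FileSystem.get(conf);
        // Copy a local file (e.g. a.txt) into an HDFS input directory;
        // both paths here are placeholders, adjust them to your own layout.
        fs.copyFromLocalFile(new Path("D:/tmp/a.txt"), new Path("/user/hadoop/input/a.txt"));
        fs.close();
    }
}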
The second step is to run the WordCount example:
1. After the Location is configured, we can create a MapReduce project in Eclipse.
(1). Use a decompiler to open hadoop-examples-1.0.1.jar from the hadoop-1.0.1 installation package and copy the WordCount class into the project you just created. Note that if you previously referenced the Hadoop Eclipse plugin compiled by the method in my second article, you need one extra step here: right-click the project, select Build Path, and among the jars whose names start with "hadoop-", remove all of them except "hadoop-core-1.0.1.jar" and "hadoop-tools-1.0.1.jar". The main reason is that if they are not removed, some of the classes used by WordCount will be imported from the wrong jar.
2. After the project is set up, right-click the WordCount class and choose "Run Configurations" to open the dialog, switch to the "Arguments" tab, and fill in your HDFS input file path and output directory under "Program arguments" (the sketch after this step shows how those two arguments are used). Then click "Run". If no exception is thrown in the console, the run succeeded; you can then select the location in the upper left corner, choose "Refresh", and it will display your output folder and the result file produced by the run.
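For reference, this is roughly what the WordCount class looks like and how it consumes the two program arguments as input and output paths. It is a condensed sketch of the standard example shipped in hadoop-examples-1.0.1 (the GenericOptionsParser handling is omitted), and the argument values in the comments are placeholders:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Splits each input line into tokens and emits (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Sums the counts collected for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    // "Program arguments" in the Run Configuration become args[0] and args[1],
    // e.g. (placeholders): hdfs://192.168.1.100:9000/user/hadoop/a.txt
    //                      hdfs://192.168.1.100:9000/user/hadoop/output
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}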
The third step is troubleshooting:
1. If the output folder from a previous run still exists and you run the WordCount method directly in Eclipse, the console will most likely report an "output folder already exists" error. In that case you only need to delete the output folder in the Location and the error will no longer be reported.
2. If during the run you get "org.apache.hadoop.security.AccessControlException: Permission denied: ....", the error occurs because the local user does not have permission to operate on Hadoop remotely, so we need to set the dfs.permissions property in hdfs-site.xml to false (the default is true); refer to the hdfs-site.xml configuration shown above.
3. If during the run you get "Failed to set permissions of path: ....", the workaround is to modify the checkReturnValue method in /hadoop-1.0.1/src/core/org/apache/hadoop/fs/FileUtil.java and simply comment its body out (a bit crude, but under Windows the check can be skipped). Note that the usual method found online for this change is to recompile the hadoop-core source and repackage it. If you want to save some effort, you can instead create a package named org.apache.hadoop.fs in your project, copy FileUtil.java into it, and modify checkReturnValue there as shown in the sketch below. This works because at runtime a class compiled in your own project takes precedence on the classpath over the class with the same package and name inside the jar.
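The relevant fragment of the copied FileUtil.java would then look roughly like this (only checkReturnValue is shown; everything else in the file stays exactly as copied from the Hadoop 1.0.1 source, and the original body is kept as a comment):

package org.apache.hadoop.fs;

// ... all other imports and methods of the copied FileUtil.java remain unchanged ...

  // The original check threw an IOException when setting POSIX-style permissions
  // failed, which is what breaks local runs under Windows; with the body
  // commented out the method becomes a no-op.
  private static void checkReturnValue(boolean rv, File p, FsPermission permission)
      throws IOException {
    // if (!rv) {
    //   throw new IOException("Failed to set permissions of path: " + p +
    //       " to " + String.format("%04o", permission.toShort()));
    // }
  }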
After the steps above, I think in most cases you will have succeeded in accessing your Hadoop remotely from Eclipse. You may well run into other problems along the way, but as long as you patiently search for information online, I believe they can be solved; just remember not to get too anxious.