Distributed parallel programming with Hadoop, Part 3


I. Preface

In the first article of this series, "Using Hadoop for distributed parallel programming, Part 1: Basic concepts and installation", we introduced the MapReduce computing model, the distributed file system HDFS, and the basic principles of distributed parallel computing, and explained in detail how to install Hadoop and how to run Hadoop-based parallel programs in standalone and pseudo-distributed mode (simulating multiple nodes with several processes on a single machine). In the second article, "Using Hadoop for distributed parallel programming, Part 2: Program example and analysis", we showed how to write a Hadoop-based MapReduce parallel program for a specific computing task. This article introduces a real distributed Hadoop environment: how to deploy it on several ordinary computers, how to deploy and run MapReduce programs remotely in that environment, and, briefly, "cloud computing" platforms and on-demand rental of computing capacity.




II. Preparation

1. Hardware and network

We use three machines, named homer06, homer07 and homer08, all running Red Hat Enterprise Linux 5.0 (other Linux distributions work as well). Make sure the network between the machines is working and that machine names and IP addresses resolve correctly, i.e. the machine name of any machine can be pinged from any other machine. If name resolution is a problem, it can be fixed by editing the hosts file (see the example entries below); a better solution, of course, is to configure a DNS server on your network. In addition, create the same user account, for example caoyuz, on all three machines, or simply use the root account.
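
If you take the hosts file approach, append the same entries on all three machines. The following is only a sketch; the IP addresses are hypothetical examples and must be replaced with the real addresses of your machines.

Entries to append to /etc/hosts on homer06, homer07 and homer08 (example addresses):

192.168.0.106   homer06
192.168.0.107   homer07
192.168.0.108   homer08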

We will use homer06 as the Name Node of the distributed file system HDFS and as the Job Tracker node of the MapReduce runtime, and we will call homer06 the master node. The other two machines (homer07, homer08) serve as HDFS Data Nodes and as Task Tracker nodes of the MapReduce runtime, and are collectively referred to as slave nodes. If you need to deploy more machines, it is easy to add them as additional Data Node and Task Tracker nodes; their configuration is the same as for the three machines described in this article, so it is not repeated here.

2. SSH Configuration

In a Hadoop distributed environment, the Name Node (master node) uses SSH to start and stop the various processes on the Data Nodes (slave nodes). We therefore need to make sure that every machine in the environment is reachable over SSH, and that the Name Node can log in to the Data Nodes via SSH without entering a password, so that it can control the other nodes in the background. This is done by configuring passphrase-free public key authentication for SSH on each machine.

Current popular Linux distributions generally come with OpenSSH, the open source implementation of the SSH protocol, already installed and with the SSH service already started, which means these machines support SSH login by default. If your machine does not support SSH by default, please download and install OpenSSH.
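
To confirm that the SSH daemon is actually running on a machine, a quick check such as the following can be used (a sketch for Red Hat style systems; the service command and service name may differ on other distributions):

homer06: $ /sbin/service sshd status
homer06: $ ssh localhost     # should ask for a password if sshd is up and key authentication is not yet configured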

The following is the process of configuring passphrase-free public key authentication for SSH. First, execute the command on the homer06 machine, as shown in Listing 1:




Code Listing 1


homer06: $ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/caoyuz/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/caoyuz/.ssh/id_rsa.
Your public key has been saved in /home/caoyuz/.ssh/id_rsa.pub.
The key fingerprint is:
2e:57:e2:bf:fd:d4:45:5c:a7:51:3d:f1:51:3c:69:68 root@krusty04


This command generates an RSA key pair for the current user caoyuz on homer06. The key pair is saved under the default path /home/caoyuz/.ssh/id_rsa; simply press Enter when asked for a passphrase. The generated private key and public key are stored in the /home/caoyuz/.ssh directory as the two files id_rsa and id_rsa.pub. Next, append the contents of the id_rsa.pub file to the end of the /home/caoyuz/.ssh/authorized_keys file on every machine, including homer06 itself; if a machine has no /home/caoyuz/.ssh/authorized_keys file, create one. Note that the content of id_rsa.pub is one long line, so be careful when copying it not to drop characters or mix in extra line breaks (one way to do the copy is sketched below).
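
The append can be done by hand in an editor, or with commands such as the following. This is only a sketch: it assumes the caoyuz account exists on all three machines and that password login still works at this point (the passwords are asked for one last time here).

homer06: $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
homer06: $ cat ~/.ssh/id_rsa.pub | ssh caoyuz@homer07 "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
homer06: $ cat ~/.ssh/id_rsa.pub | ssh caoyuz@homer08 "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"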

Next you can test the SSH connections: from homer06, initiate SSH connections to homer06, homer07 and homer08, and make sure that each connection succeeds without asking for a password. Note that the following message appears the first time you SSH to a host:

The authenticity of host [homer06] can't be established.
The key fingerprint is: 74:32:91:f2:9c:dc:2e:80:48:73:d4:53:ab:e4:d3:1a
Are you sure you want to continue connecting (yes/no)?

Enter yes, and OpenSSH will automatically add the host's information to the /home/caoyuz/.ssh/known_hosts file; the message will not appear again the next time you connect.




III. Installing and deploying Hadoop

1. Installing Hadoop and JRE 1.5

We first install and configure Hadoop on the master node homer06; the installation process is described in the first article of this series. Suppose we install Hadoop in the /home/caoyuz/hadoop-0.16.0 directory and JRE 1.5 in the /home/caoyuz/jre directory.

2. Modify the conf/hadoop-env.sh file

Set the JAVA_HOME environment variable in this file: export JAVA_HOME="/home/caoyuz/jre"

3. Modify the conf/hadoop-site.xml file

In the first article of this series, we configured Hadoop's pseudo-distributed mode by modifying this file. Now we configure the real distributed runtime environment for Hadoop through the same file. Please modify conf/hadoop-site.xml as shown in Listing 2:




Code Listing 2


<configuration>
  <property>
    <name>fs.default.name</name>
    <value>homer06.austin.ibm.com:9000</value>
    <description>The name of the default file system. Either the literal string "local" or a host:port for DFS.</description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>homer06.austin.ibm.com:9001</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/caoyuz/hadoopfs/name</value>
    <description>Determines where on the local filesystem the DFS name node should store the name table. If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.</description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/caoyuz/hadoopfs/data</value>
    <description>Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
  </property>
</configuration>


The parameter fs.default.name specifies the host name (or IP address) and port of the Name Node; here we set it to homer06 and port 9000. The parameter mapred.job.tracker specifies the address and port of the Job Tracker; here we set it to homer06 and port 9001. The parameter dfs.name.dir specifies where the Name Node stores its data on the local file system; we set it to /home/caoyuz/hadoopfs/name. The parameter dfs.data.dir specifies where a Data Node stores its data on the local file system; we set it to /home/caoyuz/hadoopfs/data. Note that Hadoop creates both directories automatically; there is no need to create them beforehand.

For more configurable parameters, refer to the conf/hadoop-default.xml file and override them as needed in the conf/hadoop-site.xml file.

4. Set the master and slave nodes

Modify the conf/masters file, replacing localhost with homer06. Modify the conf/slaves file, deleting localhost and adding our other two machines, homer07 and homer08; note that there is one machine name per line. The resulting files are shown below.
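
For reference, after editing, the two files look like this:

conf/masters:
homer06

conf/slaves:
homer07
homer08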

5. Deploy Hadoop to other machines

Now that Hadoop and the JRE are installed and configured on homer06, we need to deploy them to the other machines, which can be done with the scp command, as shown in Listing 3:




Code Listing 3


homer06: $ scp -r /home/caoyuz/hadoop-0.16.0 homer07:/home/caoyuz/hadoop-0.16.0
homer06: $ scp -r /home/caoyuz/jre homer07:/home/caoyuz/jre
homer06: $ scp -r /home/caoyuz/hadoop-0.16.0 homer08:/home/caoyuz/hadoop-0.16.0
homer06: $ scp -r /home/caoyuz/jre homer08:/home/caoyuz/jre


Strictly speaking, it is not necessary to copy the JRE directory to the other machines with scp. All you have to do is make sure that every machine has JRE 1.5 installed, and installed in the same directory.
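
If you are not sure, a quick check such as the following can be run from homer06 (a sketch that relies on the passphrase-free SSH configured earlier):

homer06: $ ssh homer07 "/home/caoyuz/jre/bin/java -version"
homer06: $ ssh homer08 "/home/caoyuz/jre/bin/java -version"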

6. Format a new distributed file system on homer06

As shown in Listing 4:




Code Listing 4


homer06: $ cd /home/caoyuz/hadoop-0.16.0
homer06: $ bin/hadoop namenode -format


7. Start the Hadoop processes on homer06

As shown in Listing 5:




Code Listing 5


homer06: $ cd /home/caoyuz/hadoop-0.16.0
homer06: $ bin/start-all.sh


After startup completes, running ps -ef on homer06 should show three newly started Java processes (NameNode, Secondary NameNode, JobTracker). Running ps -ef on homer07 and homer08 should show that two new Java processes (DataNode, TaskTracker) have been started automatically on each of those machines. An example check is shown below.
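
For example (a sketch; the grep simply filters the process list for Java processes):

homer06: $ ps -ef | grep java | grep -v grep                  # expect NameNode, Secondary NameNode, JobTracker
homer06: $ ssh homer07 "ps -ef | grep java | grep -v grep"    # expect DataNode, TaskTracker
homer06: $ ssh homer08 "ps -ef | grep java | grep -v grep"    # expect DataNode, TaskTracker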




IV. Running the Hadoop program

At this point, the entire Hadoop distributed environment has been deployed and the relevant background processes have been started. We can now try to run the WordCount program covered in the second article, as shown in Listing 6:




Code Listing 6


homer06: $ mkdir -p /home/test-in
# Put the files to be analyzed into the /home/test-in directory of the local file system
homer06: $ cd /home/caoyuz/hadoop-0.16.0
homer06: $ bin/hadoop dfs -put /home/test-in input
# Copy the /home/test-in directory on the local file system to the root directory of HDFS, renaming it to input
$ bin/hadoop jar hadoop-0.16.0-examples.jar wordcount input output
# View the execution results:
# Copy the files from HDFS to the local file system and view them:
$ bin/hadoop dfs -get output output
$ cat output/*
# Or view them directly in HDFS:
$ bin/hadoop dfs -cat output/*


The process of executing the WordCount program in Listing 6 is exactly the same as the one described in the first article for the pseudo-distributed environment, but now we have a truly distributed execution environment: the data is distributed across the Data Nodes homer07 and homer08, and you can see data files in the /home/caoyuz/hadoopfs/data directory of both machines (this is the dfs.data.dir parameter we specified in conf/hadoop-site.xml; a quick way to list them is shown below). The entire WordCount computation is carried out in parallel by the three machines homer06, homer07 and homer08, and we can easily add more machines to participate. This is the advantage of distributed parallel programs: more storage space and computing power are obtained simply by adding machines, and the more machines deployed, the more efficiently massive data sets can be processed.
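
For instance, the data blocks on the slave nodes can be listed from homer06 like this (a sketch using the passphrase-free SSH configured earlier):

homer06: $ ssh homer07 "ls -R /home/caoyuz/hadoopfs/data"
homer06: $ ssh homer08 "ls -R /home/caoyuz/hadoopfs/data"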




V. Deploying distributed programs with IBM MapReduce Tools

The second article described the basic features and usage of IBM MapReduce Tools. Now we focus on how to use IBM MapReduce Tools to deploy a MapReduce program remotely into a distributed Hadoop environment.

Suppose we use the distributed environment deployed in the previous section, and develop the MapReduce program with Eclipse on another machine.

1. Define the location of the Hadoop server

First make sure the IBM MapReduce Tools plug-in is installed in your Eclipse. Start Eclipse, select Window -> Open Perspective -> Other, and choose MapReduce in the pop-up dialog, so that Eclipse switches to the dedicated MapReduce perspective.

Then check whether there is a dedicated MapReduce Servers view in your MapReduce perspective. If necessary, select Window -> Show View -> Other and choose MapReduce Servers under the MapReduce Tools category in the pop-up dialog to open this view.

Next, click the blue icon in the upper-right corner of the MapReduce Servers view; an interface for setting the location of the Hadoop server appears, as shown in Figure 1. The Hadoop server mentioned here is, for this article, the machine homer06. After entering the parameters, click the "Validate location" button to check whether the Hadoop server can be found and connected to correctly. If there is an error, try executing the command ssh the_hostname_of_your_hadoop_server on the command line, or use a graphical SSH client, to make sure an SSH connection can be established.




Figure 1: Defining the location of the Hadoop server





2. Create a MapReduce Project

Create a new MapReduce Project in Eclipse and add the WordCount class defined in the second article to the project. The class needs a small modification before it can be deployed directly to the distributed environment we have built: it used to read the input and output paths of the computation from the command line, but the current version of IBM MapReduce Tools does not support passing command-line arguments when deploying remotely. To keep the test simple, I hard-coded the input path as input and the output path as output in the program. Before running the WordCount program, copy a batch of files whose word frequencies are to be counted into the input directory of the distributed file system.

The complete WordCount class code is shown in Listing 7:




Code Listing 7


// import statements omitted

public class WordCount extends Configured implements Tool {

  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    private String pattern = "[^\\w]";   // matches all non-word characters

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString().toLowerCase();   // convert the whole line to lower case
      line = line.replaceAll(pattern, " ");           // replace non-word characters with spaces
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public int run(String[] args) throws Exception {
    Path tempDir = new Path("wordcount-temp-" +
        Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

    JobConf conf = new JobConf(getConf(), WordCount.class);
    try {
      conf.setJobName("WordCount");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(MapClass.class);
      conf.setCombinerClass(Reduce.class);
      conf.setReducerClass(Reduce.class);
      conf.setInputPath(new Path(args[0]));
      conf.setOutputPath(tempDir);                          // write the word counts to a temporary directory
      conf.setOutputFormat(SequenceFileOutputFormat.class);
      JobClient.runJob(conf);

      JobConf sortJob = new JobConf(getConf(), WordCount.class);   // second job: sort by descending count
      sortJob.setJobName("sort");
      sortJob.setInputPath(tempDir);
      sortJob.setInputFormat(SequenceFileInputFormat.class);
      sortJob.setMapperClass(InverseMapper.class);
      sortJob.setNumReduceTasks(1);                         // a single reducer yields one globally sorted output file
      sortJob.setOutputPath(new Path(args[1]));
      sortJob.setOutputKeyClass(IntWritable.class);
      sortJob.setOutputValueClass(Text.class);
      sortJob.setOutputKeyComparatorClass(IntWritableDecreasingComparator.class);
      JobClient.runJob(sortJob);
    } finally {
      FileSystem.get(conf).delete(tempDir);                 // clean up the temporary directory
    }
    return 0;
  }

  private static class IntWritableDecreasingComparator extends IntWritable.Comparator {
    public int compare(WritableComparable a, WritableComparable b) {
      return -super.compare(a, b);
    }
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      return -super.compare(b1, s1, l1, b2, s2, l2);
    }
  }

  public static void main(String[] args) throws Exception {
    String[] paths = { "input", "output" };   // input and output paths are hard-coded for remote deployment
    int res = ToolRunner.run(new Configuration(), new WordCount(), paths);
    System.exit(res);
  }
}


3. Remote deployment and operation

Select the WordCount class in the Project Explorer on the left, and select Run As -> Run on Hadoop from the right-click menu, as shown in Figure 2:




Figure 2: Run As -> Run on Hadoop





Then select the Hadoop server we defined earlier in the Select Hadoop Server dialog. After clicking Finish, MapReduce Tools automatically packages the WordCount project into a jar file, copies it to the remote Hadoop server and runs it there; the output of the whole run can be viewed conveniently in Eclipse's console.

4. View Run Results

Once the location of the Hadoop server has been defined, a new project appears in the Project Explorer on the left (with a blue icon in front of the project name) that lets you browse the files in the Hadoop distributed file system. By double-clicking the part-00000 file in the output directory, we can view the output of the WordCount program directly in Eclipse, as shown in Figure 3:




Figure 3: Viewing the output of the WordCount program in Eclipse





VI. Cloud computing and Hadoop

We know that Hadoop shows its parallel advantages in a distributed cluster environment: the more machines, the faster and more efficiently massive data can be processed. The practical problem is that although many companies need to process massive amounts of data, they cannot necessarily afford to build a large-scale cluster environment of their own, so Hadoop would, for them, inevitably be left with nowhere to show its advantages. Is there an alternative? In the past this problem was genuinely hard to solve, but the situation is different today. Readers who follow IT industry trends will know that the industry is currently promoting "cloud computing", and some companies are investing in so-called "cloud computing platforms". The "cloud" here is a distributed environment built from large numbers of machines, plus some infrastructure and management software, including distributed computing software such as Hadoop and distributed file systems such as HDFS. Companies and individuals who need them can go to such a cloud computing platform to rent storage space and compute nodes (computing power) and run their distributed computations there.

Amazon, for instance, has launched Amazon S3 (Amazon Simple Storage Service), which provides reliable, fast, scalable network storage, and a commercial cloud computing platform, Amazon EC2 (Amazon Elastic Compute Cloud). Users can store their data on the Amazon S3 distributed storage platform and then rent computing power on Amazon EC2 to run computations over that data. Amazon EC2 offers so-called on-demand rental: at the time of writing the fee is 0.10 dollars per virtual computer (which Amazon EC2 calls an instance) per hour. Unlike traditional host rental services, users can rent a number of virtual computers matched to the size of the computation at hand; after the computation finishes, the rented virtual computers are released, and you are charged according to how many virtual computers you rented and how long the computation actually ran. In other words, you pay for computing power without wasting a cent. IBM's cloud computing platform, "Blue Cloud", provides similar capabilities for enterprise users.
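
As a rough worked example (assuming the per-instance-hour rate quoted above and ignoring any storage and data transfer charges): renting 20 instances for a job that runs 3 hours would cost 20 x 3 x 0.10 = 6 dollars, after which the instances are released and billing stops.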

If we want to write a Hadoop-based distributed parallel program to process large amounts of data, we can run the computation on cloud computing platforms such as those provided by IBM or Amazon. The details of IBM Blue Cloud, Amazon S3 and Amazon EC2 are beyond the scope of this article; interested readers can visit their official websites for more information.




VII. Closing remarks

This is the last article in the series. The first article introduced the MapReduce computing model, the distributed file system HDFS, the basic principles of distributed parallel computing, and how to install and deploy a single-machine Hadoop environment. In the second article we actually wrote a Hadoop parallel computing program, learned some important programming details, and saw how to use IBM MapReduce Tools to compile, run and debug Hadoop parallel programs in the Eclipse environment. This article described in detail how to deploy a distributed Hadoop environment, how to use IBM MapReduce Tools to deploy programs to that distributed environment and run them, and briefly introduced current "cloud computing platforms" and on-demand rental of computing capacity.

Hopefully these three articles have served as an introduction, letting you feel the fun of MapReduce distributed parallel programming, get started, and warm up for the coming era of so-called "cloud computing".
