Cloud computing with Linux and Apache Hadoop

Companies such as IBM®, Google, VMware, and Amazon have started offering cloud computing products and strategies. This article explains how to build a Hadoop cluster with the Apache Hadoop MapReduce framework and how to create a sample MapReduce application that runs on it. It also discusses how to move time-consuming, disk-intensive tasks onto the cloud.

Introduction to Cloud Computing

Cloud computing has recently been seen as a new trend in the IT industry. It can be roughly defined as scalable computing resources that are provided as a service from outside your own environment and paid for by usage. You can access any of these resources over the Internet without having to worry about computing power, bandwidth, storage, security, or reliability.

This article first briefly introduces Amazon EC2, a cloud computing platform on which you can lease virtual Linux® servers, and then introduces Apache Hadoop, an open source MapReduce framework that can be installed on those virtual servers to build a cloud computing framework. Hadoop is not limited to vendor-supplied VMs, however; it can also be deployed on a generic Linux OS running on physical machines.

Before we discuss Apache Hadoop, let's briefly introduce the structure of the cloud computing system. Figure 1 shows the layers of cloud computing and some of the existing services.

Infrastructure as a Service (IaaS) means leasing infrastructure (computing resources and storage) in the form of a service. IaaS allows users to lease computers (that is, virtual hosts) or data centers, and lets them specify quality-of-service constraints, such as the ability to run certain operating systems and software. Amazon EC2 provides virtual hosts to users at this layer. Platform as a Service (PaaS) focuses on software frameworks or services, providing APIs for "cloud" computing on top of the infrastructure. Apache Hadoop, built on virtual hosts, serves as a PaaS cloud computing platform.


Figure 1. Layers of cloud computing and existing services

Amazon EC2

Amazon EC2 is a web service that allows users to request virtual machines with a variety of resources (CPU, disk, memory, and so on). Users pay only for the compute time they actually use and leave everything else to Amazon.

These instances, based on Amazon Machine Images (AMIs), run Linux and can host any application or software you need. After renting a server from Amazon, you can connect to it and maintain it just like a physical server, using standard SSH tools.

A detailed introduction to EC2 is beyond the scope of this article.

The best way to deploy the Hadoop cloud computing framework is to deploy it on an AMI, so that you can take advantage of cloud resources without having to think about computing power, bandwidth, storage, and so on. However, in the next section of this article we will build Hadoop in a local VMware image of a Linux server, because Hadoop is not limited to cloud solutions. Before we proceed, let's introduce Apache Hadoop.

Apache Hadoop

Apache Hadoop is a software framework (platform) that can process large amounts of data in a distributed manner. It appeared in 2006 and is backed by companies such as Google, Yahoo!, and IBM. It can be considered a PaaS model.

Its core design comprises the MapReduce implementation and HDFS (Hadoop Distributed File System), which originate from MapReduce (introduced by a Google paper) and the Google File System.

MapReduce

MapReduce is a software framework introduced by Google that supports distributed computing on large data sets across clusters of computers (that is, nodes). It consists of two processes: map and reduce.

During the map process, the master node takes the input, splits it into smaller subtasks, and distributes the subtasks to the worker nodes.

The worker nodes process these small tasks and return the results to the master node.

Then, during the reduce process, the master node combines the results of all the subtasks into the output, which is the result of the original task.

Figure 2 illustrates the concept of the MapReduce process.

The advantage of MapReduce is that it allows the map and reduce operations to be processed in a distributed fashion. Because the map operations are independent of one another, they can all be executed in parallel, which reduces the total computation time.
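To make the two phases concrete, here is a minimal, purely local sketch of the idea in plain Java, before we touch Hadoop's API at all. The class and method names are only illustrative: the "map" step emits a (word, 1) pair for every word, and the "reduce" step sums the counts that share the same key.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a single-process sketch of the map/reduce idea.
public class LocalWordCountSketch {

    // "Map" phase: turn one line of input into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        for (String word : line.split("\\s+")) {
            if (word.length() > 0) {
                pairs.add(new java.util.AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // "Reduce" phase: sum the counts that share the same key (word).
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new HashMap<String, Integer>();
        for (Map.Entry<String, Integer> pair : pairs) {
            Integer current = totals.get(pair.getKey());
            totals.put(pair.getKey(), (current == null ? 0 : current) + pair.getValue());
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> allPairs = new ArrayList<Map.Entry<String, Integer>>();
        // In a real cluster, each line (or file split) would be mapped on a different node.
        allPairs.addAll(map("the quick brown fox"));
        allPairs.addAll(map("the lazy dog"));
        System.out.println(reduce(allPairs)); // for example {the=2, quick=1, ...}
    }
}

In Hadoop, the same two functions are expressed through the Mapper and Reducer interfaces, as the PortalLogAnalyzer example later in this article shows.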

HDFS

A complete introduction to HDFS and how to use it is beyond the scope of this article.

From the end user's point of view, HDFS looks like a traditional file system: you can use directory paths to perform CRUD operations on files. However, because of the distributed nature of the storage, there are the concepts of "NameNode" and "DataNode", which take on their respective responsibilities.
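For example, once the cluster built later in this article is running, the basic file operations look much like their Unix counterparts (a short sketch using the hadoop fs shell; the paths are only illustrative):

$ bin/hadoop fs -mkdir /user/root/demo          # create a directory in HDFS
$ bin/hadoop fs -put /etc/hosts /user/root/demo # upload a local file
$ bin/hadoop fs -ls /user/root/demo             # list its contents
$ bin/hadoop fs -cat /user/root/demo/hosts      # read the file
$ bin/hadoop fs -rm /user/root/demo/hosts       # delete the file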

The NameNode is the master of the DataNodes. It provides metadata services within HDFS; the metadata describes how files are mapped to DataNodes. It also receives operation commands and determines which DataNodes should perform the operation and the replication.

DataNodes serve as the storage blocks of HDFS. They also respond to the block creation, deletion, and replication commands received from the NameNode.

JobTracker and TaskTracker

When you submit an application, you should provide input and output directories that are contained in HDFS. The JobTracker, as the single control point for launching MapReduce applications, decides how many TaskTrackers and subtasks should be created and assigns each subtask to a TaskTracker. Each TaskTracker reports its status to the JobTracker as it completes its tasks.

Typically, one master node acts as the NameNode and JobTracker, and the slave nodes act as DataNodes and TaskTrackers. The conceptual view of the Hadoop cluster and the MapReduce process is shown in Figure 2.


Figure 2. The concept view and MapReduce process of the Hadoop cluster

Setting up Apache Hadoop

Now let's set up the Hadoop cluster on the Linux VMs, and then run a MapReduce application on it.

Apache Hadoop supports three deployment modes:

Standalone mode: By default, Hadoop runs in a non-distributed mode as a single Java process. This mode is suitable for debugging applications.

Pseudo-distributed mode: Hadoop can also run on a single node in a pseudo-distributed mode, in which each Hadoop daemon runs as a separate Java™ process.

Fully distributed mode: Hadoop is configured on different hosts and runs as a cluster.

To set up Hadoop in standalone or pseudo-distributed mode, refer to the Hadoop Web site. In this article, we only discuss setting up Hadoop in fully distributed mode.

Preparing the Environment

For this article, we need three GNU/Linux servers: one as the master node and two as slave nodes.

Table 1. Server information

Server IP       Server host name     Role
9.30.210.159    vm-9-30-210-159      Master node (NameNode and JobTracker)
9.30.210.160    vm-9-30-210-160      Slave node 1 (DataNode and TaskTracker)
9.30.210.161    vm-9-30-210-161      Slave node 2 (DataNode and TaskTracker)

Each machine needs to have Java SE 6 and the Hadoop binaries installed. This article uses Hadoop version 0.19.1.

You also need SSH installed on each machine, with sshd running. Popular Linux distributions such as SUSE and Red Hat already have it installed by default.
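As a quick sketch, you can verify the prerequisites on each machine as follows (the service command and the install path vary by distribution, and the archive name assumes the standard Hadoop 0.19.1 release tarball):

$ java -version                            # should report a Java SE 6 (1.6.x) runtime
$ /etc/init.d/sshd status                  # sshd should be running
$ tar xzf hadoop-0.19.1.tar.gz -C /root    # unpack the Hadoop binaries, for example under /root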

Set up communication

Update the /etc/hosts file on each machine to make sure that the three machines can reach one another by IP address and by host name, as shown below.
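Based on Table 1, the entries in /etc/hosts on each machine would look like this:

9.30.210.159    vm-9-30-210-159
9.30.210.160    vm-9-30-210-160
9.30.210.161    vm-9-30-210-161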

Because the Hadoop master communicates with the slave nodes over SSH, an authenticated, password-free SSH connection should be established between the master node and the slave nodes. Execute the following command on each machine to generate an RSA public/private key pair.

ssh-keygen -t rsa


This generates id_rsa.pub in the /root/.ssh directory. Rename the master node's id_rsa.pub (here it is renamed 59_rsa.pub) and copy it to the slave nodes. Then execute the following command on each slave node to add the master node's public key to its authorized keys.

cat /root/.ssh/59_rsa.pub >> /root/.ssh/authorized_keys


Now try connecting to the slave nodes over SSH from the master node. You should be able to connect successfully without having to provide a password.
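For example, from the master node (assuming the root account, matching the key paths above):

$ ssh root@9.30.210.160
$ ssh root@9.30.210.161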

Setting up the master node

Setting up Hadoop in fully distributed mode requires editing the configuration files in the <hadoop_home>/conf/ directory.

Configure the Hadoop deployment in hadoop-site.xml. The configuration here overrides the settings in hadoop-default.xml.


Table 2. Configuration Properties

Property            Explanation
fs.default.name     NameNode URI
mapred.job.tracker  JobTracker URI
dfs.replication     Number of replicas
hadoop.tmp.dir      Temporary directory

hadoop-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://9.30.210.159:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>9.30.210.159:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/root/hadoop/tmp/</value>
  </property>
</configuration>


Specify JAVA_HOME by editing the hadoop-env.sh file. Uncomment the following line and specify your own JAVA_HOME directory.

export JAVA_HOME=<JAVA_HOME_dir>


Add the IP address of the master node to the masters file.

9.30.210.159


Add the IP addresses of the slave nodes to the slaves file.

9.30.210.160
9.30.210.161


Setting up the slave nodes

Copy hadoop-site.xml, hadoop-env.sh, masters, and slaves to each slave node; you can use scp or another copying tool, for example as shown below.
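A sketch of the copy using scp from the master node; it assumes the same <hadoop_home> path on every node, and the second slave node is handled the same way:

scp <hadoop_home>/conf/hadoop-site.xml <hadoop_home>/conf/hadoop-env.sh \
    <hadoop_home>/conf/masters <hadoop_home>/conf/slaves \
    root@9.30.210.160:<hadoop_home>/conf/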

Formatting HDFS

Run the following command on the master node to format the HDFS distributed file system.

<hadoop_home>/bin/hadoop namenode -format


Checking the Hadoop cluster

You can now start the Hadoop cluster with bin/start-all.sh. The command output lists some logs on the master node and the slave nodes. Check these logs to make sure everything is fine. If something is messed up, you can format HDFS again, empty the temporary directory specified in hadoop-site.xml, and restart.
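As a quick sanity check (a sketch: it assumes the JDK's jps tool is on the PATH, and the process IDs shown are only placeholders), you can list the Java daemons running on each node:

# On the master node (process IDs will differ)
$ jps
2543 NameNode
2684 SecondaryNameNode
2759 JobTracker
2891 Jps

# On a slave node
$ jps
1875 DataNode
1963 TaskTracker
2054 Jps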

Access the following URLs to confirm that the master node and the slave nodes are running normally.

NameNode: http://9.30.210.159:50070
JobTracker: http://9.30.210.159:50030


Now that the Hadoop cluster has been set up in the cloud, it's time to run a MapReduce application.

Creating MapReduce Applications

MapReduce applications must have the "map" and "reduce" nature, meaning that a task or job can be split into small fragments to be processed in parallel. The results of each subtask can then be reduced to produce the result of the original task. One such task is a Web site keyword search: the search and crawl tasks can be split into subtasks and assigned to the slave nodes, and all the results are then aggregated on the master node to obtain the final result.

Trying the sample application

Hadoop comes with some sample applications for testing. One of them is a word counter, which counts the number of times each word appears in several files. Check the Hadoop cluster by running this application.

First, put the input files (the files under the conf/ directory) into the distributed file system. We'll count the number of occurrences of words in these files.

$ bin/hadoop fs -put conf input


Then run the sample application; the following command counts the occurrences of words that begin with "dfs".

$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'


The output of the command describes the map and reduce process.

The first two commands generate two directories in HDFS, "input" and "output". You can list them with the following command.

$ bin/hadoop fs -ls


View the output files in the distributed file system. They list the number of occurrences of each word beginning with "dfs" as key-value pairs.

$ bin/hadoop fs -cat output/*


Now visit the JobTracker site to see the completed job log.

Creating the Log Analyzer MapReduce application

Now let's create a Portal (IBM WebSphere® Portal v6.0) log analyzer application, which has a lot in common with the WordCount application in Hadoop. The analyzer searches all of the portal's SystemOut*.log files and shows how many times each application was started on the portal during a specific period of time.

In the Portal environment, all the logs are split into 5MB fragments, which makes them well suited to being analyzed in parallel by several nodes.


hadoop.sample.PortalLogAnalyzer.java

package hadoop.sample;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class PortalLogAnalyzer {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        // Marker written to SystemOut*.log each time an application is started
        private static final String APP_START_TOKEN = "Application started:";

        private Text appName = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            if (line.indexOf(APP_START_TOKEN) > -1) {
                int startIndex = line.indexOf(APP_START_TOKEN);
                startIndex += APP_START_TOKEN.length();
                // Emit (application name, 1) for every start event found
                appName.set(line.substring(startIndex).trim());
                output.collect(appName, new IntWritable(1));
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Sum the counts for each application name
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf jobConf = new JobConf(PortalLogAnalyzer.class);
        jobConf.setJobName("Portal Log Analyzer");

        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(IntWritable.class);

        jobConf.setMapperClass(Map.class);
        jobConf.setCombinerClass(Reduce.class);
        jobConf.setReducerClass(Reduce.class);

        jobConf.setInputFormat(TextInputFormat.class);
        jobConf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
        FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));

        JobClient.runJob(jobConf);
    }
}


For a complete explanation of the Hadoop API, see the API documentation on the Hadoop Web site. Here's a brief description.

The Map class implements the map function: it searches every line of the log file for the application name, and then places the application name in the output collection as a key-value pair.

The Reduce class sums all of the values that have the same key (the same application name). Therefore, the key-value pairs that this application finally outputs represent the number of times each application was started on the Portal.

The main function configures and runs the MapReduce job.

Running PortalLogAnalyzer

First, copy the Java code to the master node and compile it: put the Java source in the <hadoop_home>/workspace directory, compile it, and archive it into a JAR file that will later be run with the hadoop command.

$ mkdir classes
$ javac -cp ../hadoop-0.19.1-core.jar -d classes hadoop/sample/PortalLogAnalyzer.java
$ jar -cvf PortalLogAnalyzer.jar -C classes/ .


Copy the Portal logs into workspace/input. Suppose we have several log files that contain all the logs for May 2009. Put these logs into HDFS.

$ bin/hadoop fs -put workspace/input input2


When you run PortalLogAnalyzer, the output describes the map and reduce process.

$ bin/hadoop jar workspace/PortalLogAnalyzer.jar hadoop.sample.PortalLogAnalyzer input2 output2



Figure 3. The output of the task

After the application finishes executing, the output should be similar to Figure 4.

$ bin/hadoop fs -cat output2/*



Figure 4. Partial output

When you visit the JobTracker site, you can see another completed job. Note the last line in Figure 5.


Figure 5. Completed job
