Hadoop: a distributed computing solution for massive files


Hadoop is a Java implementation of Google's MapReduce. MapReduce is a simplified distributed programming model that automatically distributes a program across a very large cluster of ordinary machines for concurrent execution. Just as Java programmers need not worry about memory leaks, MapReduce's run-time system handles the details of partitioning the input data, scheduling execution across the machines of the cluster, dealing with machine failures, and managing communication between machines. This model lets programmers use the resources of very large distributed systems without any experience in concurrent processing or distributed systems.

I. Introduction

As a Hadoop programmer, what you need to do is:
1. Define a Mapper that processes the input key/value pairs and outputs intermediate results.
2. Define a Reducer (optional) that processes the intermediate results and outputs the final results.
3. Define an InputFormat and OutputFormat (optional). The InputFormat converts the content of each input line into a Java class for the Mapper function; if not defined, the default is String.
4. Define the main function, define a job in it, and run it.

Then the task is handed over to the system.
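For reference, the grep example in section II below only implements a Mapper, so here is a rough sketch of what a hand-written Reducer looks like in the same old org.apache.hadoop.mapred API (pre-generics signatures, matching the article's example); the class name and its pass-through behavior are purely illustrative, not part of the original:

package demo.hadoop;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Illustrative Reducer sketch; the framework's IdentityReducer already does this.
public class PassThroughReducer extends MapReduceBase implements Reducer {

    // Called once per intermediate key with all of that key's values;
    // here we simply copy every value to the final output.
    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter) throws IOException {
        while (values.hasNext()) {
            output.collect(key, (Writable) values.next());
        }
    }
}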
1. Basic concepts: Hadoop's HDFS implements Google's GFS file system: the NameNode runs on the master node as the file system master, and a DataNode runs on each machine. Hadoop also implements Google's MapReduce: the JobTracker runs on the master node as the MapReduce master, and a TaskTracker runs on each machine to execute tasks.

2. The main() function creates a JobConf, defines the Mapper, Reducer, Input/OutputFormat and the input/output file directories, submits the job to the JobTracker, and waits for the job to finish.

3. The JobTracker creates an InputFormat instance and calls its getSplits() method to split the files in the input directory into FileSplits as the input of the Mapper tasks, then generates the Mapper tasks and adds them to the queue.

4. Each TaskTracker requests its next map/reduce task from the JobTracker.

A Mapper task first creates a RecordReader from the InputFormat, loops over the contents of its FileSplit to generate keys and values, and passes them to the Mapper function. After processing, the intermediate results are written out as SequenceFiles.
A Reducer task fetches the intermediate results it needs over HTTP from the Jetty server of the TaskTrackers that ran the Mappers (progress up to 33%), sorts/merges them (66%), executes the Reducer function, and writes the results to the output directory according to the OutputFormat.

A TaskTracker reports its running status to the JobTracker every 10 seconds, and 10 seconds after completing a task it requests the next task from the JobTracker.
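For orientation, the InputFormat contract that the JobTracker and Mapper tasks rely on above looks roughly like this in the old org.apache.hadoop.mapred API (a simplified paraphrase, not the actual source; exact signatures differ somewhat between early Hadoop versions):

// Simplified paraphrase of org.apache.hadoop.mapred.InputFormat (old API);
// shown only to orient the reader, details vary between Hadoop versions.
public interface InputFormat {
    // Split the files in the input directory into splits, one per Mapper task.
    InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

    // Create the RecordReader that turns one split into the key/value pairs
    // fed to the Mapper function.
    RecordReader getRecordReader(InputSplit split, JobConf job, Reporter reporter)
            throws IOException;
}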

All data processing in the Nutch project is built on Hadoop. For details, see Scalable Computing with Hadoop.

II. Code Written by the Programmer

Let's write a simple distributed grep, which simply performs line-by-line regular-expression matching on the input files; if a line matches, it is written to the output file. Because the output is just the matching input lines, we only need to write the Mapper function; we need neither a Reducer function nor an Input/OutputFormat.

package demo.hadoop;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class HadoopGrep {

    public static class RegMapper extends MapReduceBase implements Mapper {

        private Pattern pattern;

        // Read the regular expression passed in through the job configuration.
        public void configure(JobConf job) {
            pattern = Pattern.compile(job.get("mapred.mapper.regex"));
        }

        // key: line position, value: line content; emit the line if it matches.
        public void map(WritableComparable key, Writable value,
                        OutputCollector output, Reporter reporter) throws IOException {
            String text = ((Text) value).toString();
            Matcher matcher = pattern.matcher(text);
            if (matcher.find()) {
                output.collect(key, value);
            }
        }
    }

    private HadoopGrep() {
    }  // singleton

    public static void main(String[] args) throws Exception {

        JobConf grepJob = new JobConf(HadoopGrep.class);
        grepJob.setJobName("grep-search");
        grepJob.set("mapred.mapper.regex", args[2]);

        grepJob.setInputPath(new Path(args[0]));
        grepJob.setOutputPath(new Path(args[1]));
        grepJob.setMapperClass(RegMapper.class);
        grepJob.setReducerClass(IdentityReducer.class);

        JobClient.runJob(grepJob);
    }
}

The configure() function of the RegMapper class receives the search string passed in from the main function, and the map() function performs the regular-expression matching. The key is the line position and the value is the line content; matching lines are written to the intermediate results.
The main() function takes the input/output directories and the match string from the command-line arguments, sets the Mapper to the RegMapper class, and, since the reduce step has nothing to do, uses the IdentityReducer class to copy the intermediate results directly to the final output, then runs the job.

The entire program is very simple and contains no distributed-programming details at all.
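To produce the compiled .class files used in the next section, a javac invocation along the following lines should work; the exact name of the Hadoop core jar and the source path depend on your Hadoop version and layout, so treat both as assumptions:

# Compile the example against the Hadoop core jar (jar name and paths are
# illustrative; adjust to your installation). The -d flag makes javac create
# the demo/hadoop package directories under build/classes automatically.
javac -classpath hadoop/hadoop-*-core.jar -d hadoop/build/classes HadoopGrep.java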

III. Running the Hadoop Program

The Hadoop documentation is not comprehensive. After reading Getting Started With Hadoop and the Nutch Hadoop Tutorial, and running into quite a few snags, I finally got everything running. My notes follow.

3.1 Local Running Mode

This mode performs no distributed computing and uses no NameNode or DataNode; it is suitable for debugging the code at the beginning.
Unpack Hadoop. The conf directory is the configuration directory, and the Hadoop configuration is in hadoop-default.xml. If you want to change the configuration, do not modify that file directly; instead, modify hadoop-site.xml and assign values to the relevant properties there.
The default configuration in hadoop-default.xml already works for local mode and needs no changes. The only thing in the configuration directory that must be modified is the JAVA_HOME location in hadoop-env.sh.
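For example (the JDK path below is only an illustration; point it at your own installation):

# In conf/hadoop-env.sh -- uncomment and set JAVA_HOME (path is illustrative)
export JAVA_HOME=/usr/lib/jvm/java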

Put the compiled HadoopGrep and RegMapper class files into the hadoop/build/classes/demo/hadoop/ directory, find a large log file and put it in some directory, and then run:

hadoop/bin/hadoop demo.hadoop.HadoopGrep <log file directory> <any output directory> <grep string>

Check the results in the output directory and the run logs in hadoop/logs/.
Delete the output directory before re-running.

3.2 Single-Machine Cluster Running Mode

Now let's look at a single-machine cluster. Assume the configuration in 3.1 is complete and the local hostname is hadoopserver.
Step 1. Modify hadoop-site.xml and add the following:

<property>
  <name>fs.default.name</name>
  <value>hadoopserver:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>hadoopserver:9001</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

From this point on, operations switch from the local file system to the Hadoop HDFS file system, and MapReduce's JobTracker also switches from local execution to the distributed task system. The two port numbers 9000 and 9001 were chosen arbitrarily.

In addition, if your /tmp directory is not large enough, you may need to modify the hadoop.tmp.dir property.
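If you do need to change it, the override goes into hadoop-site.xml like any other property (the directory below is an assumption; choose a disk with enough space):

<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop-tmp</value>
</property>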

Step 2. Set up SSH login without a password.

Because Hadoop's scheduling requires SSH without password prompts: without su, run ssh-keygen -t rsa in the home directory and press Enter through the prompts to generate the keys, then enter the .ssh directory and run cp id_rsa.pub authorized_keys.
Afterwards, run ssh hadoopserver to verify; it should no longer ask for a password.
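Collected as commands (run as the Hadoop user, not root):

ssh-keygen -t rsa              # press Enter at each prompt (empty passphrase)
cd ~/.ssh
cp id_rsa.pub authorized_keys  # authorize the key for local login
ssh hadoopserver               # should now log in without asking for a password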

Step 3. Format the NameNode by executing:
bin/hadoop namenode -format

Step 4. Start Hadoop.
Run hadoop/bin/start-all.sh to start the NameNode, DataNode, JobTracker, and TaskTracker.

Step 5. Now put the log files to be searched into HDFS.
Run hadoop/bin/hadoop dfs to see the file operation commands that DFS supports.
Run hadoop/bin/hadoop dfs -put <log file directory> in, which puts the log file directory into the in directory under /user/<user-name>/ in HDFS.
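Concretely, if the logs sit in a local directory called logs (the directory name is just an example):

bin/hadoop dfs -put logs in    # copy the local logs directory into HDFS as "in"
bin/hadoop dfs -ls in          # check that the files arrived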

Step 6. Now execute the grep operation:
hadoop/bin/hadoop demo.hadoop.HadoopGrep in out <grep string>
Check the run logs in hadoop/logs/. Before re-running, run hadoop/bin/hadoop dfs -rmr out to delete the out directory.

Step 7. Run hadoop/bin/stop-all.sh to stop everything.

3.3 cluster Running Mode
Assume that the configuration in 3.2 is complete and that the hostname of the second machine is hadoopserver2.
1. On hadoopserver2, create the same user as on hadoopserver and unpack Hadoop into the same directory.

2. Set the same JAVA_HOME in hadoop-env.sh and use the same hadoop-site.xml as in 3.2.

3. Copy /home/username/.ssh/authorized_keys from hadoopserver to hadoopserver2, so that hadoopserver can log in to hadoopserver2 without a password:
scp /home/username/.ssh/authorized_keys username@hadoopserver2:/home/username/.ssh/authorized_keys
 
4. Modify the hadoop/conf/slaves file on hadoopserver to list the cluster nodes, replacing localhost with:
hadoopserver
hadoopserver2

5. Run hadoop/bin/start-all.sh on hadoopserver.
This starts the NameNode, DataNode, JobTracker, and TaskTracker on hadoopserver,
and a DataNode and TaskTracker on hadoopserver2.

6. Now execute the grep operation:
hadoop/bin/hadoop demo.hadoop.HadoopGrep in out <grep string>
Before re-running, run hadoop/bin/hadoop dfs -rmr out to delete the out directory.

7. Run hadoop/bin/stop-all.sh to finish.

IV. Efficiency

Testing shows that Hadoop is not a panacea: its benefit depends on the file size and number of files, the processing complexity, the number of machines in the cluster, and the connecting bandwidth. When none of these are large, Hadoop's advantages are not obvious.
For example, a simple grep written in plain Java without Hadoop takes only 4 seconds to process the log file, while Hadoop in local mode takes 14 seconds and a Hadoop single-machine cluster takes 30 seconds. With a two-machine cluster over a 10 Mbps network port it would be even slower.

 
