Analysis of the Hadoop Data Flow Process

Last Update:2015-03-09 Source: Internet

Author: User


Hadoop: Data flow graph (based on Hadoop 0.18.3): A simple example of how data flows in Hadoop.

Here is an example of how data flows through Hadoop: counting how often each word occurs in a set of articles. The files represent the articles whose words are to be counted. Hadoop first distributes the input data to the mapper tasks on each machine; the numbers in the figure show the order in which the data flows.

1. Format the input. By default, Hadoop uses TextInputFormat, which takes the byte offset of each line within the file (not a line number) as the key and the text of the line as the value. The input to the map function has the form <K1,V1>.

2. The map function. For word counting, it can be written as follows:

    // Needs java.io.IOException, java.util.StringTokenizer, org.apache.hadoop.io.* and org.apache.hadoop.mapred.*
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // output collects the pairs emitted by the map function
        String line = value.toString();                    // the text of one line
        StringTokenizer itr = new StringTokenizer(line);   // split the line on whitespace
        while (itr.hasMoreTokens()) {
            // emit <word, 1>: the key is the word, the value is 1
            output.collect(new Text(itr.nextToken()), new IntWritable(1));
        }
    }
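Steps 1 and 2 can be tried out without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: the class MapStepSketch and its methods readRecords and map are hypothetical names chosen for illustration, java.util.Map.Entry stands in for Hadoop's key/value records, and the input is assumed to use a single \n as the line terminator.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java simulation of steps 1 and 2; no Hadoop classes are used.
class MapStepSketch {

    // Step 1: split the text into <byte offset, line> records,
    // the <K1,V1> shape that TextInputFormat hands to the mapper.
    static List<Map.Entry<Long, String>> readRecords(String text) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            records.add(new AbstractMap.SimpleEntry<>(offset, line));
            // +1 accounts for the newline that ended this line (assumed to be "\n")
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    // Step 2: emit <word, 1> for every whitespace-separated token.
    // The key (byte offset) is not needed for word counting.
    static List<Map.Entry<String, Integer>> map(long key, String value) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(value);
        while (itr.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Map.Entry<Long, String> rec : readRecords("ac bd ac\nbd ef")) {
            System.out.println(rec.getKey() + " -> " + map(rec.getKey(), rec.getValue()));
        }
    }
}
```

Running main prints one line per input record, showing the byte offset used as the key and the list of <word, 1> pairs produced from that line.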

3. The output of the map function has the form List<K2,V2>. Each word is recorded with a value of 1, meaning that the word occurred once; the values belonging to the same key are summed later in the job.

4. The combiner (optional) can be thought of as a local reduce: pairs with the same key are summed on the local machine first. For example, if the word "ac" appeared twice, the combiner outputs <"ac",2>.

5. The partitioner distributes the map output among the reduce tasks running on different machines so that they can process it. How is the assignment made? By default, Hadoop assigns each pair according to the hash value of its key. Moving the map output to the reducers in this way is called the shuffle.

6. The reduce function takes input of the form <K2,List<V2>>. The map output has the form List<K2,V2>; during the shuffle it is partitioned and then grouped by key, so it becomes <K2,List<V2>>. In the word-count example, K2 is a word and List<V2> holds the counts for that word produced by the map functions on different machines. The output has the form <K3,V3>. The reduce method for word counting is as follows:

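    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) { // sum the counts for this word
            // The source listing breaks off at this loop; the rest is the usual word-count completion.
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum)); // emit <word, total>
    }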
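The grouping in steps 5 and 6 can be illustrated the same way. This is a plain-Java sketch under stated assumptions, not Hadoop's implementation: ShuffleReduceSketch, pair, shuffle, and reduce are hypothetical names, everything runs in one process, and a TreeMap is used only so that keys come out in sorted order.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the shuffle and reduce steps; no Hadoop classes are used.
class ShuffleReduceSketch {

    // Convenience helper that builds one <K2,V2> pair.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Shuffle: turn List<K2,V2> into <K2,List<V2>> by grouping values on their key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : mapOutput) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce: sum the counts collected for one word, giving the V3 of <K3,V3>.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // <"ac",2> stands for a pair that a combiner has already summed locally.
        List<Map.Entry<String, Integer>> mapOutput =
                Arrays.asList(pair("ac", 2), pair("bd", 1), pair("ac", 1));
        for (Map.Entry<String, List<Integer>> e : shuffle(mapOutput).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue() + " -> " + reduce(e.getKey(), e.getValue()));
        }
    }
}
```

Because addition is associative, it does not matter whether a count reaches the reducer as two 1s or as one combined 2, which is why the combiner can be omitted without changing the result.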

7. The output has the form <K3,V3>, which can serve as the input to the next map function.

InputFormat: By default, Hadoop uses TextInputFormat to format the input. Its key is the byte offset of the line (not a line number), which is usually of no use to us. When each line has to be split into key and value at a delimiter, for example data of the form "class<TAB>name" where the class should be the key and the name the value, we can use KeyValueTextInputFormat. Its default delimiter is a tab (\t); a different delimiter can be set through the key.value.separator.in.input.line property. SequenceFileInputFormat<K,V> and NLineInputFormat may also be useful, depending on the requirements.

Partitioning: When Hadoop's default HashPartitioner does not meet our needs, we can implement the Partitioner<K,V> interface to write our own partitioner. The interface requires two methods, configure() and getPartition(). configure() applies the job's configuration to the partitioner, and getPartition() returns an integer from 0 up to, but not including, the number of reduce tasks.
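The contract of getPartition() can be sketched in plain Java. The class PartitionSketch and both method names are hypothetical; hashPartition follows the rule used by the default hash partitioner (clear the sign bit of the key's hash, then take the remainder), while firstLetterPartition is an invented custom rule that only shows the shape a replacement might take. A real partitioner would implement Hadoop's Partitioner<K,V> interface instead.

```java
// Plain-Java sketch of partitioning rules; no Hadoop classes are used.
class PartitionSketch {

    // Default-style rule: mask off the sign bit so the result is never negative,
    // then map the hash onto the range [0, numReduceTasks).
    static int hashPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical custom rule: route words by their first letter (case-insensitive),
    // so all words starting with the same letter reach the same reduce task.
    static int firstLetterPartition(String key, int numReduceTasks) {
        if (key.isEmpty()) {
            return 0; // arbitrary choice for empty keys
        }
        return Character.toLowerCase(key.charAt(0)) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        for (String word : new String[] {"ac", "bd", "Apple", "banana"}) {
            System.out.println(word
                    + " hash=" + hashPartition(word, reducers)
                    + " firstLetter=" + firstLetterPartition(word, reducers));
        }
    }
}
```

Whatever rule is chosen, it must be deterministic for a given key; otherwise pairs with the same key could land on different reduce tasks and the per-word totals would be split.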


This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
