Analysis of the Hadoop Data Flow Process

Source: Internet
Author: User

Hadoop data flow diagram (based on Hadoop 0.18.3): a simple example of how data flows through Hadoop.

The following example traces how data flows through a Hadoop job that counts how often each word appears in a set of articles. The input files are the articles whose words are to be counted. Hadoop first distributes the input data to the mapper tasks on each machine; the numbers in the figure indicate the order in which the data flows.

1. Format the input. By default, Hadoop uses TextInputFormat, which presents each line of a file as one record: the key is the position of the line in the file (its byte offset, often loosely called the line number) and the value is the text of the line. The input to the map function therefore has the form <K1,V1>.

2. The map function. For the word count, it can be written as follows:
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // output collects the pairs emitted by the map function.
        String line = value.toString();                  // value holds one line of text
        StringTokenizer itr = new StringTokenizer(line); // split the line on whitespace
        while (itr.hasMoreTokens()) {
            // Emit <word, 1>: the key is the word, the value is 1.
            output.collect(new Text(itr.nextToken()), new IntWritable(1));
        }
    }
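
To see what the map step emits without a Hadoop installation, the following is a minimal plain-Java sketch. It is not Hadoop code: the class name MapStepSketch and the use of java.util entries in place of Text/IntWritable pairs are assumptions made for illustration only.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java stand-in for the map step; no Hadoop classes are used (illustrative only).
final class MapStepSketch {

    // Emits one (word, 1) pair per whitespace-separated token,
    // mirroring output.collect(word, 1) in the map function.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // <K1,V1> = <0, "AC BD AC">; the word count ignores the key.
        System.out.println(map(0L, "AC BD AC")); // prints [AC=1, BD=1, AC=1]
    }
}
```

Note that the repeated word "AC" is emitted twice with a value of 1 each time; nothing is summed at this stage.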

  

3. The output of the map function has the form List<K2,V2>. Each word is recorded with a value of 1, meaning that the word appeared once; the values that share a key are summed later in the job.

4. The combiner (optional). This can be thought of as a local reduce: pairs with the same key are aggregated on the mapper's machine first. For example, if the word "AC" appears twice in a mapper's output, the combiner emits <"AC",2>.

5. The partitioner distributes the map output among the reduce tasks running on different machines so that they can process it. How is a key assigned? By default, Hadoop assigns it according to the hash value of the key, so every pair with the same key goes to the same reduce task. This process is called the shuffle.

6. The reduce function. Its input has the form <K2,List<V2>>: the map output, List<K2,V2>, is partitioned during the shuffle and the values for each key are then merged, giving <K2,List<V2>>. In the word count, K2 is a word and List<V2> holds the counts for that word produced by the map tasks on different machines. The output has the form <K3,V3>. The reduce method for the word count is as follows:
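    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) { // Sum
            sum += values.next().get(); // assumed: the source breaks off after "// Sum"
        }
        output.collect(key, new IntWritable(sum)); // assumed: emit <word, total>
    }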
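
Steps 3 to 6 can be simulated in memory with a short plain-Java sketch. This is only an illustration under stated assumptions: the class ShuffleReduceSketch and its methods are hypothetical, it uses no Hadoop classes, and it leaves out the assignment of keys to separate reduce tasks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of combine -> shuffle -> reduce (illustrative, not Hadoop code).
final class ShuffleReduceSketch {

    // Combiner: local reduce over one mapper's words, e.g. [AC, BD, AC] -> {AC=2, BD=1}.
    static Map<String, Integer> combine(List<String> words) {
        Map<String, Integer> local = new TreeMap<>();
        for (String w : words) {
            local.merge(w, 1, Integer::sum);
        }
        return local;
    }

    // Shuffle: gather the values for each key from all mappers into <K2, List<V2>>.
    static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> mapperOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> output : mapperOutputs) {
            for (Map.Entry<String, Integer> e : output.entrySet()) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        return grouped;
    }

    // Reduce: sum the partial counts for one key, giving the value of <K3, V3>.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> mapper1 = combine(Arrays.asList("AC", "BD", "AC"));
        Map<String, Integer> mapper2 = combine(Arrays.asList("AC", "EF"));
        Map<String, List<Integer>> grouped = shuffle(Arrays.asList(mapper1, mapper2));
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
        }
        // prints AC 3, BD 1, EF 1 (one tab-separated pair per line)
    }
}
```

Here the reducer for "AC" receives the list [2, 1]: a 2 from the first mapper's combiner and a 1 from the second mapper.
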
7. The output has the form <K3,V3>, which can serve as the input to the map function of a subsequent job.

InputFormat: By default, Hadoop uses TextInputFormat to format the input. Its key is the position of the line in the file (the byte offset, loosely the line number), which is generally of no use to us. When key and value have to be separated by a delimiter within each line, for example data of the form "class<TAB>name" where class should be the key and name the value, we can use KeyValueTextInputFormat. Its default delimiter is a tab (\t), and a different delimiter can be set through key.value.separator.in.input.line. Depending on other requirements, SequenceFileInputFormat<K,V> or NLineInputFormat may also be used.

Partitioning: The default HashPartitioner in Hadoop sometimes cannot meet our needs. In that case we can implement Partitioner<K,V> to provide our own partitioner. The Partitioner interface requires two methods, configure() and getPartition(). configure() applies the job's configuration to the partitioner, and getPartition() returns an integer from 0 up to, but not including, the number of reduce tasks.
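
The arithmetic behind getPartition() can be sketched in plain Java. The first method follows the hash-modulo rule used by Hadoop's HashPartitioner, with String.hashCode() standing in for the hash of a Hadoop key type; the second, firstLetterPartition, is a hypothetical custom rule invented for this illustration. Neither implements the Hadoop Partitioner interface.

```java
// Partition arithmetic only (illustrative); a real partitioner would implement
// Hadoop's Partitioner<K,V> and be registered on the job.
final class PartitionSketch {

    // Hash rule: clear the sign bit, then take the remainder, so the result
    // always lies in [0, numReduceTasks).
    static int hashPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical custom rule: words starting with the same letter
    // (case-insensitive) go to the same reduce task.
    static int firstLetterPartition(String key, int numReduceTasks) {
        if (key.isEmpty()) {
            return 0;
        }
        return Character.toLowerCase(key.charAt(0)) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4;
        for (String word : new String[] {"AC", "BD", "Apple", "avocado"}) {
            System.out.println(word
                    + " hash=" + hashPartition(word, numReduceTasks)
                    + " firstLetter=" + firstLetterPartition(word, numReduceTasks));
        }
    }
}
```

Whatever rule is chosen, it must be deterministic for a given key; otherwise pairs with the same key could reach different reduce tasks and the totals would be split.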
