Hadoop Learning Note 12: Common Algorithms in MapReduce

1. What are the common algorithms in MapReduce?

(1) King of the classics: Word Count

This is the classic MapReduce example, as classic as it gets!
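For reference only, a minimal WordCount sketch in Hadoop's newer org.apache.hadoop.mapreduce API might look like the following (a generic illustration, not code taken from this note; the classes would be nested inside a job driver class, as in the examples later on):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // WordCount sketch: the mapper emits <word, 1> for every token,
    // the reducer sums the counts for each word.
    public static class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split each input line on whitespace and emit <word, 1>
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts for each word
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }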

(2) Data deduplication

The main purpose of "data deduplication" is to apply the idea of parallelization to filter data in a meaningful way. Seemingly complex tasks such as counting the number of distinct records in a large dataset, or computing site visits from access logs, all involve data deduplication.
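A hedged sketch of the idea (again an illustration, not the original code): the mapper emits each whole record as a key, the shuffle stage groups identical keys together, and the reducer writes every distinct key exactly once.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Deduplication sketch: duplicates collapse because identical keys
    // are grouped into a single reduce() call.
    public static class DedupMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The whole record becomes the key; the value carries no information.
            context.write(value, NullWritable.get());
        }
    }

    public static class DedupReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            // Each distinct record is written exactly once.
            context.write(key, NullWritable.get());
        }
    }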

(3) Sorting: Sort by a key in ascending or descending order

(4) TopK: sort all the records in the source data and take the first K of them; that is TopK.

A heap is often used to solve TopK problems.
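For intuition, here is a plain-Java sketch of the heap approach (outside MapReduce, using java.util.PriorityQueue as a min-heap of size K): whenever the heap grows past K elements, evict its smallest, so it always holds the K largest values seen so far.

    import java.util.PriorityQueue;

    public class TopKHeap {
        public static long[] topK(long[] data, int k) {
            PriorityQueue<Long> minHeap = new PriorityQueue<Long>(k);
            for (long num : data) {
                minHeap.offer(num);
                if (minHeap.size() > k) {
                    minHeap.poll();          // drop the current smallest
                }
            }
            long[] result = new long[minHeap.size()];
            for (int i = result.length - 1; i >= 0; i--) {
                result[i] = minHeap.poll();  // extract ascending, fill from the back
            }
            return result;                   // descending order
        }

        public static void main(String[] args) {
            long[] sample = {5, 1, 9, 3, 7, 2, 8};
            for (long n : topK(sample, 3)) {
                System.out.println(n);       // prints 9, 8, 7
            }
        }
    }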

(5) Selection: a basic operation from relational algebra

Selection takes the qualifying tuples (records) from a given relation to form a new relation. In relational algebra, selection is an operation on tuples.

In MapReduce, taking the maximum or minimum value, for example picking the smallest row out of n rows of data, is a typical selection operation.

(6) Projection: a basic operation from relational algebra

Projection takes a subset of the attributes (fields) of a given relation to form a new relation of the same kind. Duplicate tuples that appear because attributes were dropped are automatically removed. Projection is an operation on attributes.

In MapReduce, for example, when we processed mobile internet logs earlier and selected five of the eleven fields in each record to show mobile internet traffic, that was a typical projection operation.
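A hedged sketch of such a projection mapper; the field positions below are made up for illustration, since the actual log layout is not reproduced in this note:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Projection sketch: keep only a subset of the columns of each record.
    public static class ProjectionMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length >= 11) {
                // Hypothetical indices: phone number plus four traffic-related columns.
                String projected = fields[1] + "\t" + fields[6] + "\t" + fields[7]
                        + "\t" + fields[8] + "\t" + fields[9];
                context.write(new Text(projected), NullWritable.get());
            }
        }
    }

Removing the duplicate records that projection can create would then be handled by a reducer that groups on the projected key, just as in the deduplication sketch above.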

(7) Group: GROUP BY XXXX

In MapReduce, grouping is similar to partitioning. Taking the mobile internet log as an example again, we split the records into two groups, mobile phone numbers and non-mobile phone numbers, and process them separately.
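One way to express this grouping is a custom Partitioner; the rule below (an 11-digit number starting with 1 counts as a mobile number) is only an illustration, not the original code:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Grouping sketch: route mobile-number keys and all other keys to different reducers.
    public static class PhonePartitioner extends Partitioner<Text, LongWritable> {
        public int getPartition(Text key, LongWritable value, int numPartitions) {
            boolean isMobile = key.toString().matches("1\\d{10}");
            return (isMobile ? 0 : 1) % numPartitions;
        }
    }

The job would also need job.setPartitionerClass(PhonePartitioner.class) and job.setNumReduceTasks(2) so that each group is handled by its own reduce task.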

(8) Multi-table join

(9) Single-table association (self-join)

2. The general TopK problem

The TopK problem is a very common practical problem: how to efficiently find the K largest (or smallest) values in a huge amount of data. Our previous approach was to load the entire data file into memory to sort and count. However, once a data file reaches a certain size, it can no longer be loaded into memory directly, unless you want to risk bringing the machine down.

So we turn to distributed computing and use a cluster of machines to do the job. For example, a task that takes one machine 10 hours might be finished in 1 hour by 10 machines computing in parallel. In this experiment we use a randomly generated file of 1,000,000 numbers, and the task is to find the 100 largest numbers among them.

Experimental data: http://pan.baidu.com/s/1qWt4WaS

2.1 Storing the top K values with a TreeMap

(1) The red-black tree implementation

How to store the top K values is the core of the TopK problem; here we use Java's TreeMap. TreeMap is implemented with the red-black tree algorithm. A red-black tree, also called a red-black binary tree, is first of all a binary tree with all of its properties, and in addition it is a self-balancing sorted binary tree.

A balanced binary tree has the following property: it is either an empty tree, or the absolute difference in height between its left and right subtrees is at most 1 and both subtrees are themselves balanced binary trees. In other words, at every node of the tree, the heights of the left and right subtrees are close.

A red-black tree is, as the name suggests, a balanced binary tree whose nodes are either red or black; it keeps the tree balanced through constraints on these colors.

Note: for a detailed introduction to TreeMap and red-black trees, see Chenssy's article on TreeMap and red-black trees (reference (3) below); it will not be repeated here.

(2) The put() method of TreeMap

The implementation of TreeMap's put() has two main steps: first, insert into the sorted binary tree; second, rebalance the tree.

To keep the binary tree balanced, left rotations, right rotations and recoloring are often needed. The purpose of these operations is to maintain balance and keep the tree ordered, and this ordering is exactly what gives us ordered storage of the data.
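As a small plain-Java illustration (a java.util.TreeMap demo, not MapReduce code), this is the put / remove-firstKey pattern that the map and reduce functions below rely on to keep only the K largest values:

    import java.util.TreeMap;

    // A size-bounded TreeMap keeps only the K largest keys:
    // entries are stored sorted by key, so firstKey() is always the current minimum.
    public class TreeMapTopK {
        public static void main(String[] args) {
            final int K = 3;
            TreeMap<Long, Long> tm = new TreeMap<Long, Long>();
            long[] sample = {42, 7, 99, 13, 58};
            for (long n : sample) {
                tm.put(n, n);
                if (tm.size() > K) {
                    tm.remove(tm.firstKey());   // evict the smallest to keep the K largest
                    // for the K smallest instead: tm.remove(tm.lastKey());
                }
            }
            System.out.println(tm.descendingKeySet());  // prints [99, 58, 42]
        }
    }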

2.2 Writing the map and reduce functions

(1) Map function

    public static class MyMapper extends
            Mapper<LongWritable, Text, NullWritable, LongWritable> {
        public static final int K = 100;
        private TreeMap<Long, Long> tm = new TreeMap<Long, Long>();

        protected void map(LongWritable key, Text value,
                Mapper<LongWritable, Text, NullWritable, LongWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            try {
                long temp = Long.parseLong(value.toString().trim());
                tm.put(temp, temp);
                if (tm.size() > K) {
                    // Evict the smallest key so only the K largest values remain
                    tm.remove(tm.firstKey());
                    // If we want the K smallest values instead, use:
                    // tm.remove(tm.lastKey());
                }
            } catch (Exception e) {
                context.getCounter("TopK", "errorlog").increment(1L);
            }
        }

        protected void cleanup(
                Mapper<LongWritable, Text, NullWritable, LongWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            // Emit this map task's K largest values to the reduce task
            for (Long num : tm.values()) {
                context.write(NullWritable.get(), new LongWritable(num));
            }
        }
    }

The cleanup() method runs after the map() method has processed all records; here we use it to pass the top 100 values from each map task on to the reduce task.

(2) Reduce function

    public static class MyReducer extends
            Reducer<NullWritable, LongWritable, NullWritable, LongWritable> {
        public static final int K = 100;
        private TreeMap<Long, Long> tm = new TreeMap<Long, Long>();

        protected void reduce(NullWritable key, Iterable<LongWritable> values,
                Reducer<NullWritable, LongWritable, NullWritable, LongWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            for (LongWritable num : values) {
                tm.put(num.get(), num.get());
                if (tm.size() > K) {
                    // Evict the smallest key so only the K largest values remain
                    tm.remove(tm.firstKey());
                    // If we want the K smallest values instead, use:
                    // tm.remove(tm.lastKey());
                }
            }
            // Output the keys in descending order (largest to smallest)
            for (Long value : tm.descendingKeySet()) {
                context.write(NullWritable.get(), new LongWritable(value));
            }
        }
    }

In the reduce method, the values passed in from the map tasks are put into a TreeMap, whose red-black balancing keeps the data in order.

(3) Complete code

(The complete code is collapsed in the original post.)

(4) Result: due to the limited size of the screenshot, only the first 12 values are shown.

3. A special type of TopK: the maximum value

Finding the maximum value is a typical selection operation: find the largest (or smallest) number among the 1,000,000 numbers. In this experiment's file, the largest number is 32767. Now let's rewrite the code to find 32767.

3.1 Rewriting the Map function
    public static class MyMapper extends
            Mapper<LongWritable, Text, LongWritable, NullWritable> {
        long max = Long.MIN_VALUE;

        protected void map(LongWritable key, Text value,
                Mapper<LongWritable, Text, LongWritable, NullWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            long temp = Long.parseLong(value.toString().trim());
            if (temp > max) {
                max = temp;
            }
        }

        protected void cleanup(
                Mapper<LongWritable, Text, LongWritable, NullWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            // After all input lines are processed, emit this map task's maximum
            context.write(new LongWritable(max), NullWritable.get());
        }
    }

Does it look familiar? We simply compare each value in turn against the current assumed maximum.

3.2 Rewriting the reduce function
    public static class MyReducer extends
            Reducer<LongWritable, NullWritable, LongWritable, NullWritable> {
        long max = Long.MIN_VALUE;

        protected void reduce(LongWritable key, Iterable<NullWritable> values,
                Reducer<LongWritable, NullWritable, LongWritable, NullWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            long temp = key.get();
            if (temp > max) {
                max = temp;
            }
        }

        protected void cleanup(
                Reducer<LongWritable, NullWritable, LongWritable, NullWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            // After all keys are processed, emit the global maximum
            context.write(new LongWritable(max), NullWritable.get());
        }
    }

In the reduce method we keep comparing the values passed in from each map task against the current assumed maximum, and after all calls to reduce have finished, the cleanup() method outputs the maximum value.

The final complete code is as follows:

(The complete code is collapsed in the original post.)

3.3 Viewing the implementation results


As you can see, the program has found the maximum value: 32767. Although the example and the business logic are very simple, we have applied the idea of distributed computing and used MapReduce to solve a max-value problem, which is progress!

Resources

(1) Chao Wu, "Hadoop in Layman's Terms": http://www.superwu.cn/

(2) sunddenly, "Hadoop Diary Day 18: MapReduce Sorting and Grouping": http://www.cnblogs.com/sunddenly/p/4009751.html

(3) Chenssy, "Java Improvement Series: TreeMap": http://blog.csdn.net/chenssy/article/details/26668941

Original link: http://edisonchou.cnblogs.com/
