Fast construction of MapReduce algorithms, built on accumulated groundwork

Original: http://www.kamang.net/node/223





Readers are impatient and I will not ramble, so the conclusion first: ideally you should not need to write any code at all; just drag a few icons with the mouse, change some parameters, and you have a complete distributed program for processing hundreds of millions of records.





Of course, that ideal has not been fully reached yet, but the road is laid out plainly in front of us, and we are at least halfway there.





First of all, the MapReduce model itself comes from functional programming, so it is natural to use FP ideas to construct these algorithms. The earlier version of the program was developed in Haskell; there is now a new version written in Python.





After doing some practical MapReduce work, I found that many problems share a few basic algorithmic patterns, and those patterns are quite simple. I will sum them up properly in a follow-up post; for now, here is my own (admittedly home-brewed) summary:


MapReduce Algorithm Patterns





1. Meta pattern: the MR chain


Multiple MapReduce passes can be strung together to implement arbitrarily complex statistical algorithms.


This can also be called the data-flow pattern.


2. Map patterns


These include two: field count and field join.


3. Reduce patterns


Key count, value sum, nub count (count of distinct values) and value join; two of these are sketched right after this list.
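To make the reduce patterns concrete, here is a minimal sketch of a streaming-style reducer that applies key count and nub count at once, assuming sorted "key<TAB>value" lines on stdin as in Hadoop streaming; the names and output format are illustrative, not the author's actual operators (map-side operators are sketched under function currying below):

    import sys
    from itertools import groupby

    def read_pairs(stream):
        # yield (key, value) pairs from sorted "key<TAB>value" lines
        for line in stream:
            key, _, value = line.rstrip('\n').partition('\t')
            yield key, value

    def reduce_stream(pairs):
        # for each run of identical keys, apply two reduce patterns at once
        for key, group in groupby(pairs, key=lambda kv: kv[0]):
            values = [v for _, v in group]
            print('c_%s\t%d' % (key, len(values)))        # key count: records per key
            print('nc_%s\t%d' % (key, len(set(values))))  # nub count: distinct values per key

    if __name__ == '__main__':
        reduce_stream(read_pairs(sys.stdin))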


Core Ideas





(terminology borrowed from the Java world)


1. Data-flow programming: source data flows into one end of the MR network, is processed along a chain of steps, and the final result comes out the other end; the chain can have several branches.


2. Combinatorial programming: generic mapper and reducer operators are combined to achieve complex functionality.


This composition is multiplicative: combined with the MR chain, the complexity you can handle multiplies.


Try to keep each operator simple and atomic, and keep their functions orthogonal.


3. Function currying: by binding parameters, generic operators can be specialized into user-defined functions.
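For example, a generic field-count mapper like the m_field_count(2) and m_field_count(3) used later could be produced by currying a single function; a minimal sketch under the assumption of tab-separated input (the function and variable names here are mine, for illustration only):

    from functools import partial

    def field_count_mapper(field_index, name, record):
        # generic mapper: key = name plus the chosen field's value,
        # value = the last field (the uuid), so a downstream nub count can count distinct users
        fields = record.rstrip('\n').split('\t')
        return '%s_%s\t%s' % (name, fields[field_index], fields[-1])

    # currying: bind the parameters once to get concrete, reusable operators
    m_province_count = partial(field_count_mapper, 2, 'province')
    m_uuid_count = partial(field_count_mapper, 3, 'uuid')

    record = '"03-09-2008 17:11:10"\t1987636648\t"Sichuan"\t"0ce12c9121ca8e2484440b4459781bdb"'
    print(m_province_count(record))   # province_"Sichuan"<TAB>"0ce12c9121ca8e2484440b4459781bdb"
    print(m_uuid_count(record))       # uuid_"0ce12c9121ca8e2484440b4459781bdb"<TAB>"0ce12c9121ca8e2484440b4459781bdb"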


Application Example


Input Data





time    IP    province    user UUID


"03-09-2008 17:11:10" 1987636648 "Sichuan" "0ce12c9121ca8e2484440b4459781bdb"


"03-09-2008 17:11:15" 1018955844 "Zhejiang" "19173bb499f4b0a62f19afeb5ba5017a"


"03-09-2008 17:11:18" 2030878566 "Guangdong" "B596b9655d2acd4d449d5262c1b9d3be"


"03-09-2008 17:11:19" 1947385333 "Guangdong" "9cf2210902bbf421e9df1cb384b65cc7"


"03-09-2008 17:11:24" 1964392548 "Shaanxi" "7ebe2805fbdfab3c7b11395cb76364f4"


"03-09-2008 17:11:35" 3722701596 "Jiangsu" "Cda23cc1ebac208168c8af1c88d03e55"


"03-09-2008 17:11:09" 1034301425 "Yunnan" "5573f458f859e35d7ddca346fd1a35a8"


"03-09-2008 17:11:09" 1987636648 "Sichuan" "0ce12c9121ca8e2484440b4459781bdb"


"03-09-2008 17:11:09" 1987636648 "Sichuan" "0ce12c9121ca8e2484440b4459781bdb"


"03-09-2008 17:11:10" 1987636648 "Sichuan" "0ce12c9121ca8e2484440b4459781bdb"





Statistical Requirements





1. The number of UUIDs reported in each province.

2. The number of reports made by each UUID.

3. For each distinct report count, how many UUIDs have that count.


Process





Two MR passes: the first produces the results for the first two requirements, and its intermediate output is fed to the second MR pass, which produces the result for the third requirement.





Task Description





test_tasks = {
    'task1': {'name':   'task1',
              'input':  'userinfo.test',
              # each entry is presumably (mr name, (line filter, mapper), [reducers])
              'mrs':    [('province', ('', 'm_field_count(2)'), ['keycount', 'nubcount']),
                         ('uuid',     ('', 'm_field_count(3)'), ['keycount']),
                        ],
              'output': 'task1.out',
              'next':   ['task2'],
             },

    'task2': {'name':   'task2',
              'input':  'task1.out',
              # only the c_uuid_* lines of task1.out feed this MR
              'mrs':    [('uuid_count_nub', ('c_uuid', 'm_field_join(1, 0)'), ['nubcount']),
                        ],
              'output': 'task2.out',
              'next':   [],
             },
}





The framework reads this task description and automatically generates the test run script and 4 programs:





run.sh





#!/bin/sh
cat userinfo.test | python task1_map.py | sort | python task1_reduce.py > task1.out
cat task1.out | python task2_map.py | sort | python task2_reduce.py > task2.out





task1_map.py, task1_reduce.py, task2_map.py, task2_reduce.py are automatically generated.
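For illustration, here is a minimal sketch of how such a framework might walk the task description and emit the run.sh above; generate_run_script and the linear-chain simplification (only the first entry of 'next' is followed) are assumptions of mine, not the author's actual generator:

    def generate_run_script(tasks, start='task1'):
        # emit a local test pipeline (cat | map | sort | reduce) for a linear MR chain
        lines = ['#!/bin/sh']
        name = start
        while name:
            task = tasks[name]
            lines.append('cat %s | python %s_map.py | sort | python %s_reduce.py > %s'
                         % (task['input'], name, name, task['output']))
            # follow only the first successor; branching chains would need more care
            name = task['next'][0] if task['next'] else None
        return '\n'.join(lines)

    # print(generate_run_script(test_tasks))   # test_tasks is the configuration shown above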





Test run:





Task1:





$ head -n 10 userinfo.test | ./task1_map.py | sort | python task1_reduce.py





C_province_ "Yunnan" 1


Nc_province_ "Yunnan" 1


C_province_ "Sichuan" 4


Nc_province_ "Sichuan" 1


C_province_ "Guangdong" 2


Nc_province_ "Guangdong" 1


C_province_ "Jiangsu" 1


Nc_province_ "Jiangsu" 1


C_province_ "Zhejiang" 1


Nc_province_ "Zhejiang" 1


C_province_ "Shaanxi" 1


Nc_province_ "Shaanxi" 1


c_uuid_ "0CE12C9121CA8E2484440B4459781BDB" 4


c_uuid_ "19173bb499f4b0a62f19afeb5ba5017a" 1


c_uuid_ "5573f458f859e35d7ddca346fd1a35a8" 1


C_uuid_ "7EBE2805FBDFAB3C7B11395CB76364F4" 1


c_uuid_ "9CF2210902BBF421E9DF1CB384B65CC7" 1


c_uuid_ "B596b9655d2acd4d449d5262c1b9d3be" 1


c_uuid_ "Cda23cc1ebac208168c8af1c88d03e55" 1





Task2:





$ head -n 10 userinfo.test | ./task1_map.py | sort | python task1_reduce.py | python task2_map.py | sort | python task2_reduce.py





nc_uuid_count_nub_1 6


nc_uuid_count_nub_4 1





To actually run it, just throw it onto Hadoop, as described in the previous article.
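As a reminder of what that looks like with Hadoop Streaming, something along these lines (the streaming jar path varies by installation and the exact options the author used are in the previous article, so treat this only as an indicative command):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -input userinfo.test \
        -output task1.out \
        -mapper task1_map.py \
        -reducer task1_reduce.py \
        -file task1_map.py \
        -file task1_reduce.py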





Throughout the whole process, you only need to write a configuration file describing each task and which map and reduce operators each task uses.


Follow-up work





Polish the framework and the automatic program generation, etc.


Collect and organize mapper and reducer operators.


A web- or GUI-based MR chain designer.