Exploring a Mini-MapReduce in C#


In recent years, the MapReduce distributed-computing programming model has become quite popular. This article gives a brief introduction to MapReduce-style distributed computation, using C# as the example language.

Contents

    1. Background
    2. Map implementation
    3. Reduce implementation
    4. Distributed support
    5. Summary
Background

In a parallel world, programmer Xiao Zhang received a task from his boss: count the number of occurrences of each word in the user-feedback text, in order to analyze users' main habits. The text is as follows:

        const string Hamlet = @"Though yet of Hamlet our dear brother's death the memory be green, and that it us befitted to bear our hearts in grief and our whole kingdom to be contracted in one brow of woe, yet so far hath discretion fought with nature that we with wisest sorrow think on him, together with remembrance of ourselves. Therefore our sometime sister, now our queen, the imperial jointress to this warlike state, have we, as 'twere with a defeated joy,--with an auspicious and a dropping eye, with mirth in funeral and with dirge in marriage, in equal scale weighing delight and dole,--taken to wife: nor have we herein barr'd your better wisdoms, which have freely gone with this affair along. For all, our thanks. Now follows, that you know, young Fortinbras, holding a weak supposal of our worth, or thinking by our late dear brother's death our state to be disjoint and out of frame, colleagued with the dream of his advantage, he hath not fail'd to pester us with message, importing the surrender of those lands lost by his father, with all bonds of law, to our most valiant brother. So much for him. Now for ourself and for this time of meeting: thus much the business is: we have here writ to Norway, uncle of young Fortinbras,--who, impotent and bed-rid, scarcely hears of this his nephew's purpose,--to suppress his further gait herein; in that the levies, the lists and full proportions, are all made out of his subject: and we here dispatch you, good Cornelius, and you, Voltimand, for bearers of this greeting to old Norway; giving to you no further personal power to business with the king, more than the scope of these delated articles allow. Farewell, and let your haste commend your duty.";

Xiao Zhang, an old hand at this, quickly wrote:

   var content = Hamlet.Split(new[] { " ", Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
   var wordCount = new Dictionary<string, int>();
   foreach (var item in content)
   {
       if (wordCount.ContainsKey(item))
           wordCount[item] += 1;
       else
           wordCount.Add(item, 1);
   }
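For comparison, the same count can be expressed with LINQ's GroupBy, which already behaves like a tiny in-memory map-group-reduce. This snippet is an illustration added here, not part of Xiao Zhang's code:

```csharp
using System;
using System.Linq;

class WordCountDemo
{
    static void Main()
    {
        // A tiny stand-in text; the article uses the Hamlet constant instead.
        var text = "to be or not to be";

        // GroupBy plays the role of map + partition; Count() is the reduce step.
        var counts = text
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(word => word)
            .ToDictionary(g => g.Key, g => g.Count());

        foreach (var pair in counts)
            Console.WriteLine($"{pair.Key}: {pair.Value}");
    }
}
```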

As a self-motivated youth, Xiao Zhang decided to encapsulate the algorithm behind an abstraction and to support multi-node computation. He split the counting program into two major steps: decomposition and calculation.
Step one: decompose the text along some dimension into the smallest independent units (paragraph, word, or letter).
Step two: combine and count the repeated minimum units.
Following the MapReduce paper, Xiao Zhang designed the map and reduce functions as follows:

Map implementation

Mapping

The mapping function maps the text into smallest units in (key, value) form, i.e. <word, occurrence count (1)>, for example <word, 1>.

     public IEnumerable<Tuple<T, int>> Mapping(IEnumerable<T> list)
     {
         foreach (var item in list)
             yield return Tuple.Create(item, 1);
     }

When used, the output is (brow, 1), (brow, 1), (sorrow, 1), (sorrow, 1):

            var spit = Hamlet.Split(new[] { " ", Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
            var mp = new MicroMapReduce<string>(new Master<string>());
            var result = mp.Mapping(spit);
Combine

To reduce data-communication overhead, duplicate keys in the mapping output are merged before entering the real reduce step. Doing part of the calculation in advance also speeds up the overall computation. The output format is (brow, 2), (sorrow, 2):

  public Dictionary<T, int> Combine(IEnumerable<Tuple<T, int>> list)
  {
      Dictionary<T, int> dt = new Dictionary<T, int>();
      foreach (var val in list)
      {
          if (dt.ContainsKey(val.Item1))
              dt[val.Item1] += val.Item2;
          else
              dt.Add(val.Item1, val.Item2);
      }
      return dt;
  }
Partitioner

The partitioner is mainly used for grouping: the statistics from different nodes are grouped by key.
The output format is (brow, {2, 3}), (sorrow, {10, 11}):

 public IEnumerable<Group<T, int>> Partitioner(Dictionary<T, int> list)
 {
     var dict = new Dictionary<T, Group<T, int>>();
     foreach (var val in list)
     {
         if (!dict.ContainsKey(val.Key))
             dict[val.Key] = new Group<T, int>(val.Key);
         dict[val.Key].Values.Add(val.Value);
     }
     return dict.Values;
 }
View Code

Group definition:

     public class Group<TKey, TValue> : Tuple<TKey, List<TValue>>
     {
         public Group(TKey key)
             : base(key, new List<TValue>())
         {
         }

         public TKey Key
         {
             get { return base.Item1; }
         }

         public List<TValue> Values
         {
             get { return base.Item2; }
         }
     }
Reduce implementation

The reducing function receives the grouped data and performs the final statistical calculation on it.

 public Dictionary<T, int> Reducing(IEnumerable<Group<T, int>> groups)
 {
     Dictionary<T, int> result = new Dictionary<T, int>();
     foreach (var sourceVal in groups)
     {
         result.Add(sourceVal.Key, sourceVal.Values.Sum());
     }
     return result;
 }

The encapsulated calls are as follows:

  public IEnumerable<Group<T, int>> Map(IEnumerable<T> list)
  {
      var step1 = Mapping(list);
      var step2 = Combine(step1);
      var step3 = Partitioner(step2);
      return step3;
  }

  public Dictionary<T, int> Reduce(IEnumerable<Group<T, int>> groups)
  {
      var step1 = Reducing(groups);
      return step1;
  }
   public Dictionary<T, int> MapReduce(IEnumerable<T> list)
   {
       var map = Map(list);
       var reduce = Reduce(map);
       return reduce;
   }
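For completeness, here is a hypothetical sketch of the MicroMapReduce<T> class declaration that the usage code implies but the article never shows. The Master<T> below is only a minimal stand-in so the sketch compiles on its own; the article's full Master<T> (with a Merge method) appears in the distributed section:

```csharp
using System.Collections.Generic;

// Minimal stand-in so this sketch is self-contained; the article defines
// the real Master<T>, which also carries a thread-safe Merge method.
public class Master<T>
{
    public Dictionary<T, int> Result = new Dictionary<T, int>();
}

// Hypothetical skeleton of the class the usage code constructs via
// new MicroMapReduce<string>(new Master<string>()).
public class MicroMapReduce<T>
{
    public Master<T> Master { get; private set; }

    public MicroMapReduce(Master<T> master)
    {
        Master = master;
    }

    // The Mapping, Combine, Partitioner, Reducing, Map, Reduce, and
    // MapReduce methods shown above belong in this class.
}
```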

The overall calculation step diagram is as follows:

Distributed support

After this abstraction and encapsulation, the internal complexity has gone up, but what is exposed to the user is a very clear interface: any data that satisfies MapReduce's format requirements can be fed in directly.

            var spit = Hamlet.Split(new[] { " ", Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
            var mp = new MicroMapReduce<string>(new Master<string>());
            var result1 = mp.MapReduce(spit);

After finishing, Xiao Zhang's imagination ran wild: the volume of text data will be very large in the future. So he forked a branch to support distributed computing, so that the program could later run on multiple server nodes.

Data sharding

Data sharding splits a large volume of data into pieces scattered across the nodes, so that our MapReduce program can compute on them. Mainstream sharding schemes include mod, consistent hashing, virtual buckets, range partitioning, and so on. Consistent hashing was covered in the previous article (Exploring consistent hashing in C#). In Hadoop, HDFS and MapReduce are interrelated partners, one for storage and one for computation; a self-built implementation likewise needs unified storage, so the data source here could be a database or a file. Xiao Zhang only needs to satisfy his boss's requirements, so the general computing framework can be used directly.
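As a minimal illustration of the simplest scheme mentioned above, mod sharding, here is a sketch. This is added for illustration and is not Xiao Zhang's code; the helper name ModShard is made up:

```csharp
using System;
using System.Collections.Generic;

public static class Sharding
{
    // Mod sharding: route each item to shards[hash(item) % shardCount].
    // Simple, but almost every item moves when shardCount changes,
    // which is the problem consistent hashing addresses.
    public static List<List<T>> ModShard<T>(IEnumerable<T> source, int shardCount)
    {
        var shards = new List<List<T>>();
        for (int i = 0; i < shardCount; i++)
            shards.Add(new List<T>());

        foreach (var item in source)
        {
            // Mask off the sign bit so the index is never negative.
            int index = (item.GetHashCode() & int.MaxValue) % shardCount;
            shards[index].Add(item);
        }
        return shards;
    }
}
```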

Simulating shards

 public List<IEnumerable<T>> Partition(IEnumerable<T> list)
 {
     var temp = new List<IEnumerable<T>>();
     temp.Add(list);
     temp.Add(list);
     return temp;
 }
Worker node

Xiao Zhang defines Master and Worker roles. The Master is responsible for collecting the output; it is our main program. Each worker is simulated with a thread, and its final output is merged into the Master, which can then write the result to a database or elsewhere.

 public void WorkerNode(IEnumerable<T> list)
 {
     new Thread(() =>
     {
         var map = Map(list);
         var reduce = Reduce(map);
         Master.Merge(reduce);
     }).Start();
 }

 public class Master<T>
 {
     public Dictionary<T, int> Result = new Dictionary<T, int>();

     public void Merge(Dictionary<T, int> list)
     {
         foreach (var item in list)
         {
             lock (this)
             {
                 if (Result.ContainsKey(item.Key))
                     Result[item.Key] += item.Value;
                 else
                     Result.Add(item.Key, item.Value);
             }
         }
     }
 }

Distributed Computing Step diagram:

Summary

The MapReduce model is not outstanding in raw performance; its advantages are hiding the details of distributed computation, fault tolerance, load balancing, and a good programming API, together with a large data-processing ecosystem of frameworks including HDFS, Hive, and so on. When the data volume is not very large, a lightweight distributed-computing framework implemented in-house has many advantages, such as better performance, customizability, and no need to import and export data through a database. It also saves a lot of cost, because developing, operating, and maintaining Hadoop and its servers requires substantial manpower and resources.
