Excerpt: notes on the FlumeJava paper

Source: Internet
Author: User
Tags: shuffle

These notes do not claim to be a complete or fully accurate account of the paper.


The original MapReduce model has three phases: map, shuffle, and reduce.

The map phase processes the input in shards.

The shuffle phase can be understood as a groupByKey. The reduce phase can include a combiner, and a user-defined sharder controls how keys are assigned to reducer workers.


Core abstractions and primitives

PCollection<T> is an immutable bag. It can be ordered (a sequence) or unordered (a collection). A PCollection can be created from an in-memory Java collection object, or read from a file.

PTable<K, V> can be regarded as a PCollection<Pair<K, V>>: an immutable, unordered multi-map.

The first primitive is parallelDo(). It turns a PCollection<T> into a new PCollection<S>, with the per-element processing defined in a DoFn<T, S>. An EmitFn<S> callback is passed into the user's process(...) method, and outputs are emitted via emitFn.emit(outElem). parallelDo() can be used in either the map or the reduce phase; a DoFn should not use global state outside its closure, and should operate purely on its own inputs.
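
FlumeJava itself is not publicly available, so the following is only a minimal sketch simulating parallelDo's semantics on a plain Java List; the DoFn and EmitFn interfaces here are local stand-ins mirroring the names above, not the real API.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-ins for FlumeJava's DoFn and EmitFn; this only simulates
// the per-element semantics of parallelDo on an in-memory List.
class ParallelDoSketch {
    interface EmitFn<S> { void emit(S elem); }
    interface DoFn<T, S> { void process(T input, EmitFn<S> emitFn); }

    // parallelDo applies the DoFn to each element independently; each call
    // may emit zero, one, or many output elements.
    static <T, S> List<S> parallelDo(List<T> input, DoFn<T, S> fn) {
        List<S> output = new ArrayList<>();
        for (T elem : input) fn.process(elem, output::add);
        return output;
    }

    public static void main(String[] args) {
        // One input element may emit many outputs, e.g. splitting lines into words.
        List<String> words = parallelDo(List.of("a b", "c"),
                (line, emit) -> { for (String w : line.split(" ")) emit.emit(w); });
        System.out.println(words); // [a, b, c]
    }
}
```

Because each element is processed independently, calls to the DoFn can run on any worker in any order, which is what makes parallelDo usable in both map and reduce phases.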

The second primitive is groupByKey(). It turns a PTable<K, V> into a PTable<K, Collection<V>>.
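
As a sketch (again with plain collections rather than the real library), groupByKey can be modeled as collecting a list of key/value entries into a map from key to the list of that key's values:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Models PTable<K, V> as a List of entries and the grouped result
// PTable<K, Collection<V>> as a Map from key to value list.
class GroupByKeySketch {
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> table) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : table)
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        return grouped;
    }

    public static void main(String[] args) {
        var table = List.of(Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3));
        System.out.println(groupByKey(table)); // {a=[1, 3], b=[2]}
    }
}
```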

The third primitive is combineValues(). It takes a PTable<K, Collection<V>> and an associative combining function over V, and returns a PTable<K, V>.
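
A sketch of the same idea in-memory, with the combining function passed as a BinaryOperator (the associativity requirement is what lets a real runtime apply it partially as a combiner before the shuffle):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// combineValues collapses each group with an associative combining
// function, turning PTable<K, Collection<V>> back into PTable<K, V>.
class CombineValuesSketch {
    static <K, V> Map<K, V> combineValues(Map<K, List<V>> grouped, BinaryOperator<V> combiner) {
        Map<K, V> combined = new LinkedHashMap<>();
        grouped.forEach((key, values) ->
                combined.put(key, values.stream().reduce(combiner).orElseThrow()));
        return combined;
    }
}
```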

The fourth primitive is flatten(). It takes a list of PCollection<T>s and returns a single PCollection<T>.
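
In this in-memory sketch, flatten is simply concatenation; an unordered PCollection implies no ordering guarantee on the result:

```java
import java.util.ArrayList;
import java.util.List;

// flatten concatenates several collections into one logical collection.
class FlattenSketch {
    static <T> List<T> flatten(List<List<T>> parts) {
        List<T> out = new ArrayList<>();
        for (List<T> part : parts) out.addAll(part);
        return out;
    }
}
```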


Derived operations

count() takes a PCollection<T> and returns a PTable<T, Integer>.

It is implemented with parallelDo(), groupByKey(), and combineValues().
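
The three-step composition can be sketched concretely (simulated with plain collections, not the real primitives):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// count derived from the three primitives, as the notes describe.
class CountSketch {
    static <T> Map<T, Integer> count(List<T> input) {
        // Step 1, parallelDo: emit (elem, 1) for each element.
        List<Map.Entry<T, Integer>> pairs = new ArrayList<>();
        for (T t : input) pairs.add(Map.entry(t, 1));
        // Step 2, groupByKey: collect the 1s for each key.
        Map<T, List<Integer>> grouped = new LinkedHashMap<>();
        for (var e : pairs)
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        // Step 3, combineValues: sum each group with the associative "+".
        Map<T, Integer> counts = new LinkedHashMap<>();
        grouped.forEach((key, ones) ->
                counts.put(key, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }
}
```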

join() takes a PTable<K, V1> and a PTable<K, V2>, and returns a PTable<K, Tuple2<Collection<V1>, Collection<V2>>>.

It is implemented in steps. First, parallelDo() converts each input PTable<K, Vi> into a common PTable<K, TaggedUnion2<V1, V2>>. Second, flatten() combines the tables. Third, groupByKey() is applied to the flattened table, producing a PTable<K, Collection<TaggedUnion2<V1, V2>>>. A final parallelDo() then splits each group's tagged values into the tuple of two collections.
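
A sketch of these steps with plain collections; Tagged here is a hypothetical stand-in for the paper's TaggedUnion2 type, and a Map.Entry of two lists stands in for the result tuple:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Join derived as described: tag each value with its side, flatten,
// groupByKey, then split the tagged values back into two collections.
class JoinSketch {
    // Hypothetical stand-in for TaggedUnion2: exactly one side is non-null.
    record Tagged<V1, V2>(V1 left, V2 right) { }

    static <K, V1, V2> Map<K, Map.Entry<List<V1>, List<V2>>> join(
            List<Map.Entry<K, V1>> left, List<Map.Entry<K, V2>> right) {
        // Steps 1 and 2: tag each value (parallelDo) and merge both
        // tagged tables into one (flatten).
        List<Map.Entry<K, Tagged<V1, V2>>> tagged = new ArrayList<>();
        for (var e : left) tagged.add(Map.entry(e.getKey(), new Tagged<V1, V2>(e.getValue(), null)));
        for (var e : right) tagged.add(Map.entry(e.getKey(), new Tagged<V1, V2>(null, e.getValue())));
        // Step 3: groupByKey, then a final parallelDo-like pass splits
        // each group's tagged values into the two collections.
        Map<K, Map.Entry<List<V1>, List<V2>>> joined = new LinkedHashMap<>();
        for (var e : tagged) {
            Map.Entry<List<V1>, List<V2>> pair = joined.computeIfAbsent(
                    e.getKey(), k -> Map.entry(new ArrayList<V1>(), new ArrayList<V2>()));
            if (e.getValue().left() != null) pair.getKey().add(e.getValue().left());
            else pair.getValue().add(e.getValue().right());
        }
        return joined;
    }
}
```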

top() takes a comparison function and a count N, and returns the greatest N elements of its input PCollection<T>.

It is implemented with parallelDo(), groupByKey(), and combineValues().
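
The distributed version maps everything under a single key, groups, and combines by keeping only the best N; in an in-memory sketch that combiner collapses to a sort plus truncation:

```java
import java.util.Comparator;
import java.util.List;

// top: in the distributed form, every element goes under one key and the
// combiner keeps only the N greatest; in-memory this is sort + limit.
class TopSketch {
    static <T> List<T> top(List<T> input, Comparator<T> cmp, int n) {
        return input.stream().sorted(cmp.reversed()).limit(n).toList();
    }
}
```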


Deferred evaluation

A PCollection object is in one of two states: deferred or materialized.

FlumeJava.run() is what actually triggers materialization, i.e. execution of the deferred execution plan.
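
A minimal sketch of the two states (the names are illustrative, not FlumeJava's actual internals): a deferred collection remembers the operation that would produce it, and run() materializes it once.

```java
import java.util.List;
import java.util.function.Supplier;

// A deferred collection holds its producing operation; nothing executes
// until run() is called, which materializes the result exactly once.
class DeferredSketch<T> {
    private List<T> materialized;            // null while still deferred
    private final Supplier<List<T>> producer;

    DeferredSketch(Supplier<List<T>> producer) { this.producer = producer; }

    boolean isMaterialized() { return materialized != null; }

    List<T> run() {                          // analogue of FlumeJava.run()
        if (materialized == null) materialized = producer.get();
        return materialized;
    }
}
```

Building up the whole plan lazily is what gives the optimizer a complete dataflow graph to rewrite before anything executes.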


PObjects

A PObject<T> is used to hold a single Java object; after the plan runs, its value can be retrieved with the getValue() method. It is a little like a future.

The operate() method allows further computation over PObjects to be deferred as well.


Optimizer

ParallelDo fusion

There are two kinds: producer-consumer fusion and sibling fusion. For example:


Generally speaking, if A, B, C, and D are parallelDo operations over the same input, they can be fused into a single parallelDo, A+B+C+D, that does all the processing in one pass.
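
Producer-consumer fusion can be sketched with two element-wise functions: the unfused form materializes an intermediate collection, while the fused form applies both functions in one pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Producer-consumer fusion sketch: f then g, unfused vs fused.
class FusionSketch {
    // Unfused: two passes, with an intermediate collection in between.
    static <A, B, C> List<C> twoPasses(List<A> in, Function<A, B> f, Function<B, C> g) {
        List<B> mid = new ArrayList<>();
        for (A a : in) mid.add(f.apply(a));
        List<C> out = new ArrayList<>();
        for (B b : mid) out.add(g.apply(b));
        return out;
    }

    // Fused: one pass; the intermediate collection is never materialized.
    static <A, B, C> List<C> fused(List<A> in, Function<A, B> f, Function<B, C> g) {
        List<C> out = new ArrayList<>();
        for (A a : in) out.add(g.apply(f.apply(a)));
        return out;
    }
}
```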

Some intermediate results can then be avoided entirely, i.e. never materialized.

MapShuffleCombineReduce (MSCR) operation

The core of the FlumeJava optimizer converts combinations of parallelDo, groupByKey, combineValues, and flatten into a single MapReduce.

MSCR is an intermediate-level operation with M input channels (each performing a map operation) and R output channels (each performing a shuffle, an optional combine, and a reduce operation).

Each input channel m takes a PCollection<Tm> as input and runs an R-output parallelDo "map" operation on it, producing R outputs of type PTable<Kr, Vr>. Each output channel r flattens its M inputs and then either:

a) performs a groupByKey "shuffle", an optional combineValues "combine", and an Or-output parallelDo "reduce", writing the results to Or output PCollections; or

b) writes its input directly as its output.

An output channel of the first kind is called a "grouping" channel; one of the second kind is called a "pass-through" channel. Pass-through channels allow the output of a mapper to also be a result of the MSCR operation.



Each MSCR operation can be executed as a single MapReduce. MSCR makes MapReduce more general, by:

• allowing multiple reducers and combiners;

• allowing each reducer to produce multiple outputs;

• removing the constraint that each reducer must produce outputs with the same key as its input;

• allowing pass-through outputs.

This makes MSCR a very good intermediate target for the optimizer.


MSCR Fusion

An MSCR operation is produced from a set of related groupByKey operations. GroupByKey operations are related if they consume the same input (possibly through flatten operations) or inputs created by the same parallelDo operation.

This part is rather obscure and hard to follow, but the core idea is as described above.



Global optimization Strategy

The goal of optimization is to end up with as few, and as efficient, MSCR operations as possible in the final execution plan.

1. Sink flattens. A flatten operation is pushed down through the parallelDo that consumes it, rewriting h(f(a) + g(b)) into h(f(a)) + h(g(b)), a distributive law. This then enables parallelDo fusion, e.g. (h∘f)(a) + (h∘g)(b).
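
The distributive rewrite can be checked concretely with list functions standing in for parallelDos (map and flatten here are illustrative in-memory stand-ins):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Checks the sink-flatten rewrite: h(flatten(A, B)) == flatten(h(A), h(B))
// for an element-wise (parallelDo-like) operation h.
class SinkFlattenSketch {
    static <T, S> List<S> map(List<T> xs, Function<T, S> fn) {
        List<S> out = new ArrayList<>();
        for (T x : xs) out.add(fn.apply(x));
        return out;
    }

    static <T> List<T> flatten(List<T> first, List<T> second) {
        List<T> out = new ArrayList<>(first);
        out.addAll(second);
        return out;
    }
}
```

Once h is applied per branch, it sits directly against each branch's producer and ordinary parallelDo fusion can merge them.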

2. Lift combineValues. If a combineValues immediately follows a groupByKey, it can be treated as an ordinary parallelDo and become subject to parallelDo fusion.

3. Insert fusion blocks. If two groupByKey operations are connected by a producer-consumer chain of parallelDos, the optimizer chooses which parallelDos to fuse up into the output channel of the first groupByKey and which to fuse down into the input channel of the second.

4. Fuse parallelDos.

5. Fuse MSCRs.


The paper illustrates these strategies on a worked example, with detailed diagrams of the execution plan at each step, which is very helpful for understanding.


Shortcomings of the optimizer and future work

The optimizer does not analyze user-written functions, for example to estimate input and output data sizes.

It does not modify user code as part of optimization.

Some analysis would be needed to avoid redundant computation and to remove unnecessary or unreasonable groupByKey operations.


Executor

After optimization is complete, the executor runs the resulting plan.

It currently supports batch-mode job submission.

In operation, FlumeJava also assists development and debugging: it automatically creates and deletes temporary files, automatically decides how to run each operation (e.g. locally or as a remote parallel job), and so on.


Applause:)

