Design of Apache Crunch (I)

Background

Apache Crunch is an implementation of FlumeJava. For MapReduce programs that are hard to develop and use directly, it provides a pipeline layer on top of MR with a data representation model, basic and advanced primitives, and optimization of MR job execution on the underlying engine. From the perspective of distributed computing, many of the primitives Crunch provides have close counterparts in Spark, Hive, Pig, and elsewhere, and its handling of data reading/writing, serialization, grouping, sorting, and aggregation mirrors the corresponding stages of a MapReduce job; traces of all of them can be found in Hadoop.

This article introduces the design and implementation of Crunch in terms of the data representation model, operation primitives, and serialization; the different Pipeline implementations and the integration with the Hadoop MR and Spark engines will be covered in subsequent articles. As mentioned above, much of this content has counterparts in Hadoop, Spark, Pig, and elsewhere.

Reading Crunch's design and source code helps in understanding the FlumeJava paper, in analyzing how a MapReduce computation is composed of its various stages, and in becoming familiar with the Hadoop MR job APIs; it offers good implementation ideas.


Reference: the Crunch User Guide and source code.


Comparison of seven computing presentation layers built on top of Hadoop:



Data Models and basic classes

Abstract interfaces for the three distributed datasets: PCollection, PTable, and PGroupedTable

• PCollection<T> represents a distributed, immutable dataset. It provides the parallelDo and union methods; parallelDo applies a DoFn to each element and returns a new PCollection<U>.

• PTable<K, V> is a PCollection<Pair<K, V>> that represents a distributed, unordered multimap. In addition to parallelDo inherited from PCollection, it overrides the union method and provides the groupByKey method. groupByKey corresponds to the sort/shuffle phase of a MapReduce job; in the groupByKey operation developers can exercise fine-grained control over the number of reducers and the partitioning, grouping, and sorting strategies used during the shuffle (see the GroupingOptions class).

• PGroupedTable<K, V> is the result of a groupByKey operation. It represents a distributed, sorted map with an Iterable of values per key; its implementation is a PCollection<Pair<K, Iterable<V>>>. In addition to parallelDo and union inherited from PCollection, it provides the combineValues method, which allows aggregation operators that are commutative and associative (see the Aggregator class) to be applied to the values of the PGroupedTable on either the map or the reduce side of the shuffle.
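A minimal sketch of how the three abstractions chain together in a word count, assuming a Pipeline instance named pipeline and the Writables type family; the input path is a placeholder:

PCollection<String> lines = pipeline.readTextFile("/user/crunch/in");
// parallelDo with a PTableType yields a PTable<K, V>.
PTable<String, Long> ones = lines.parallelDo(new DoFn<String, Pair<String, Long>>() {
  @Override
  public void process(String line, Emitter<Pair<String, Long>> emitter) {
    for (String word : line.split("\\s+")) {
      emitter.emit(Pair.of(word, 1L));
    }
  }
}, Writables.tableOf(Writables.strings(), Writables.longs()));
// groupByKey shuffles into a PGroupedTable; combineValues aggregates the values per key.
PGroupedTable<String, Long> grouped = ones.groupByKey();
PTable<String, Long> counts = grouped.combineValues(Aggregators.SUM_LONGS());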


The two basic primitive interfaces of PCollection:
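Roughly, the two primitives have the following shape (a hedged sketch; see PCollection<S> for the exact overloads):

<T> PCollection<T> parallelDo(DoFn<S, T> doFn, PType<T> type);
PCollection<S> union(PCollection<S>... collections);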



The other data transformation operations in org.apache.crunch.lib are built from the four primitives above.


Other methods provided by PCollection:

count(), min(), max(), aggregate(Aggregator)

filter(), cache()


PTable provides the following methods:



PObject<T>, designed along the lines of FlumeJava, is used to hold a single Java object. Once it has been materialized, its value can be obtained with the getValue() method.
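For example, a hedged sketch using PCollection.length(), which returns a PObject<Long>:

PCollection<String> lines = ...;
PObject<Long> count = lines.length();   // no job runs yet
Long numLines = count.getValue();       // materializes the value and returns it to the client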

 

Data flows in from a Source, is processed by the Pipeline, and is finally written out to a Target.

Three Pipeline implementations are provided: MRPipeline, MemPipeline, and SparkPipeline.
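A hedged sketch of driving a job with MRPipeline; WordCountDriver, counts, and the paths are placeholders:

Pipeline pipeline = new MRPipeline(WordCountDriver.class, new Configuration());
PCollection<String> lines = pipeline.readTextFile("/user/crunch/in");
// ... apply parallelDo / groupByKey / combineValues here to build `counts` ...
pipeline.writeTextFile(counts, "/user/crunch/out");
PipelineResult result = pipeline.done();   // plan, submit, and wait for the MR jobs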


DoFn data processing

A Hadoop MapReduce job sets its map and reduce classes through the job configuration (job.xml) and instantiates them by reflection. Crunch's approach is to use Java serialization: a DoFn implements java.io.Serializable, is serialized, shipped to the tasks of the pipeline, and invoked there. Pay attention to serialization when writing DoFns (transient and static fields, and so on), especially under MRPipeline and SparkPipeline.

 

A DoFn has access to the contents of TaskInputOutputContext (the context class of a Hadoop task), and a DoFn may run on either the map or the reduce side. During execution the initialize method is called first, analogous to the setup method of a Mapper or Reducer; for example, a non-serializable third-party class (declared transient) can be instantiated there. The process method is then invoked for each input, and the results are passed on by an Emitter, for example to the next DoFn. Finally, once all input has been processed, cleanup is executed; it can pass some state on to the next stage and release resources.
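A hedged sketch of the DoFn lifecycle: a transient, non-serializable helper is created in initialize(), used in process(), and released in cleanup(). JsonParser and Record are hypothetical stand-ins for a third-party class and an output type:

public class ParseJson extends DoFn<String, Record> {
  private transient JsonParser parser;   // not serialized along with the DoFn

  @Override
  public void initialize() {             // like Mapper/Reducer setup()
    parser = new JsonParser();
  }

  @Override
  public void process(String line, Emitter<Record> emitter) {
    emitter.emit(parser.parse(line));    // emit results to the next stage
  }

  @Override
  public void cleanup(Emitter<Record> emitter) {
    parser = null;                       // release resources / flush state
  }
}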

 

A DoFn also shares some machinery with Hadoop MR, such as counter increments, get/set of the context and Configuration, and scaleFactor, which estimates the size of the data after processing and can influence how task execution is optimized (for example, the number of reducers and the I/O). The default scaleFactor is 0.99, and subclasses override this value, as mentioned below.
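For example, a hedged sketch: a DoFn that drops most of its input can hint this to the planner by overriding scaleFactor():

@Override
public float scaleFactor() {
  return 0.1f;  // output is expected to be roughly 10% of the input size
}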


FilterFn, MapFn, and CombineFn subclasses

The commonly used DoFn subclasses are FilterFn, MapFn, and CombineFn, which are easy to use and to test; they are used by the basic abstractions described above.

 

The process method of DoFn<S, T> implements the actual execution logic:

public abstract void process(S input, Emitter<T> emitter);

The Emitter corresponds to the output; its subclass hierarchy is as follows:



FilterFn<T> extends DoFn<T, T>. You implement its accept(T input) method, which returns a boolean; its process method calls accept to decide whether to emit the input.

  public void process(T input, Emitter<T> emitter) {
    if (accept(input)) {
      emitter.emit(input);
    }
  }

The filter method of PCollection takes a FilterFn implementation. FilterFn has composition helpers such as and, or, and not, which are defined in FilterFns. Its scaleFactor is 0.5, which is easy to understand.
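A hedged sketch of a FilterFn used with PCollection.filter(), keeping only non-empty lines:

PCollection<String> lines = ...;
PCollection<String> nonEmpty = lines.filter(new FilterFn<String>() {
  @Override
  public boolean accept(String input) {
    return input != null && !input.isEmpty();
  }
});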

 

MapFn<S, T> extends DoFn<S, T>, and you implement its map(S input) method, which returns a T. Its process method is as follows:

  public void process(S input, Emitter<T> emitter) {
    emitter.emit(map(input));
  }

Its scaleFactor is 1.0.
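A hedged sketch of a MapFn passed to parallelDo, producing exactly one output per input:

PCollection<String> lines = ...;
PCollection<Integer> lengths = lines.parallelDo(new MapFn<String, Integer>() {
  @Override
  public Integer map(String input) {
    return input.length();
  }
}, Writables.ints());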

 

CombineFn extends DoFn<Pair<S, Iterable<T>>, Pair<S, T>> and is used to pre-aggregate the map output before the reduce runs, cutting the network overhead of the shuffle; it is bound to combineValues() of PGroupedTable. CombineFn is often used together with the implementation subclasses of Aggregator.


Using PTypes to serialize data

The package corresponding to this section is org.apache.crunch.types, which contains the type-related classes.

 

PType<T> defines how data is serialized and deserialized; it is used in PCollection's parallelDo, for example in the simplest form:

<T> PCollection<T> parallelDo(DoFn<S, T> doFn, PType<T> type);

Because of the way the PCollection<T> model is designed, T is erased at runtime by type erasure, so the output type above must match the PType that is passed in, similar to the following:

PCollection<String> lines = ...;
lines.parallelDo(new DoFn<String, Integer>() { ... }, Writables.ints());

Crunch provides two PTypeFamily implementations: one based on Hadoop Writables and the other on Avro. Crunch remains fairly close to Apache Hadoop MR (at least before the emergence of Spark, pipelines could only target Hadoop MR).

PTypeFamily provides basic types such as these:

  PType<Void> nulls();
  PType<String> strings();
  PType<Long> longs();
  PType<Integer> ints();
  PType<Float> floats();
  PType<Double> doubles();
  PType<Boolean> booleans();
  PType<ByteBuffer> bytes();

To fit PTable, PType has another sub-hierarchy, PTableType<K, V>, which extends PType<Pair<K, V>>.
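A hedged sketch of constructing equivalent PTypes from the two type families (Writables uses Hadoop Writable serialization, Avros uses Avro):

PType<String> ws = Writables.strings();
PType<String> as = Avros.strings();
PTableType<String, Long> wt = Writables.tableOf(Writables.strings(), Writables.longs());
PTableType<String, Long> at = Avros.tableOf(Avros.strings(), Avros.longs());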

 

The construction and extension of the Avro and Writable variants of PType and PTableType are not described here.


Data read/write

Most data is read and written through Hadoop InputFormat/OutputFormat classes. This section briefly introduces the main classes and types.

 

The package corresponding to this section is org.apache.crunch.io, which contains the classes related to reading and writing data.


Source

Source<T> and TableSource<K, V> represent data sources; they correspond to PCollection and PTable respectively.

They are used in the read methods of Pipeline:

<T> PCollection<T> read(Source<T> source);
<K, V> PTable<K, V> read(TableSource<K, V> tableSource);

The org.apache.crunch.io.From class defines static methods that specify the data format and type (for example, Writable) when reading a data source and return a Source or TableSource.
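A hedged sketch of reading with the From factory class: a plain text file and a SequenceFile of key/value pairs; the paths are placeholders and pipeline is an existing Pipeline instance:

PCollection<String> lines = pipeline.read(From.textFile("/user/crunch/in"));
PTable<String, Long> table = pipeline.read(
    From.sequenceFile("/user/crunch/seq", Writables.strings(), Writables.longs()));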

 

The input types of the commonly used Sources are as follows:



Target

Target is defined similarly to Source and is mainly used in the write method of Pipeline. The common types are as follows:



Target supports several different WriteModes, defined in an enumeration, as shown in the following example:

PCollection<String> lines = ...;
// The default option is to fail if the output path already exists.
lines.write(At.textFile("/user/crunch/out"), WriteMode.DEFAULT);
// Delete the output path if it already exists.
lines.write(At.textFile("/user/crunch/out"), WriteMode.OVERWRITE);
// Add the output of the given PCollection to the data in the path
// if it already exists.
lines.write(At.textFile("/user/crunch/out"), WriteMode.APPEND);
// Use this directory as a checkpoint location, which requires that this
// be a SourceTarget, not just a Target:
lines.write(At.textFile("/user/crunch/out"), WriteMode.CHECKPOINT);


There is a special SourceTarget<T> interface that extends both Source<T> and Target and can act as both an input source and an output location.


Materialized data

PCollection has a materialize method:

  /**
   * Returns a reference to the data set represented by this PCollection that
   * may be used by the client to read the data locally.
   */
  Iterable<S> materialize();

It is evaluated lazily.
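A hedged sketch: materialize() itself runs nothing; iterating the returned Iterable is what forces the pipeline to execute and pulls the data to the client.

PCollection<String> words = ...;
Iterable<String> it = words.materialize();   // no job runs yet
for (String w : it) {                        // the pipeline executes here
  System.out.println(w);
}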


Data processing primitives

This section describes the data-processing classes under the org.apache.crunch.lib package, which provide the advanced primitives.


groupByKey

The three groupByKey methods of PTable control how data is shuffled and processed:

  PGroupedTable<K, V> groupByKey();
  PGroupedTable<K, V> groupByKey(int numPartitions);
  PGroupedTable<K, V> groupByKey(GroupingOptions options);

The first is the simplest shuffle; the number of output partitions is set by the planner based on its estimate of the data size.

The GroupingOptions argument of the third method provides finer-grained control over groupByKey, including how the data is partitioned, sorted, and grouped.

If the underlying execution engine is Hadoop, a Hadoop Partitioner and RawComparator are used for partitioning and sorting.

GroupingOptions is immutable. It is built with GroupingOptions.builder() and used like this:

GroupingOptions opts = GroupingOptions.builder()
    .groupingComparatorClass(MyGroupingComparator.class)
    .sortComparatorClass(MySortingComparator.class)
    .partitionerClass(MyPartitioner.class)
    .numReducers(N)
    .conf("key", "value")
    .conf("other key", "other value")
    .build();
PTable<String, Long> kv = ...;
PGroupedTable<String, Long> grouped = kv.groupByKey(opts);

combineValues

PTable obtains a PGroupedTable through groupByKey. Its combineValues allows the planner to decide whether the aggregation function is applied before and/or after the shuffle.

 

The static methods of Aggregators expose implementations of simple aggregation functions:

PTable<String, Double> data = ...;
// Sum the values of the doubles for each key.
PTable<String, Double> sums =
    data.groupByKey().combineValues(Aggregators.SUM_DOUBLES());
// Find the ten largest values for each key.
PTable<String, Double> maxes =
    data.groupByKey().combineValues(Aggregators.MAX_DOUBLES(10));
PTable<String, String> text = ...;
// Get a random sample of 100 unique elements for each key.
PTable<String, String> samp =
    text.groupByKey().combineValues(Aggregators.SAMPLE_UNIQUE_ELEMENTS(100));

Simple aggregations

See the implementation classes of Aggregator.


Joins

Inner join, left outer join, right outer join, and full outer join are supported; they are defined in the JoinType enumeration. JoinStrategy performs the join:

  PTable<K, Pair<U,V>> join(PTable<K, U> left, PTable<K, V> right, JoinType joinType);

The JoinStrategy implementations are:


Reduce-side joins

Corresponding to the DefaultJoinStrategy class, this is the simple and robust join in Hadoop: the processed records from the two inputs are shuffled to the same reducer, the smaller input is collected, and it is then joined against the larger input as it streams in.
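A hedged sketch of an explicit reduce-side join via DefaultJoinStrategy; Join.innerJoin(left, right) is a shorthand for the common case:

PTable<String, Long> left = ...;
PTable<String, String> right = ...;
JoinStrategy<String, Long, String> strategy = new DefaultJoinStrategy<String, Long, String>();
PTable<String, Pair<Long, String>> joined =
    strategy.join(left, right, JoinType.INNER_JOIN);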

 

Map-side joins

Corresponding to the MapsideJoinStrategy class: the smaller input must fit in memory, since the smaller table is cached in the memory of every task.

 

Sharded joins

Corresponding to the ShardedJoinStrategy class: records with the same key can be partitioned across multiple reducers to avoid overloading a few reducers with too much data; many distributed joins suffer from data skew, which can make some reducers run out of memory.

 

BloomFilter joins

Corresponding to the BloomFilterJoinStrategy class: suitable when the left table is too large to fit in memory but is still far smaller than the right table, and most keys of the right table do not match the left table.


Cogroups

The cogroup in Crunch is similar to the cogroup in Pig. It accepts multiple PTables and, for each key, outputs one bag that collects the values from each input. Cogroup is the first step in processing a join operation.
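A hedged sketch of Cogroup.cogroup on two tables: for each key, the values from the two inputs are gathered into a pair of collections.

PTable<String, Long> left = ...;
PTable<String, String> right = ...;
PTable<String, Pair<Collection<Long>, Collection<String>>> grouped =
    Cogroup.cogroup(left, right);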


Sorting
Others

Cartesian products, coalescing, distinct, sampling, set operations, etc.






