Introduction to Apache Crunch, a Java library for simplifying MapReduce programming

Source: Internet
Author: User

Apache Crunch (an Apache Incubator project) is a Java library, based on Google's FlumeJava library, for creating MapReduce pipelines. Like other high-level tools for creating MapReduce jobs, such as Apache Hive, Apache Pig, and Cascading, Crunch provides a library of patterns for common tasks such as joining data, performing aggregations, and sorting records. Unlike those tools, Crunch does not force all input to conform to a single data type. Instead, it uses a customizable type system that is flexible enough to work directly with complex data such as time series, HDF5 files, Apache HBase tables, and serialized objects (such as Protocol Buffers or Avro records).

Crunch does not try to stop developers from thinking in MapReduce terms; it tries to make that thinking easier. For all its advantages, MapReduce is not the right level of abstraction for many problems: most interesting computations are made up of multiple MapReduce jobs, and for performance reasons it is often necessary to combine logically independent operations (such as data filtering, data projection, and data transformation) into a single physical MapReduce job.
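To make the idea of combining logically independent operations concrete, here is a minimal single-machine sketch in plain Java (9 or later). It does not use Crunch or Hadoop, and the names FusionSketch, Stage, then, runSinglePass, and userNamesUpper are hypothetical, chosen only for illustration. Three logical stages (filter, projection, transformation) are composed into one function, so the input is traversed once instead of three times, which is roughly the effect of packing such stages into one physical MapReduce job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Illustrative only: a toy model of stage fusion, not the Crunch API.
final class FusionSketch {

    // A logical stage turns one input record into zero or one output records.
    interface Stage<A, B> {
        Optional<B> apply(A in);

        // Compose this stage with the next one into a single fused stage.
        default <C> Stage<A, C> then(Stage<B, C> next) {
            return in -> apply(in).flatMap(next::apply);
        }
    }

    // Run a (possibly fused) stage over the input in one pass.
    static <A, B> List<B> runSinglePass(List<A> input, Stage<A, B> stage) {
        List<B> out = new ArrayList<>();
        for (A record : input) {
            stage.apply(record).ifPresent(out::add);
        }
        return out;
    }

    // Assumes every non-comment line has the form "id,name".
    static List<String> userNamesUpper(List<String> csvLines) {
        // Filter: drop comment lines.
        Stage<String, String> filter =
            line -> line.startsWith("#") ? Optional.empty() : Optional.of(line);
        // Projection: keep only the second field.
        Stage<String, String> project = line -> Optional.of(line.split(",")[1]);
        // Transformation: normalize the value.
        Stage<String, String> transform = name -> Optional.of(name.trim().toUpperCase());

        return runSinglePass(csvLines, filter.then(project).then(transform));
    }

    public static void main(String[] args) {
        // Prints [ADA, GRACE]
        System.out.println(userNamesUpper(List.of("# id,name", "1, ada", "2,grace")));
    }
}
```

The three stages stay separate in the source code, which keeps each one easy to test, while the work is still done in a single traversal of the data.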

In essence, Crunch is designed to be a thin layer on top of MapReduce that lets developers work at the right level of abstraction for the problem at hand, without sacrificing any of MapReduce's power (or the developer's access to the MapReduce APIs).

Although Crunch is reminiscent of the long-established Cascading API, their data models are very different. As a rule of thumb, people who think about problems as data flows tend to prefer Crunch and Pig, while people who think in terms of SQL-style joins tend to prefer Cascading and Hive.

Crunch concepts

PCollection<T> and PTable<K, V> are Crunch's core abstractions. A PCollection<T> represents a distributed, immutable collection of objects; PTable<K, V> is a sub-interface of PCollection that adds methods for working with key-value pairs. These two core types support the following four basic operations:

parallelDo: Applies a user-defined function to a given PCollection and returns a new PCollection as the result.

groupByKey: Sorts and groups the elements of a PTable by key (equivalent to the shuffle phase of a MapReduce job).

combineValues: Performs an associative operation to aggregate the values produced by groupByKey.

union: Treats two or more PCollections as a single virtual PCollection.
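The semantics of the four primitives above can be sketched with ordinary in-memory collections. The following is a toy single-machine model in plain Java (9 or later), not the Crunch API: the class PrimitivesSketch and its method signatures are assumptions made for illustration, and lists and maps stand in for distributed collections. It shows how a word count can be expressed using only the four operations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Illustrative only: a toy in-memory model of the four primitives, not the Crunch API.
final class PrimitivesSketch {

    // parallelDo: apply a function that may emit zero or more outputs per input element.
    static <S, T> List<T> parallelDo(List<S> input, Function<S, List<T>> fn) {
        List<T> out = new ArrayList<>();
        for (S element : input) {
            out.addAll(fn.apply(element));
        }
        return out;
    }

    // groupByKey: sort by key and collect all values that share a key.
    static <K extends Comparable<K>, V> TreeMap<K, List<V>> groupByKey(List<Map.Entry<K, V>> table) {
        TreeMap<K, List<V>> grouped = new TreeMap<>();
        for (Map.Entry<K, V> pair : table) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // combineValues: fold each key's values with an associative operator.
    static <K extends Comparable<K>, V> TreeMap<K, V> combineValues(
            TreeMap<K, List<V>> grouped, BinaryOperator<V> op) {
        TreeMap<K, V> combined = new TreeMap<>();
        for (Map.Entry<K, List<V>> group : grouped.entrySet()) {
            // Every group built by groupByKey holds at least one value.
            combined.put(group.getKey(), group.getValue().stream().reduce(op).get());
        }
        return combined;
    }

    // union: view two collections as one.
    static <T> List<T> union(List<T> first, List<T> second) {
        List<T> out = new ArrayList<>(first);
        out.addAll(second);
        return out;
    }

    // Word count over two inputs, built only from the four primitives.
    static Map<String, Long> wordCount(List<String> a, List<String> b) {
        List<String> lines = union(a, b);
        List<Map.Entry<String, Long>> pairs = parallelDo(lines, line -> {
            List<Map.Entry<String, Long>> emitted = new ArrayList<>();
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    emitted.add(Map.entry(word, 1L));
                }
            }
            return emitted;
        });
        return combineValues(groupByKey(pairs), Long::sum);
    }

    public static void main(String[] args) {
        // Prints {a=2, b=2, c=1}
        System.out.println(wordCount(List.of("a b a"), List.of("b c")));
    }
}
```

The operator passed to combineValues must be associative because, in a real distributed setting, partial results for a key may be combined in any grouping before the final aggregation.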

All of Crunch's higher-level operations (joins, cogroups, set operations, and so on) are implemented in terms of these basic primitives. The Crunch job planner takes the graph of operations defined by the pipeline developer, breaks it down into a series of dependent MapReduce jobs, and executes them on a Hadoop cluster. Crunch also supports an in-memory execution engine that can be used to test and debug pipelines on local data.

Crunch is designed for problems that benefit from many user-defined functions operating on custom data types. User-defined functions in Crunch are lightweight, yet still provide full access to the underlying MapReduce APIs for applications that need it. Crunch developers can also use the Crunch primitives to define APIs that give clients advanced ETL, machine learning, and scientific computing capabilities involving a series of complex MapReduce jobs.

Getting started with Crunch

You can download the latest source code or binaries from the Crunch website at http://incubator.apache.org/crunch/download.html, or use the artifacts published to Maven.
