About distributed computing in 10 minutes: Google Dataflow

Introduction

Google Cloud Dataflow is a method for building, managing, and optimizing complex data-processing pipelines. It integrates many of Google's internal technologies, for example FlumeJava for efficient data-parallel processing and MillWheel, with its strong fault-tolerance mechanism, for stream processing. Dataflow's current API is only available in Java (in fact, Flume itself provides Java/C++/Python interfaces).

Compared with the native MapReduce model, Dataflow has several advantages:

  1. You can build complex pipelines. Here is how Brian Goldfarb, product marketing director of the Google Cloud Platform, put it:

    Cloud Dataflow can be used to process both batch data and streaming data. In the talk, it analyzed millions of tweets in real time during a world event (such as the World Cup). One stage of the pipeline reads the tweets, the next stage extracts hashtags, another stage classifies the tweets (by sentiment: positive, negative, or other), and a further stage filters on keywords. In contrast, MapReduce, the earlier model for processing big data, cannot handle such real-time data and is hard to apply to such a long and complex data pipeline.
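The stages of that pipeline can be sketched in plain Java (this does not use the Dataflow SDK; the tweet strings and helper names are hypothetical illustrations of what each stage does):

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TweetStages {
    private static final Pattern TAG = Pattern.compile("#\\w+");

    // Stage "extract tags": pull hashtags out of one tweet
    static List<String> extractTags(String tweet) {
        Matcher m = TAG.matcher(tweet);
        List<String> tags = new java.util.ArrayList<>();
        while (m.find()) tags.add(m.group().toLowerCase());
        return tags;
    }

    // Stage "filter keywords": keep only tags matching a keyword
    static List<String> filterTags(List<String> tags, String keyword) {
        return tags.stream().filter(t -> t.contains(keyword)).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stage "read tweets": a made-up in-memory source
        List<String> tweets = List.of("Goal! #WorldCup #Argentina", "Lunch #foodie");
        List<String> tags = tweets.stream()
                .flatMap(t -> extractTags(t).stream())
                .collect(Collectors.toList());
        System.out.println(filterTags(tags, "world")); // prints [#worldcup]
    }
}
```

In Dataflow, each of these methods would become a stage that the service distributes and schedules for you; chaining plain functions like this is only a local model of the data flow.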

  2. You do not need to manually configure and manage MapReduce clusters. Automatic code optimization and resource scheduling let developers focus on the business logic itself.

  3. It supports seamless switching from batch to streaming mode. Suppose we want to implement hashtag auto-completion based on the content users generate on Twitter.

    Example: auto-completing hashtags

    Prefix | Suggestions
    ar     | #argentina, #arugularocks, #argylesocks
    arg    | #argentina, #argylesocks, #argonauts
    arge   | #argentina, #argentum, #argentine
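The table above boils down to a case-insensitive prefix match over known hashtags. A minimal batch-style sketch in plain Java (the tag list is made up for illustration):

```java
import java.util.List;
import java.util.stream.Collectors;

public class HashtagComplete {
    // Suggest hashtags whose text (after '#') starts with the typed prefix
    static List<String> suggest(List<String> tags, String prefix) {
        String p = prefix.toLowerCase();
        return tags.stream()
                .filter(t -> t.substring(1).toLowerCase().startsWith(p))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tags = List.of("#argentina", "#arugularocks", "#argylesocks", "#argonauts");
        System.out.println(suggest(tags, "ar"));   // all four match
        System.out.println(suggest(tags, "arg"));  // drops #arugularocks
    }
}
```

In the real pipeline the tag list would itself be computed from the tweet stream (and ranked by popularity); here it is a fixed list so the matching logic stands alone.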

    The batch and streaming versions of the code are almost exactly the same; in Dataflow there is little difference in how you write them. Dataflow abstracts data into PCollections ("parallel collections"). A PCollection can be an in-memory collection, data read from Cloud Storage, the result of a BigQuery table query, data read from Pub/Sub in streaming mode, or the output of user code. To process a PCollection, Dataflow provides many PTransforms ("parallel transforms"), such as ParDo ("parallel do"), which performs a specific operation on each element of a PCollection (similar to the map and reduce functions in MapReduce, or the WHERE clause in SQL), and GroupByKey, which takes a PCollection of key-value pairs and groups the pairs that share a key (similar to the shuffle step in MapReduce, or GROUP BY and JOIN in SQL). You can also combine these basic operations to define new transforms, and Dataflow itself provides some commonly used combined transforms, such as Count, Top, and Mean.

    This is a classic batch-processing example; to convert it to streaming, you only need to change the data source. If we want the model to surface the latest buzzwords and take the timeliness of the data into account, we only need to add one extra line to set a data window, for example to ignore data older than 60 minutes.
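The semantics of ParDo and GroupByKey described above can be imitated in plain Java to make them concrete (this is a sketch of what the two transforms do, not the Dataflow SDK itself):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TransformSketch {
    // ParDo-like: apply a function to every element ("parallel do")
    static List<String> parDo(List<String> input, java.util.function.Function<String, String> fn) {
        return input.stream().map(fn).collect(Collectors.toList());
    }

    // GroupByKey-like: collect values that share a key, as in the MapReduce shuffle
    static Map<String, List<String>> groupByKey(List<Map.Entry<String, String>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    public static void main(String[] args) {
        List<String> tags = List.of("#WorldCup", "#Argentina", "#WorldCup");
        List<String> norm = parDo(tags, String::toLowerCase);
        // A Count-style combined transform: group by key, then take group sizes
        Map<String, List<String>> grouped = groupByKey(
                norm.stream().map(t -> Map.entry(t, "1")).collect(Collectors.toList()));
        grouped.forEach((k, v) -> System.out.println(k + " -> " + v.size()));
    }
}
```

In Dataflow these operations run distributed over a PCollection rather than a local list, and the windowing line mentioned above would simply restrict which elements reach the grouping step.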

  4. Dashboard: you can also inspect the execution status of each stage of the pipeline in the developer console. Each box in the flow graph corresponds roughly to a line of code.

  5. Ecosystem: BigQuery complements Dataflow as a storage system. Data that has been cleaned and processed by Dataflow can be stored in BigQuery, and Dataflow can also read from BigQuery to join tables. If you want to use open-source resources with Dataflow (for example, Spark's machine learning libraries), that is also very convenient.

To work with Dataflow, the Google Cloud Platform also provides developers with a series of companion tools, including Cloud Storage, Cloud Debugger, Cloud Trace, and Cloud Monitoring.

Comparison
  1. Cascading / Twitter Scalding: (1) Traditional MapReduce can only run a single stage at a time, while Dataflow can build, automatically optimize, and schedule an entire pipeline. At first glance Dataflow feels very similar to Cascading (Java) / Scalding (Scala) on Hadoop. (2) Their programming models are very similar. Dataflow can also be tested locally: you can load a simulated data set and iterate on the computed results, which is beyond the reach of traditional MapReduce.
  2. Twitter Summingbird: the idea of seamlessly connecting batch and stream processing sounds like Twitter's Summingbird (Scala), which seamlessly bridges Scalding and Storm.
  3. Spark: (1) Spark also offers the benefits of building complex pipelines with code optimization and task scheduling, but programmers are still required to configure resource allocation. (2) When designing its distributed dataset API, Spark modeled it on Scala's collection API, so the extra syntax-learning cost is lower than Dataflow's. (3) However, Dataflow does not seem to mention in-memory computing, which is Spark's most essential feature; still, Spark can be used as an open-source tool and connected to the cloud framework as a complement. (4) In distributed computing, besides batch and streaming, graph processing is also an important topic. Spark has GraphX for this; Dataflow will also integrate graph processing in the future.
Reference

This article is based on official materials:

Sneak Peek: Google Cloud Dataflow, a Cloud-Native Data Processing Service

Google I/O 2014: The Dawn of "Fast Data"

Links

Google Cloud Dataflow

Cloud Dataflow: A New Computing Model for the Cloud Computing Era

Google Announces Cloud Dataflow Beta at Google I/O

Google Launches Cloud Dataflow, a Managed Data Processing Service

MapReduce Successor Google Cloud Dataflow Is a Game Changer for Hadoop

Papers

FlumeJava: Easy, Efficient Data-Parallel Pipelines. PLDI 2010.

MillWheel: Fault-Tolerant Stream Processing at Internet Scale. VLDB 2013, pp. 734-746.

When reprinting, please indicate the source: About distributed computing in 10 minutes: Google Dataflow
