Introduction
Google Cloud Dataflow is a service for building, managing, and optimizing complex data processing pipelines. It integrates a number of internal Google technologies, such as FlumeJava for efficient data-parallel processing and MillWheel for fault-tolerant stream processing. Dataflow's current API is available only in Java (Flume itself provides Java/C++/Python interfaces).
Compared with the native MapReduce model, Dataflow has several advantages:
- You can build complex pipelines. Brian Goldfarb, product marketing lead for Google Cloud Platform, gives an example: Cloud Dataflow can process both batch and stream data, such as analyzing millions of tweets in real time during a global event (the World Cup, in his talk). One stage of the pipeline reads the tweets, the next extracts the hashtags, another classifies each tweet by sentiment (positive, negative, or neutral), and a further stage filters by keyword (see the sketch after this list). By contrast, MapReduce, the earlier model for processing big data, cannot handle such real-time data and is hard to apply to such long and complex pipelines.
- You do not need to manually configure and manage MapReduce clusters. Automatic code optimization and resource scheduling let developers focus on the business logic itself.
- It supports seamless switching from batch to streaming mode. Suppose we want to implement hashtag auto-completion based on the content users generate on Twitter; the example below shows what the suggestions look like.
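Below is a minimal sketch of such a multi-stage pipeline in the Dataflow Java SDK (the classic SDK that preceded Apache Beam). The Pub/Sub topic name, the tag-extraction logic, and the keyword filter are illustrative assumptions, not code from the talk:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class TweetPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Stage 1: read tweets from Pub/Sub (topic name is a placeholder;
    // reading from Pub/Sub requires the pipeline to run in streaming mode).
    PCollection<String> tweets =
        p.apply(PubsubIO.Read.topic("projects/my-project/topics/tweets"));

    // Stage 2: extract hashtags from each tweet.
    PCollection<String> tags = tweets.apply(ParDo.of(new DoFn<String, String>() {
      @Override
      public void processElement(ProcessContext c) {
        for (String word : c.element().split("\\s+")) {
          if (word.startsWith("#")) {
            c.output(word.toLowerCase());
          }
        }
      }
    }));

    // Stage 3: filter by keyword; a sentiment classifier would simply be
    // another ParDo in the same chain.
    tags.apply(ParDo.of(new DoFn<String, String>() {
      @Override
      public void processElement(ProcessContext c) {
        if (c.element().contains("worldcup")) {
          c.output(c.element());
        }
      }
    }));

    p.run();
  }
}
```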
Example: auto-completing hashtags
| Prefix | Suggestions |
| ------ | ----------- |
| Ar | #Argentina, #arugularocks, #argylesocks |
| Arg | #Argentina, #argylesocks, #argonauts |
| Arge | #Argentina, #argentum, #Argentine |
The code for batch and stream processing is almost exactly the same; in Dataflow, writing one or the other makes little difference. Dataflow abstracts data into a PCollection ("parallel collection"), which can be an in-memory collection, read from Cloud Storage, queried from a BigQuery table, read from Pub/Sub in streaming mode, or computed by user code. To process a PCollection, Dataflow provides many PTransforms ("parallel transforms"), such as ParDo ("parallel do"), which performs a specific operation on each element of a PCollection (similar to the Map and Reduce functions in MapReduce, or WHERE in SQL), and GroupByKey, which takes a PCollection of key-value pairs and groups the pairs that share the same key (similar to the shuffle step in MapReduce, or GROUP BY and JOIN in SQL). You can also combine these basic operations to define new transformations, and Dataflow itself provides some commonly used composite transforms, such as Count, Top, and Mean.

The hashtag example above is a classic batch-processing job; to convert it to streaming, you only need to change the data source. If we want the model to surface the latest buzzwords and account for the timeliness of the data, we only need to add one extra line that sets a window over the data, for example, dropping data older than 60 minutes.
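As a concrete illustration, here is a minimal sketch of the batch version in the Dataflow Java SDK. The Cloud Storage path is invented for the example, and `tags` is produced by the same tag-extraction ParDo shown in the earlier sketch:

```java
// Batch source: a PCollection of tweets read from Cloud Storage.
PCollection<String> tweets = p.apply(TextIO.Read.from("gs://my-bucket/tweets-*.txt"));

// ... the same ParDo tag extraction as in the earlier sketch yields `tags` ...

// Count is one of Dataflow's built-in composite transforms
// (a combination of ParDo and GroupByKey under the hood).
PCollection<KV<String, Long>> tagCounts = tags.apply(Count.<String>perElement());
```

Switching to streaming then means swapping the source and adding the one windowing line the text describes. The choice of a sliding 60-minute window refreshed every 5 minutes is an assumption about what "no data older than 60 minutes" maps to (Window and SlidingWindows live in the SDK's windowing package; Duration is org.joda.time.Duration):

```java
// Streaming source plus a window: only the last 60 minutes of data
// feed into the suggestions, recomputed every 5 minutes (an assumed period).
PCollection<String> tweets = p
    .apply(PubsubIO.Read.topic("projects/my-project/topics/tweets"))
    .apply(Window.<String>into(
        SlidingWindows.of(Duration.standardMinutes(60))
                      .every(Duration.standardMinutes(5))));
```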
Dashboard: you can also follow the execution status of each stage of the pipeline in the Developers Console, where each box in the flow graph basically corresponds to a line of code.
Ecosystem: BigQuery complements Dataflow as a storage system. Data that has been cleaned and processed by Dataflow can be stored in BigQuery, and Dataflow can also read from BigQuery to perform table joins (a sketch follows below). Using open-source resources with Dataflow (for example, Spark's machine learning libraries) is also very convenient.
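As a sketch of this interplay, using the classic SDK's BigQueryIO (the table names and schema are invented for the example):

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import java.util.Arrays;

// Read a BigQuery table into a PCollection of rows.
PCollection<TableRow> rows =
    p.apply(BigQueryIO.Read.from("my-project:my_dataset.raw_tweets"));

// ... clean, join, or aggregate the rows with ParDo and GroupByKey ...

// Write the processed rows back to a BigQuery table.
TableSchema schema = new TableSchema().setFields(Arrays.asList(
    new TableFieldSchema().setName("tag").setType("STRING"),
    new TableFieldSchema().setName("count").setType("INTEGER")));

rows.apply(BigQueryIO.Write
    .to("my-project:my_dataset.clean_tweets")
    .withSchema(schema));
```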
To work with Dataflow, Google Cloud Platform also provides developers with a series of tools, including Cloud Storage, Cloud Debugger, Cloud Trace, and Cloud Monitoring.
Comparison
- Cascading / Twitter Scalding: 1) Traditional MapReduce can only run a single job at a time, while Dataflow can build the entire pipeline and automatically optimize and schedule it; at first glance, Dataflow feels very similar to Cascading (Java) / Scalding (Scala) on Hadoop. 2) Their programming models are very similar. Dataflow can also be tested locally: you can feed in a simulated data set and iterate on the computation results on top of it (see the sketch after this list), which is beyond the reach of traditional MapReduce.
- Twitter Summingbird: the idea of seamlessly connecting batch and stream processing sounds like Twitter's Summingbird (Scala), which seamlessly connects Scalding and Storm.
- Spark: 1) Spark also offers complex pipelines with code optimization and task scheduling, but it still requires programmers to configure resource allocation. 2) In designing its distributed dataset API, Spark mimics Scala's collection API, so the extra syntax learning cost is lower than Dataflow's. 3) However, Dataflow does not seem to mention in-memory computing, which is Spark's most essential feature; on the other hand, Spark, as an open-source tool, can be connected to the cloud framework as a complement. 4) In distributed computing, besides batch and streaming, graph processing is also an important topic. Spark has GraphX for this, and Dataflow will also integrate graph processing in the future.
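For reference, local iteration on a simulated data set might look like the following with the classic SDK's direct runner; the sample tweets are made up for the sketch:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.values.PCollection;

// Run the pipeline locally instead of on the Dataflow service.
PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(DirectPipelineRunner.class);
Pipeline p = Pipeline.create(options);

// An in-memory PCollection acts as the simulated input set.
PCollection<String> testTweets =
    p.apply(Create.of("I love #Argentina", "nice #argylesocks"))
     .setCoder(StringUtf8Coder.of());

// ... apply the same transforms as in production, then inspect the output ...
p.run();
```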
References

This article is based on official materials:
- Sneak peek: Google Cloud Dataflow, a Cloud-native data processing service
- Google I/O 2014 – The dawn of "Fast Data"

Links
- Google Cloud Dataflow
- Cloud Dataflow: a new computing model for the cloud computing era
- Google announces Cloud Dataflow beta at Google I/O
- Google launches Cloud Dataflow, a managed data processing service
- MapReduce successor Google Cloud Dataflow is a game changer for Hadoop thunder

Papers
- FlumeJava: Easy, Efficient Data-Parallel Pipelines. PLDI, 2010.
- MillWheel: Fault-Tolerant Stream Processing at Internet Scale. Very Large Data Bases (2013), pp. 734–746.
Please indicate the source when reprinting: Ten Minutes to Understand Distributed Computing: Google Dataflow.