Introduction
Google Cloud Dataflow is a service for building, managing, and optimizing complex data processing pipelines. It integrates a number of internal Google technologies, such as FlumeJava for efficient data-parallel processing and MillWheel for fault-tolerant stream processing. Dataflow's current API is available only in Java (Flume itself provides Java/C++/Python interfaces).
Compared with the native MapReduce model, Dataflow has several advantages:
- You can build complex pipelines. Brian Goldfarb, product marketing lead for Google Cloud Platform, gives an example: Cloud Dataflow can process both batch and stream data, such as analyzing millions of tweets in real time during a global event (the World Cup, in his talk). One stage of the pipeline reads the tweets, the next extracts the hashtags, another classifies each tweet by sentiment (positive, negative, or neutral), and a further stage filters by keyword (see the sketch after this list). By contrast, MapReduce, the earlier model for processing big data, cannot handle such real-time data and is hard to apply to such long and complex pipelines.
- You do not need to manually configure and manage MapReduce clusters. Automatic code optimization and resource scheduling let developers focus on the business logic itself.
- It supports seamless switching from batch to streaming mode. Suppose we want to implement hashtag auto-completion based on the content users generate on Twitter; the example below shows what the suggestions look like.
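Below is a minimal sketch of such a multi-stage pipeline in the Dataflow Java SDK (the classic SDK that preceded Apache Beam). The Pub/Sub topic name, the tag-extraction logic, and the keyword filter are illustrative assumptions, not code from the talk:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class TweetPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Stage 1: read tweets from Pub/Sub (topic name is a placeholder;
    // reading from Pub/Sub requires the pipeline to run in streaming mode).
    PCollection<String> tweets =
        p.apply(PubsubIO.Read.topic("projects/my-project/topics/tweets"));

    // Stage 2: extract hashtags from each tweet.
    PCollection<String> tags = tweets.apply(ParDo.of(new DoFn<String, String>() {
      @Override
      public void processElement(ProcessContext c) {
        for (String word : c.element().split("\\s+")) {
          if (word.startsWith("#")) {
            c.output(word.toLowerCase());
          }
        }
      }
    }));

    // Stage 3: filter by keyword; a sentiment classifier would simply be
    // another ParDo in the same chain.
    tags.apply(ParDo.of(new DoFn<String, String>() {
      @Override
      public void processElement(ProcessContext c) {
        if (c.element().contains("worldcup")) {
          c.output(c.element());
        }
      }
    }));

    p.run();
  }
}
```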
Example: auto-completing hashtags
| Prefix | Suggestions |
| ------ | ----------- |
| Ar | #Argentina, #arugularocks, #argylesocks |
| Arg | #Argentina, #argylesocks, #argonauts |
| Arge | #Argentina, #argentum, #Argentine |
The code for batch and stream processing is almost exactly the same; in Dataflow, writing one or the other makes little difference. Dataflow abstracts data into a PCollection ("parallel collection"), which can be an in-memory collection, read from Cloud Storage, queried from a BigQuery table, read from Pub/Sub in streaming mode, or computed by user code. To process a PCollection, Dataflow provides many PTransforms ("parallel transforms"), such as ParDo ("parallel do"), which performs a specific operation on each element of a PCollection (similar to the Map and Reduce functions in MapReduce, or WHERE in SQL), and GroupByKey, which takes a PCollection of key-value pairs and groups the pairs that share the same key (similar to the shuffle step in MapReduce, or GROUP BY and JOIN in SQL). You can also combine these basic operations to define new transformations, and Dataflow itself provides some commonly used composite transforms, such as Count, Top, and Mean.

The hashtag example above is a classic batch-processing job; to convert it to streaming, you only need to change the data source. If we want the model to surface the latest buzzwords and account for the timeliness of the data, we only need to add one extra line that sets a window over the data, for example, dropping data older than 60 minutes.
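As a concrete illustration, here is a minimal sketch of the batch version in the Dataflow Java SDK. The Cloud Storage path is invented for the example, and `tags` is produced by the same tag-extraction ParDo shown in the earlier sketch:

```java
// Batch source: a PCollection of tweets read from Cloud Storage.
PCollection<String> tweets = p.apply(TextIO.Read.from("gs://my-bucket/tweets-*.txt"));

// ... the same ParDo tag extraction as in the earlier sketch yields `tags` ...

// Count is one of Dataflow's built-in composite transforms
// (a combination of ParDo and GroupByKey under the hood).
PCollection<KV<String, Long>> tagCounts = tags.apply(Count.<String>perElement());
```

Switching to streaming then means swapping the source and adding the one windowing line the text describes. The choice of a sliding 60-minute window refreshed every 5 minutes is an assumption about what "no data older than 60 minutes" maps to (Window and SlidingWindows live in the SDK's windowing package; Duration is org.joda.time.Duration):

```java
// Streaming source plus a window: only the last 60 minutes of data
// feed into the suggestions, recomputed every 5 minutes (an assumed period).
PCollection<String> tweets = p
    .apply(PubsubIO.Read.topic("projects/my-project/topics/tweets"))
    .apply(Window.<String>into(
        SlidingWindows.of(Duration.standardMinutes(60))
                      .every(Duration.standardMinutes(5))));
```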
Dashboard: you can also follow the execution status of each stage of the pipeline in the Developers Console, where each box in the flow graph basically corresponds to a line of code.
Ecosystem: BigQuery complements Dataflow as a storage system. Data that has been cleaned and processed by Dataflow can be stored in BigQuery, and Dataflow can also read from BigQuery to perform table joins (a sketch follows below). Using open-source resources with Dataflow (for example, Spark's machine learning libraries) is also very convenient.
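As a sketch of this interplay, using the classic SDK's BigQueryIO (the table names and schema are invented for the example):

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import java.util.Arrays;

// Read a BigQuery table into a PCollection of rows.
PCollection<TableRow> rows =
    p.apply(BigQueryIO.Read.from("my-project:my_dataset.raw_tweets"));

// ... clean, join, or aggregate the rows with ParDo and GroupByKey ...

// Write the processed rows back to a BigQuery table.
TableSchema schema = new TableSchema().setFields(Arrays.asList(
    new TableFieldSchema().setName("tag").setType("STRING"),
    new TableFieldSchema().setName("count").setType("INTEGER")));

rows.apply(BigQueryIO.Write
    .to("my-project:my_dataset.clean_tweets")
    .withSchema(schema));
```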
To work with Dataflow, Google Cloud Platform also provides developers with a series of tools, including Cloud Storage, Cloud Debugger, Cloud Trace, and Cloud Monitoring.
Comparison
- Cascading / Twitter Scalding: 1) Traditional MapReduce can only run a single job at a time, while Dataflow can build the entire pipeline and automatically optimize and schedule it; at first glance, Dataflow feels very similar to Cascading (Java) / Scalding (Scala) on Hadoop. 2) Their programming models are very similar. Dataflow can also be tested locally: you can feed in a simulated data set and iterate on the computation results on top of it (see the sketch after this list), which is beyond the reach of traditional MapReduce.
- Twitter Summingbird: the idea of seamlessly connecting batch and stream processing sounds like Twitter's Summingbird (Scala), which seamlessly connects Scalding and Storm.
- Spark: 1) Spark also offers complex pipelines with code optimization and task scheduling, but it still requires programmers to configure resource allocation. 2) In designing its distributed dataset API, Spark mimics Scala's collection API, so the extra syntax learning cost is lower than Dataflow's. 3) However, Dataflow does not seem to mention in-memory computing, which is Spark's most essential feature; on the other hand, Spark, as an open-source tool, can be connected to the cloud framework as a complement. 4) In distributed computing, besides batch and streaming, graph processing is also an important topic. Spark has GraphX for this, and Dataflow will also integrate graph processing in the future.
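For reference, local iteration on a simulated data set might look like the following with the classic SDK's direct runner; the sample tweets are made up for the sketch:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.values.PCollection;

// Run the pipeline locally instead of on the Dataflow service.
PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(DirectPipelineRunner.class);
Pipeline p = Pipeline.create(options);

// An in-memory PCollection acts as the simulated input set.
PCollection<String> testTweets =
    p.apply(Create.of("I love #Argentina", "nice #argylesocks"))
     .setCoder(StringUtf8Coder.of());

// ... apply the same transforms as in production, then inspect the output ...
p.run();
```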
References

This article is based on official materials:
- Sneak peek: Google Cloud Dataflow, a Cloud-native data processing service
- Google I/O 2014 – The dawn of "Fast Data"

Links
- Google Cloud Dataflow
- Cloud Dataflow: a new computing model for the cloud computing era
- Google announces Cloud Dataflow beta at Google I/O
- Google launches Cloud Dataflow, a managed data processing service
- MapReduce successor Google Cloud Dataflow is a game changer for Hadoop thunder

Papers
- FlumeJava: Easy, Efficient Data-Parallel Pipelines. PLDI, 2010.
- MillWheel: Fault-Tolerant Stream Processing at Internet Scale. Very Large Data Bases (2013), pp. 734–746.
Please indicate the source when reprinting: Ten Minutes to Understand Distributed Computing: Google Dataflow.