For quite some time, the big data community has recognized the inadequacy of batch data processing: many applications have an urgent need for real-time querying and stream processing. In recent years, this demand has spawned a series of solutions, with Twitter Storm, Yahoo S4, Cloudera Impala, Apache Spark, and Apache Tez joining the big data and NoSQL camps. In this article, we explore the techniques used in stream processing systems, analyze their relationship to large-scale batch processing and to OLTP/OLAP databases, and examine how a unified query engine can support streaming, batch, and OLAP processing at the same time.
At Grid Dynamics, we needed to build a stream processing system that handles 8 billion events per day while providing fault tolerance and strict transactionality, i.e., no event may be lost or duplicated. The new system complements and succeeds an existing Hadoop-based system whose data-processing latency and maintenance costs were too high. Such requirements and systems are fairly generic and typical, so we describe them below as a canonical model, an abstract problem statement.
The following figure shows a high-level overview of our production environment:
This is a typical big data infrastructure: applications in multiple data centers produce data; a data-collection subsystem ships the data to HDFS at a central facility; and the raw data is aggregated and analyzed with the standard Hadoop stack (MapReduce, Pig, Hive). The rollup results are stored in HDFS and NoSQL stores and then exported to an OLAP database that custom user applications query. Our goal is to equip the facility with a new streaming engine (see the bottom of the figure) that handles most of the intensive data flows, delivers pre-aggregated data to HDFS, and thereby reduces the amount of raw data in Hadoop and the load on batch jobs.
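As an illustration only, this data flow can be captured as a declarative topology description. The Python sketch below is hypothetical: every name in it (sources, paths, stage labels) is invented for this example and does not come from the actual deployment.

```python
# Hypothetical, declarative sketch of the canonical data flow described above.
topology = {
    "sources": ["app-dc1", "app-dc2"],        # applications in several data centers
    "collection": {                           # data-collection subsystem
        "sink": "hdfs://central/raw",
    },
    "batch": {                                # standard Hadoop stack
        "stack": ["MapReduce", "Pig", "Hive"],
        "input": "hdfs://central/raw",
        "outputs": ["hdfs://central/rollups", "nosql://rollups"],
    },
    "serving": {                              # rollups exported for user applications
        "export_to": "olap://reports",
    },
    "streaming": {                            # the new engine this article designs
        "input": "collection",
        "output": "hdfs://central/rollups",   # pre-aggregated data, less raw load
    },
}
```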
The design of the stream processing engine is driven by the following requirements:
SQL-like functionality: the engine must execute SQL-like queries, including joins over time windows and various aggregation functions that implement complex business logic. It should also be able to handle relatively static data (admixtures) loaded from the rollup data. More complex, multi-pass data mining algorithms are outside the short-term scope. (A sketch of such a windowed query appears after this list.)
Modularity and flexibility: the engine should not merely accept a SQL-like query and automatically create and deploy the corresponding pipeline; it should make it easy to connect modules and compose more complex data processing chains.
Fault tolerance: strict fault tolerance is a basic requirement for the engine. As in the sketch, one possible design is to implement joins, aggregations, or chains of such operations as distributed data-processing pipelines and to connect them through fault-tolerant, persistent buffers. These buffers enable a publish/subscribe style of communication and make it very convenient to add or remove pipelines, maximizing the modularity of the system. Pipelines can also be stateful, and the engine's middleware provides persistent storage to support a checkpoint mechanism for that state. All of these topics are discussed in later sections of this article; a minimal sketch of such a buffered, checkpointed pipeline follows this list.
Interaction with Hadoop: the engine should be able to consume both streaming data and data stored in Hadoop, serving as a custom query engine on top of HDFS.
High performance and portability: the engine should deliver thousands of messages per second even on clusters of minimal size. It should be compact and efficient, capable of being deployed on small clusters in multiple data centers.
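To make the SQL-like requirement concrete, here is a minimal Python sketch of a tumbling-window join with aggregation over two event streams. The field names (ts, user_id, amount), the 60-second window, and the join key are all hypothetical, chosen only for illustration; a real engine would evaluate an equivalent query plan incrementally rather than over in-memory lists.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical tumbling-window length

def window_of(event):
    """Assign an event to a tumbling window by its timestamp."""
    return event["ts"] // WINDOW_SECONDS

def windowed_join_and_aggregate(clicks, purchases):
    """Roughly what a SQL-like streaming query would express, e.g.:
       SELECT c.user_id, COUNT(*), SUM(p.amount)
       FROM clicks c JOIN purchases p ON c.user_id = p.user_id
       WITHIN WINDOW 60 SECONDS GROUP BY c.user_id
    (pseudo-SQL; the dialect is invented for this sketch)."""
    # Bucket one stream by (window, join key).
    clicks_by_key = defaultdict(list)
    for c in clicks:
        clicks_by_key[(window_of(c), c["user_id"])].append(c)

    # (window, user) -> [click_count, amount_sum]
    totals = defaultdict(lambda: [0, 0.0])
    for p in purchases:
        key = (window_of(p), p["user_id"])
        if key in clicks_by_key:  # join: purchase matches a click in the same window
            totals[key][0] += len(clicks_by_key[key])
            totals[key][1] += p["amount"]
    return dict(totals)

clicks = [{"ts": 5, "user_id": "u1"}, {"ts": 20, "user_id": "u2"}]
purchases = [{"ts": 30, "user_id": "u1", "amount": 9.99}]
print(windowed_join_and_aggregate(clicks, purchases))  # {(0, 'u1'): [1, 9.99]}
```

The point is the shape of the computation: events are bucketed by window, joined on a key within each bucket, and aggregated per group, exactly the primitives a SQL-like streaming dialect has to provide.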
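For the fault-tolerance requirement, the following is a minimal sketch, assuming a file-backed buffer in place of a durable publish/subscribe log and a local JSON file in place of replicated checkpoint storage; all class and file names are invented for this example. The structure, however, is the one described above: a pipeline consumes events from a persistent buffer, updates its state, and atomically checkpoints the read offset together with the state, so that after a crash it resumes from a consistent point with no loss and no double counting.

```python
import json, os

class PersistentBuffer:
    """Append-only, file-backed stand-in for a durable publish/subscribe buffer."""
    def __init__(self, path):
        self.path = path
        open(path, "a").close()  # make sure the log file exists

    def publish(self, event):
        with open(self.path, "a") as f:
            f.write(json.dumps(event) + "\n")

    def read_from(self, offset):
        """Return (offset, event) pairs for every event at or after `offset`."""
        with open(self.path) as f:
            lines = f.readlines()
        return [(offset + i, json.loads(line)) for i, line in enumerate(lines[offset:])]

class StatefulPipeline:
    """A pipeline stage that checkpoints (offset, state) together, atomically."""
    def __init__(self, buffer, checkpoint_path):
        self.buffer = buffer
        self.checkpoint_path = checkpoint_path
        self.offset, self.state = self._restore()

    def _restore(self):
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                cp = json.load(f)
            return cp["offset"], cp["state"]
        return 0, {"count": 0}

    def _checkpoint(self):
        tmp = self.checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"offset": self.offset, "state": self.state}, f)
        os.replace(tmp, self.checkpoint_path)  # atomic: offset and state stay consistent

    def run_once(self):
        for offset, event in self.buffer.read_from(self.offset):
            self.state["count"] += event.get("value", 0)  # any stateful operator goes here
            self.offset = offset + 1
        self._checkpoint()

buf = PersistentBuffer("events.log")
buf.publish({"value": 1})
buf.publish({"value": 2})
pipeline = StatefulPipeline(buf, "pipeline.ckpt")
pipeline.run_once()
print(pipeline.state)  # {'count': 3}
```

Because the offset and the state are written in a single atomic rename, a crash can never leave the pipeline with a state that reflects events its offset has not acknowledged; replay after a restart re-derives exactly the same state.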
To figure out how such a system can be implemented, we discuss the following topics:
First, we discuss the relationship between stream processing systems, batch processing systems, and relational query engines, since a streaming engine can reuse many techniques from these other types of systems.
Second, we describe a number of the patterns and techniques used in building streaming frameworks and systems. In addition, we survey emerging technologies and offer some implementation tips.
This article is based on a development project carried out at Grid Dynamics Labs. Much of the credit goes to Alexey Kharlamov and Rafael Bagmanov, who led the project, and to the other contributors: Dmitry Suslov, Konstantine Golikov, Evelina Stepanova, Anatoly Vinogradov, Roman Belous, and Varvara Strizhkova.
The basics of distributed query processing
Distributed stream processing is clearly related to distributed relational databases. Many standard query processing techniques can be applied to streaming engines, so it is very useful to understand the classical algorithms of distributed query processing and their relationship to streaming and to other popular frameworks such as MapReduce.
Distributed query processing has been developed over decades and constitutes a large body of knowledge. We begin with a concise overview of some key techniques that provide the basis for the discussion below.