Part III: Logs and Real-Time Stream Processing
So far, I have only described what amounts to a fancy method of copying data from place to place. But shuttling bytes between storage systems is not the end of the story. It turns out that "log" is another word for "stream," and logs are at the heart of stream processing.
But wait, what exactly is stream processing?
If you are a fan of the database literature or data infrastructure products of the late 90s and early 2000s, you likely associate stream processing with efforts to build SQL engines or "boxes and arrows" interfaces for event-driven processing.
If you follow the explosion of open source data systems, you likely associate stream processing with some of the systems in that space: Storm, Akka, S4, and Samza, for example. But most people see these as a kind of asynchronous message-processing system, not that different from a cluster-aware remote procedure call layer (and indeed some things in this space are exactly that).
Both of these views are a little limited. Stream processing has nothing to do with SQL, nor is it limited to real-time processing. There is no inherent reason you can't process yesterday's stream, or last month's, using a variety of different languages to express the computation.
I see stream processing as something much broader: infrastructure for continuous data processing. I think the computational model can be as general as MapReduce or other distributed processing frameworks, but with the ability to produce low-latency results.
The real driver for the processing model is the method of data collection. Data that is collected in batch is naturally processed in batch. Data that is collected continuously is naturally processed continuously.
The US census is a good example of batch data collection. The census periodically kicks off and does a brute-force discovery and enumeration of US citizens by having people walk door to door. This made a lot of sense in 1790, when the census began. Data collection at the time was inherently batch oriented: it involved riding around on horseback, writing records down on paper, and transporting that batch of records to a central location where humans added up the counts. These days, when you describe the census process, you immediately wonder why we don't keep a journal of births and deaths and produce population counts either continuously or at whatever granularity is needed.
This is an extreme example, but many data transfer processes still depend on periodic dumps and bulk transfer and integration. The only natural way to process a bulk dump is with a batch process. But as these dumps are replaced with continuous feeds, one naturally moves toward continuous processing, which smooths out the processing resources needed and reduces latency.
LinkedIn, for example, has almost no batch data collection at all. The majority of our data is either activity data or database changes, both of which occur continuously. In fact, when you think about any business, the underlying mechanics are almost always a continuous process; events happen in real time, as Jack Bauer would tell us. When data is collected in batches, it is almost always because of some manual step, a lack of digitization, or a historical relic left over from the automation of some non-digital process. Transmitting and reacting to data used to be very slow when the mechanics were mail and humans did the processing. A first pass at automation always retains the form of the original process, so this often lingers for a long time.
Production "batch" jobs that run daily are often effectively mimicking a continuous computation with a window size of one day. The underlying data is, of course, always changing. These jobs were so common at LinkedIn, and the mechanics of making them work in Hadoop so tricky, that we implemented a whole framework for managing incremental Hadoop workflows.
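To make "incremental workflow" concrete, here is a minimal sketch of the idea (with hypothetical names, not LinkedIn's actual framework): each run checkpoints the log offset it reached, so the next run processes only the newly appended records instead of rescanning the full dump.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an incremental job: an aggregate and a committed
// offset survive between runs, so each run consumes only new records.
public class IncrementalDailyJob {
    private long committedOffset = 0; // in practice, persisted durably between runs
    private long runningCount = 0;    // the aggregate carried across runs

    /** One scheduled run: pick up exactly where the previous run left off. */
    public void runOnce(List<String> log) {
        for (long i = committedOffset; i < log.size(); i++) {
            runningCount++; // update the aggregate incrementally
        }
        committedOffset = log.size(); // checkpoint for the next run
        System.out.println("events processed so far: " + runningCount);
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>(List.of("view:a", "view:b"));
        IncrementalDailyJob job = new IncrementalDailyJob();
        job.runOnce(log);  // first "day": processes 2 events
        log.add("view:c"); // new data arrives in the log
        job.runOnce(log);  // second "day": processes only the 1 new event
    }
}
```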
Seen in this light, it is easy to take a different view of stream processing: it is just processing that includes a notion of time in the underlying data, does not require a static snapshot of that data, and can produce output at a user-controlled frequency instead of waiting for the "end" of the data set to be reached. In this sense, stream processing is a generalization of batch processing and, given the prevalence of real-time data, a very important generalization.
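A toy sketch of that generalization (assumed names, not from any real framework): the aggregation below is identical whether the input is bounded or unbounded; the only "streaming" ingredient is that results are emitted at a caller-chosen frequency rather than once at the end.

```java
import java.util.Iterator;
import java.util.stream.LongStream;

// Toy sketch: one aggregation, two emission policies. With emitEvery equal
// to the input size this behaves like a batch job; with a small emitEvery
// it produces low-latency partial results and never needs the input to end.
public class StreamingSum {
    static void run(Iterator<Long> events, long emitEvery) {
        long sum = 0, seen = 0;
        while (events.hasNext()) {
            sum += events.next();
            if (++seen % emitEvery == 0) {
                System.out.println("partial sum after " + seen + " events: " + sum);
            }
        }
        System.out.println("final sum: " + sum); // reached only for finite input
    }

    public static void main(String[] args) {
        run(LongStream.rangeClosed(1, 10).boxed().iterator(), 3);
    }
}
```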
So why has stream processing traditionally been viewed as a niche application? I personally think the biggest reason is that a lack of real-time data collection made continuous processing an academic concern.
I think the lack of real-time data collection is also what doomed the commercial stream-processing systems. Their customers were still doing file-oriented, daily batch processing for ETL and data integration. Companies building stream processing systems focused on providing engines to attach to real-time data streams, but it turned out that, at the time, very few people actually had real-time data streams. In fact, very early in my career at LinkedIn, a company tried to sell us a very cool stream processing system, but since all our data was collected in hourly files, the best application we could come up with was to pipe the hourly files into the stream system at the end of each hour! They noted that this was a fairly common problem. The exception proves the rule here: finance, the one domain where stream processing has met with some success, was exactly the area where real-time data streams were already the norm and processing had become the bottleneck.
Even in the presence of a healthy batch processing ecosystem, I think the actual applicability of stream processing as an infrastructure style is quite broad. It covers the gap between real-time request/response services and offline batch processing. For modern internet companies, I think around 25% of their code falls into this category.
It turns out that the log solves some of the most critical technical problems in stream processing. The biggest, in my view, is that it makes data available as real-time, multi-subscriber feeds. For those interested in the technical details, we have open sourced Samza, a stream processing system built explicitly on these ideas; many of these applications are described in more detail in its documentation.
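As one illustration, here is roughly what a task looks like in Samza's low-level API (class names as in early Samza releases; treat this as a sketch and check the current documentation for the exact interface). The framework feeds the task messages from an input log one at a time, and anything the task emits goes to another log.

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Sketch of a Samza StreamTask: counts input events and writes the running
// count to an output stream. The output topic name here is hypothetical.
public class PageViewCounterTask implements StreamTask {
    private static final SystemStream OUTPUT =
        new SystemStream("kafka", "page-view-counts"); // hypothetical topic

    private long count = 0;

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        count++; // update state for each event read from the input log
        collector.send(new OutgoingMessageEnvelope(OUTPUT, Long.toString(count)));
    }
}
```

Because the output is itself a log, adding another downstream subscriber requires no change to this task at all; that is exactly the multi-subscriber property described above.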