The World Beyond Batch: Streaming 101

Source: Internet
Author: User

https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

The first thing this article sets out to do is give the term "streaming" a precise definition.

What is streaming?

The crux of the problem is that many things that ought to be described by what they are (e.g., unbounded data processing, approximate results, etc.), have come to be described colloquially by how they historically have been accomplished (i.e., via streaming execution engines).

The common definition of streaming is imprecise, and that leads us to misunderstand it: for example, to assume that streaming implies low latency, approximation, or a lack of precision.

In short, we confuse what a thing is with how it has historically been accomplished.

The author therefore gives streaming a narrow definition:

I prefer to isolate the term streaming to a very specific meaning: a type of data processing engine that is designed with infinite data sets in mind. Nothing more.

The author also separately defines several terms that often appear alongside streaming:

Unbounded data: A type of ever-growing, essentially infinite data set.
This term describes a characteristic of the data set itself, whereas streaming describes the processing engine.

Unbounded data processing: An ongoing mode of data processing, applied to the aforementioned type of unbounded data.
Calling this "streaming" is at best misleading: repeated runs of batch engines have been used to process unbounded data since batch systems were first conceived.
A batch engine can process unbounded data through repeated runs, and a streaming engine can equally well process bounded data.
So this term is not synonymous with streaming either.

Low-latency, approximate, and/or speculative results:

The author argues that batch engines were simply not designed with low latency in mind; nothing prevents a batch engine from producing low-latency, approximate, or speculative results.
Conversely, a streaming engine can deliver accurate results while still keeping latency low.

So,

From here on out, any time I use the term "streaming," you can safely assume I mean an execution engine designed for unbounded data sets, and nothing more.

What can streaming do?

The recent wave of stream processing started with Storm, created by Twitter's Nathan Marz, which also left streaming with the label of low-latency, inaccurate/speculative results.

To provide eventually correct results, Marz proposed the Lambda Architecture. The architecture looks simple, yet it strikes a balance between consistency and availability.

The drawback is equally obvious: you have to maintain two pipelines, one streaming and one batch, and that cost is high.

The author finds this architecture a bit unsavory.

Unsurprisingly, I was a huge fan of Jay Kreps' Questioning the Lambda Architecture post when it came out.

The next stage came from LinkedIn's Jay Kreps, who proposed the Kappa Architecture, built on Kafka.

This architecture is also simple, but its key idea is to merge the two pipelines into one. More importantly, it replaces the batch pipeline with a well-designed streaming system, which was a great inspiration for the author.

The author's verdict on this architecture: I'm not convinced that notion itself requires a name, but I fully support the idea in principle.

Quite honestly, I'd take things a step further.
I would argue that well-designed streaming systems actually provide a strict superset of batch functionality.

The author argues that streaming is a superset of batch; that is, batch is no longer needed and should be retired.

For streaming to beat batch, it only needs two things:

Correctness: this gets you parity with batch.

Get this right and you are at least on par with batch.

At the core, correctness boils down to consistent storage.
Streaming systems need a method for checkpointing persistent state over time (something Kreps has talked about in his Why local state is a fundamental primitive in stream processing post), and it must be well-designed enough to remain consistent in light of machine failures.
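
The checkpointing idea can be sketched as follows. This is a minimal illustration, not how any system named in the article does it: the class `CheckpointedCounter`, its JSON file layout, and the `checkpoint_every` parameter are all hypothetical. The point it tries to show is that state and input position are saved together, atomically, so a restart neither loses nor double-counts records.

```python
import json
import os
import tempfile


class CheckpointedCounter:
    """Hypothetical per-key counter that checkpoints state and input offset together."""

    def __init__(self, path):
        self.path = path
        self.counts = {}
        self.offset = 0  # index of the next record to process
        if os.path.exists(path):
            # Recover from the last checkpoint after a (simulated) failure.
            with open(path) as f:
                saved = json.load(f)
            self.counts = saved["counts"]
            self.offset = saved["offset"]

    def process(self, records, checkpoint_every=2):
        # Resume from the checkpointed offset so no record is counted twice.
        for i in range(self.offset, len(records)):
            key = records[i]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset = i + 1
            if self.offset % checkpoint_every == 0:
                self.checkpoint()

    def checkpoint(self):
        # Write to a temp file, then atomically replace the old checkpoint,
        # so a crash never leaves a half-written file behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump({"counts": self.counts, "offset": self.offset}, f)
        os.replace(tmp, self.path)
```

If a worker dies between checkpoints, a new `CheckpointedCounter` on the same path reloads the last saved counts and offset and replays only the records after that offset, which assumes the input itself can be re-read from a given position.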

If you're curious to learn more about what it takes to get strong consistency in a streaming system, I recommend you check out the MillWheel and Spark Streaming papers.

Tools for reasoning about time: this gets you beyond batch.

Get this right and you can go beyond batch.

Good tools for reasoning about time are essential for dealing with unbounded, unordered data of varying event-time skew.

This is the focus of the author's discussion: how to deal with unbounded, unordered data.

In practice, we often need to process data by event time rather than by processing time.

In the context of unbounded data, disorder and variable skew induce a completeness problem for event time windows:
Lacking a predictable mapping between processing time and event time, how can you determine when you've observed all the data for a given event time X? For many real-world data sources, you simply can't. The vast majority of data processing systems in use today rely on some notion of completeness, which puts them at a severe disadvantage when applied to unbounded data sets.
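
To make the distinction between event time and processing time concrete, here is a small sketch. The `Event` type, the helper names, and the sample timestamps are all made up for illustration; they only show what "skew" and "disorder" mean for a handful of records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str
    event_time: int       # seconds at which the event actually happened
    processing_time: int  # seconds at which the pipeline observed it


def skew(event):
    """Delay between when an event happened and when it was observed."""
    return event.processing_time - event.event_time


def arrives_in_event_time_order(events):
    """True if observing events by processing time also yields event-time order."""
    by_arrival = sorted(events, key=lambda e: e.processing_time)
    event_times = [e.event_time for e in by_arrival]
    return event_times == sorted(event_times)


# Made-up sample: "c" happened early but was observed last (say, an offline device).
EVENTS = [
    Event("a", event_time=5, processing_time=7),
    Event("b", event_time=12, processing_time=13),
    Event("c", event_time=8, processing_time=21),
]
```

In this sample the skew differs from event to event and the arrival order does not match the event-time order, so after seeing "b" there is no way to tell from the data alone whether everything with an event time below 10 has already arrived.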

This problem is covered in detail in Streaming 102, which is essentially the content of the Dataflow paper.

Data processing patterns

Finally, the author describes the data processing patterns in common use today.

Bounded data

Unbounded data - batch

Fixed windows

Sessions

The difference from the fixed windows above is that artificially partitioning the data into fixed windows can cut a session in two (marked in red in the original article's figure).
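
A short sketch of how fixed windows can split a session. The function names, the 60-second window size, the 10-second inactivity gap, and the timestamps are assumptions chosen for illustration, not values from the article.

```python
from collections import defaultdict


def fixed_windows(timestamps, size):
    """Group timestamps into aligned windows [start, start + size)."""
    windows = defaultdict(list)
    for ts in timestamps:
        windows[ts - ts % size].append(ts)
    return dict(windows)


def sessions(timestamps, gap):
    """Group timestamps into sessions separated by more than `gap` of inactivity."""
    result = []
    for ts in sorted(timestamps):
        if result and ts - result[-1][-1] <= gap:
            result[-1].append(ts)  # still within the current burst of activity
        else:
            result.append([ts])    # inactivity gap exceeded: start a new session
    return result


# Made-up activity for one user: a burst around t=50..65, then one event at t=130.
ACTIVITY = [50, 55, 58, 62, 65, 130]
```

With 60-second fixed windows the burst from 50 to 65 lands in two different windows, while gap-based sessionization keeps it together as a single session.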

Unbounded data - streaming

In reality, unbounded data often has two characteristics. It is:

    • highly unordered with respect to event times, meaning you need some sort of time-based shuffle in your pipeline if you want to analyze the data in the context in which they occurred.
    • of varying event time skew, meaning you can't just assume you'll always see most of the data for a given event time X within some constant epsilon of time Y.

There are several ways to deal with such data:

Time-agnostic

Time-agnostic processing is used in cases where time is essentially irrelevant, i.e., all relevant logic is data driven.

This is the simplest case. Time-independent, stateless operations such as map or filter belong here.

There is little more to say about this scenario; any streaming platform handles it well.
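
As a minimal sketch of time-agnostic processing, the generators below map and filter records one at a time. The record format ("user,status" lines) and the function names are hypothetical. Because each decision depends only on the record itself, the same code works on a bounded list or an endless iterator.

```python
def parse(lines):
    """Map step: turn each raw 'user,status' line into a record."""
    for line in lines:
        user, status = line.strip().split(",")
        yield {"user": user, "status": int(status)}


def only_server_errors(records):
    """Filter step: keep a record based on its own contents, nothing else."""
    for record in records:
        if record["status"] >= 500:
            yield record
```

Neither step looks at timestamps or keeps state between records, which is why ordering and skew do not matter here.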

Approximation algorithms

The second major category of approaches is approximation algorithms, such as approximate top-n, streaming k-means, etc.
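
One way to get a feel for this category is the Misra-Gries summary, a classic approximate frequent-items algorithm. It is used here only as an example of the genre; the article does not say which algorithm it has in mind, and the function name and sample stream below are assumptions.

```python
def misra_gries(stream, k):
    """Approximate heavy hitters using at most k - 1 counters.

    Any item occurring more than n / k times in a stream of length n is
    guaranteed to be among the returned keys. The returned counts are
    underestimates of the true frequencies.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all of them and drop the ones at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Made-up stream in which "a" accounts for half of the 12 items.
STREAM = ["a", "b", "a", "c", "a", "d", "a", "b", "a", "e", "a", "b"]
```

Memory stays bounded by `k - 1` counters no matter how long the stream runs, at the price of inexact counts.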

Windowing by processing time

There are a few nice properties of processing time windowing:

    • It's simple.
    • Judging window completeness is straightforward.
    • If you're wanting to infer information about the source as it is observed, processing time windowing is exactly what you want.
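
The first two properties can be seen in a few lines of code. This is a sketch under assumptions: observations are `(processing_time, value)` pairs, time is in integer seconds, and the function names are made up.

```python
from collections import defaultdict


def window_by_processing_time(observations, size):
    """Bucket (processing_time, value) pairs by the time they were observed."""
    windows = defaultdict(list)
    for processing_time, value in observations:
        windows[processing_time - processing_time % size].append(value)
    return dict(windows)


def is_complete(window_start, size, now):
    """A processing-time window is complete as soon as the clock passes its end."""
    return now >= window_start + size
```

Completeness needs nothing but the local clock, because no record can be observed "in the past" of processing time.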

Windowing by event time

Event time windowing is what you use when you need to observe a data source in finite chunks that reflect the times at which those events actually happened.

It's the gold standard of windowing. Sadly, most data processing systems in use today lack native support for it.

The author regards this approach as the gold standard of windowing. Current systems often lack native support for it because it is hard to implement. This is the author's main contribution, and Streaming 102 describes it in detail.
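
A minimal sketch of the difference, assuming events are plain dictionaries with hypothetical `event_time` and `processing_time` fields (integer seconds) and made-up sample values. The same grouping function is applied to each time field in turn.

```python
from collections import defaultdict


def window_by(events, time_field, size):
    """Bucket event values into aligned windows keyed by the chosen time field."""
    windows = defaultdict(list)
    for event in events:
        ts = event[time_field]
        windows[ts - ts % size].append(event["value"])
    return dict(windows)


# Made-up sample: "c" happened at t=8 but was only observed at t=21.
EVENTS = [
    {"value": "a", "event_time": 5, "processing_time": 7},
    {"value": "b", "event_time": 12, "processing_time": 13},
    {"value": "c", "event_time": 8, "processing_time": 21},
]
```

Grouping by `event_time` puts "c" in the window where it actually happened, next to "a", whereas grouping by `processing_time` places it two windows later, where it was merely observed.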

Of course, powerful semantics rarely come for free, and event time windows are no exception. Event time windows have two notable drawbacks due to the fact that windows must often live longer (in processing time) than the actual length of the window itself:

Buffering: Due to extended window lifetimes, more buffering of data is required.

Completeness: Given that we often have no good way of knowing when we've seen all the data for a given window, how do we know when the results for the window are ready to materialize? In truth, we simply don't.
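
Both drawbacks can be illustrated with a deliberately naive sketch. It assumes a fixed `allowed_delay` heuristic for deciding when a window is "done"; that heuristic, the class `EventTimeBuffer`, and its fields are assumptions made for this example and are not the mechanism the article proposes.

```python
class EventTimeBuffer:
    """Hypothetical buffer that holds each event-time window open for a fixed delay."""

    def __init__(self, size, allowed_delay):
        self.size = size
        self.allowed_delay = allowed_delay  # assumed heuristic, not from the article
        self.open_windows = {}  # window start -> values still being buffered
        self.emitted = {}       # window start -> values materialized as a result
        self.late = []          # values that arrived after their window was emitted

    def add(self, event_time, processing_time, value):
        self._flush(processing_time)
        start = event_time - event_time % self.size
        if start in self.emitted:
            # The window was declared complete, but it was not.
            self.late.append(value)
        else:
            self.open_windows.setdefault(start, []).append(value)

    def _flush(self, now):
        # Buffering: a window lives until its end plus the allowed delay,
        # which is longer in processing time than the window itself.
        for start in sorted(self.open_windows):
            if now >= start + self.size + self.allowed_delay:
                self.emitted[start] = self.open_windows.pop(start)
```

Any fixed delay is a guess: raising it means more buffering and later results, while lowering it means more records end up in `late`.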
