[Translation and annotations] Kafka Streams introduction: making stream processing easier

Source: Internet
Author: User
Tags: cos, joins, repetition, apache flink, apache mesos, kafka connect, kafka streams

Introducing Kafka Streams: Stream Processing Made Simple

This is an article that Jay Kreps wrote in March to introduce Kafka Streams. At the time, Kafka Streams had not been officially released, so the specific API and features differ from the 0.10.0.0 release (June 2016). But in this short article Jay Kreps lays out many of the design considerations behind Kafka Streams, and it is worth a read.

What follows is not a complete, literal translation of the original, because that would be too tiring. The document is indeed very long, Jay Kreps repeats himself in places, and some passages overlap in meaning, but what he wants to say is easy enough to work out.

I am delighted to announce the preview of a new feature of Kafka: Kafka Streams. Kafka Streams is a Java library for building distributed stream processing applications using Apache Kafka. It is part of the Kafka 0.10 release, and its source code lives under the Apache Kafka project.

A stream processing application built with Kafka Streams looks like this:
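
The code sample from the original post is not reproduced here, so below is a minimal sketch in the same spirit, written against the 0.10.0-era DSL. The topic names, application id, and broker address are placeholders, and method names such as countByKey were renamed in later releases.

    import java.util.Arrays;
    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KStreamBuilder;
    import org.apache.kafka.streams.kstream.KTable;

    public class WordCountSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-sketch");     // placeholder id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder broker
            props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
            props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

            KStreamBuilder builder = new KStreamBuilder();

            // Read lines of text from a Kafka topic (hypothetical topic name).
            KStream<String, String> lines = builder.stream("text-input");

            // Split each line into words, re-key by word, and keep a running count per word.
            KTable<String, Long> counts = lines
                .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
                .map((key, word) -> new KeyValue<>(word, word))
                .countByKey("Counts");

            // Write the continuously updated counts back out to another Kafka topic.
            counts.to(Serdes.String(), Serdes.Long(), "word-counts");

            new KafkaStreams(builder, props).start();
        }
    }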

Note that Kafka Streams is a Java library, not a stream processing framework, which is a significant difference from frameworks such as Storm.

This program differs in some details from the 0.10.0.0 version. Runnable examples for Kafka Streams in Kafka 0.10.0.0 can be found in the examples package of the Kafka Streams project. Note that the examples use lambda expressions, a Java 8 feature.

The structure of KStream reflects its close relationship with Kafka. For example, the elements of its input stream are key/value pairs, and so are those of the output stream, so you need to specify the Serde for the key and the value when constructing the input and output streams. The KStream API offers many higher-order operations, such as map, flatMap, countByKey, and so on; these make up Kafka Streams's DSL.
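
As a rough illustration of the Serde point, constructing an input stream and writing an output stream with explicit Serdes in the 0.10.0-era DSL might look like the lines below, reusing the builder and imports from the sketch above; the topic names are invented.

    // Explicit key and value Serdes when reading from and writing to topics (hypothetical topic names).
    KStream<String, Long> scores = builder.stream(Serdes.String(), Serdes.Long(), "scores-input");
    scores.to(Serdes.String(), Serdes.Long(), "scores-output");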

Although it is just a library, Kafka streams directly solves many of the challenges that you will encounter with streaming:

    • Processes one event at a time (not micro-batches), with millisecond latency
    • Stateful processing, including distributed joins and aggregations
    • A convenient DSL
    • Windowing of out-of-order data using a DataFlow-like model
    • Distributed processing, with a fault-tolerance mechanism for fast failover
    • The ability to reprocess data, so that when your code changes you can recalculate the output
    • No-downtime rolling deployments

For those who want to skip the preamble and read the documentation directly, you can go straight to the Kafka Streams documentation. The purpose of this blog post is to talk less about the "what" (the documentation covers that in detail) and more about the "why".

But what exactly is it?

Kafka Streams is a library for building streaming applications, specifically applications whose input comes from Kafka topics and whose output goes to Kafka topics (or to calls to external services, updates to databases, or elsewhere). It lets you do this in a distributed and fault-tolerant way.

There's a lot of interesting work going on in the stream processing field, including open-source frameworks like Apache Spark, Apache Storm, Apache Flink, and Apache Samza, as well as proprietary services like Google's DataFlow and AWS Lambda. So I need to lay out the similarities and differences between Kafka Streams and these systems.

Frankly, in this ecosystem there is a lot of clutter alongside a lot of innovation from the open source community. We are excited about all these different processing layers: although it can sometimes be a bit confusing, the state of the art really is advancing quickly. We want Kafka to be the right data source for all of these processing layers. The gap we want Kafka Streams to fill is less in the analytics space these frameworks focus on, and more in building core applications and microservices that process streams of data. I'll go into these differences in the next section and start explaining how Kafka Streams makes this kind of application easier.

Hipsters, stream processing, and Kafka

If you want to know whether a system design works well in the real world, the only way to find out is to build it, use it for real applications, and see what's missing. In my earlier work at LinkedIn, I was lucky to be part of the team that designed and built the stream processing framework Apache Samza. We rolled it out for a series of internal applications, supported it in production, and helped open-source it as an Apache project.

So, what did we learn? A lot. One of the key misconceptions we had was that stream processing would be used in a way similar to a real-time MapReduce layer. What we eventually found was that most applications that needed stream processing were actually quite different from the typical Hive or Spark job: they were closer to asynchronous microservices than to a faster version of a batch analytics job.

What do I mean by that? I mean that most stream processing applications are used to implement core business logic, not to analyze the business.

The problem of building this kind of stream processing application is very different from the analytics or ETL problems a typical MapReduce or Spark job needs to solve. These applications go through the same lifecycle as ordinary programs: configuration, deployment, monitoring, and so on. In short, they are more like microservices (I know that word is overloaded) than MapReduce jobs. Kafka takes the place of HTTP requests, feeding event streams to this kind of stream processing application.

Previously, people used Kafka to construct a stream handler with two choices:

1. Develop directly with consumer and producer APIs
2. Adopt a mature Stream processing framework

Both options are inadequate. If you use the Kafka consumer and producer APIs directly, then to implement more complex logic such as aggregations and joins you have to build it yourself on top of those APIs, which is a fair amount of trouble. If you use a stream processing framework, you add a lot of complexity, which makes debugging, performance tuning, and monitoring harder. And if your program has both synchronous and asynchronous parts, you end up splitting your logic between the stream processing framework and whatever mechanism you use to implement the rest of your application.

It isn't always like this, though. For example, if you already have a Spark cluster running batch jobs, adding a Spark Streaming job to it adds only a little extra complexity. But if you have to stand up a dedicated Spark cluster just for one application, that really does add a lot of complexity.

Our position with Kafka, however, is that it should be a basic building block for stream processing, so we want Kafka to give you stream processing without a stream processing framework, and with very little added complexity.

Our goal is to make stream processing simple enough that it becomes a mainstream programming model for building asynchronous services. There are many ways to work toward that, but there are three big aspects I want to discuss in depth in this post:

These three points are important, so the original English is kept alongside the translation.

    1. Making Kafka Streams a fully embedded library with no stream processing cluster: just Kafka and your application.
    2. Fully integrating the idea of tables of state with streams of events and making both of these available in a single conceptual framework.
    3. Giving a processing model that is fully integrated with the core abstractions Kafka provides to reduce the total number of moving pieces in a stream architecture.

    1. Make Kafka Streams a fully embedded library, with no dependency on any stream processing framework.
    2. Tightly combine "tables of state" and "streams of events", making both available within a single conceptual framework.
    3. Provide a processing model fully integrated with Kafka's core abstractions, to reduce the number of moving parts in a streaming architecture.

Each aspect is discussed separately below.

Simplification 1: Stream processing without a framework dependency

The first way Kafka Streams makes it easier to build a stream processing service is that it does not depend on a cluster or a framework; it is just a library (and a small one at that). You only need Kafka and your own code. Kafka coordinates the instances of your program so that they can handle failures, spread load across instances, and rebalance the load when new instances join.

I'll talk about why I think this is important, and about our earlier experience, to help explain the value of this model.

Curing the MapReduce hangover

I mentioned our experience building Apache Samza, and the gap between what people actually wanted (simple stream processing services) and what we built (a real-time MapReduce). I think this mismatch of concepts is widespread; after all, much of what stream processing does is take capabilities from the batch world and apply them to low-latency domains. The same MapReduce heritage affects the other major stream processing platforms (Storm, Spark, and so on) just as it affected Samza.

At LinkedIn, many production data processing services live in the low-latency domain: email, user-generated content, news feed updates, and so on. Many other companies have similar asynchronous services; for example, retailers need to reorder, re-price, and ship goods, and for financial firms real-time data processing is core business. Most of these processes are asynchronous: they do not happen while a web page is being rendered or a mobile app's screen is being updated (those parts are synchronous).

So why is it so cumbersome to build this kind of core application on top of a stream processing framework such as Storm, Samza, or Spark Streaming?

A batch framework, like MapReduce or Spark, needs to solve some hard problems:

    • It must manage many short-lived tasks on top of a pool of machines and efficiently schedule resource allocation within the cluster
    • To do this, it must dynamically package and physically deploy your code, configuration, dependent libraries, and everything else needed, onto the machines that will execute it
    • It must manage processes and provide isolation between the different tasks sharing the cluster

Unfortunately, to solve these problems the framework has to become quite intrusive. To provide fault tolerance and scalability, the framework takes control of how your program is deployed, configured, monitored, and packaged.

So what's the difference between Kafka streams?

Kafka Streams pays much more attention to the narrower set of problems it actually needs to solve. It does the following:

    • When a new instance of your program joins, or an existing instance leaves, it rebalances the partitions to be processed
    • It maintains local table state
    • It recovers from failures

It does this using the same group management protocol that Kafka provides for ordinary consumers. Kafka Streams can keep some local state on disk, but it is just a cache: if the cache is lost, or a program instance is moved elsewhere, the local state can be rebuilt. You use the Kafka Streams library inside your program, start as many instances of the program as you like, and Kafka partitions the work and balances the load over those instances.
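
In concrete terms, there is nothing special to configure for scaling out: every instance runs the same code with the same application id, roughly as in the sketch below (the id and broker address are placeholders), and instances sharing an id divide the input partitions between them.

    // Every instance of the application runs this same code with the same application.id.
    // Instances sharing an application.id form a group; Kafka's group management protocol
    // assigns each one a share of the input partitions and rebalances as instances come and go.
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-stream-app");    // placeholder id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // placeholder broker
    KafkaStreams streams = new KafkaStreams(builder, props);
    streams.start();  // start as many copies of this process as you need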

This is important for implementing simple things like rolling restarts or no-downtime scale-out. In modern software engineering we take these for granted, but many stream processing frameworks do not provide them.

Docker, Mesos, and Kubernetes, oh my!

The reason for separating packaging and deployment from the stream processing framework is that the packaging and deployment space is going through its own renaissance. Kafka Streams can be deployed with classic ops tools such as Puppet, Chef, or Salt, and started from the command line. If you are young and trendy, you can package your program as a Docker image; if you are not that kind of person, you can use a WAR file.

However, for those looking for more flexible management, there are a number of frameworks whose goal is exactly to make deployed programs more dynamic. Here is a partial list:

    • Apache Mesos with a framework like Marathon
    • Kubernetes
    • YARN with something like Slider
    • Swarm from Docker
    • Various hosted container services such as ECS from Amazon
    • Cloud Foundry

That ecosystem is getting almost as much attention as the stream processing ecosystem.

Indeed, the problem Mesos and Kubernetes are trying to solve, spreading processes across many machines, is the same one Storm is trying to solve when it distributes a Storm job across a Storm cluster. The thing is, this problem turns out to be quite hard, and these general-purpose frameworks, at least the good ones, do it much better than the others: they offer rolling restarts, sticky host affinity, real cgroup-based isolation, Docker packaging, fancy UIs, and more.

You can use Kafka Streams inside any of these frameworks, just as you would any other program, and that is an easy way to get dynamic, elastic process management. For example, if you have Mesos and Marathon, you can launch your Kafka Streams program directly from the Marathon UI and then scale it out dynamically without interrupting service: Mesos manages the processes, and Kafka manages the load balancing and the state of your processing tasks.

The overhead of using one of these frameworks is comparable to the cluster management part of a framework like Storm, but the advantage is that all of these frameworks are optional (and, of course, Kafka Streams works fine without them).

Simplification 2: Streams meet tables

The other key way Kafka Streams simplifies applications is by tightly combining the concepts of "table" and "stream". We previewed this idea earlier in "Turning the Database Inside Out". That phrase captures how a system built around stream processing recasts the relationship between an application and its data, and how it should react as the data changes. To explain, I'll step back and give my definitions of "table" and "stream", and then describe how combining the two simplifies common asynchronous applications.

A traditional database is about storing state in tables. Traditional databases do not do well when you need to react to a stream of events. What is an event? An event is just something that has happened: it could be a click, a sale, a reading from a sensor, or, abstractly, anything that happens in the world.

Stream processors like Storm start from the other end of the equation. They are designed to process streams of events, and deriving state from a stream was only added later.

I think the basic problem of asynchronous applications is combining tables that represent the current state of the world with event streams that represent what is happening right now. A framework needs to handle how each is represented and how to convert between the two.

Why are these concepts relevant? Let's take a simple example of a retailer. For a retailer, the core event streams are the sale of goods, the ordering of new stock, and the arrival of shipments. The "inventory" is a table that the sales and shipment streams add to and subtract from to keep the current stock level. Two key stream processing actions for the retailer are reordering goods when inventory starts to run low, and adjusting prices in response to supply and demand.

Tables and streams are dual

Before we dive into stream processing, let's try to understand the relationship between tables and streams. I think it is best captured by this quote from Pat Helland about databases and logs:

The transaction log records all the changes made to the database. High-speed appends are the only way the log changes. From this perspective, the database holds a cache of the latest record values in the log. The truth lives in the log. The database is a cache of a subset of the log. That cached subset happens to be the latest value of each record and index value from the log.

What on earth does that mean? Its meaning actually sits at the heart of the relationship between tables and streams.

Let's start with this question: what is a stream? That one is easy: a stream is a sequence of records. Kafka models a stream as a log, that is, a never-ending sequence of key/value pairs:

    key1 => value1
    key2 => value2
    key1 => value3
    ...

So, what is a table? I think we all know that a table is something like this:

    key1    value1
    key2    value3

The values can have many columns, but we can ignore the details and just think of rows as key/value pairs (adding more columns does not change anything discussed here).

But whereas the stream keeps gaining new records over time, the table is a snapshot of it at a particular point in time. How do tables change? They get updated. A table is really not a single thing, but a sequence of things like this:

    time = 0
    key1    value1

    time = 1
    key1    value1
    key2    value2

    time = 2
    key1    value3
    key2    value2

    ...

But this sequence contains a lot of repetition. If you strip out the unchanged rows and record only the updates, you can represent the table as an ordered sequence of updates:

    put(key1, value1)
    put(key2, value2)
    put(key1, value3)
    ...

Or, since put is the only operation, we can drop it, and we are left with:

    key1 => value1
    key2 => value2
    key1 => value3
    ...

But isn't that just a stream again? This kind of stream is often called a changelog, because it is the sequence of updates, recording the latest value of each record in the order it was updated.

So a table is a particular view over a stream. This may sound strange, but I think this form of a table reflects "what a table really is" just as well as the rectangular picture we carry in our heads. Arguably it is even more natural, because it captures the notion of change over time (think about it: what data really never changes?).

In other words, as Pat Helland points out, a table is the cache of the most recent value of each key in a stream.

In database terminology: a pure stream is one in which all updates are interpreted as INSERT statements (since no record replaces an existing record), while a table is a stream in which all changes are interpreted as UPDATEs (since any existing row with the same key gets overwritten).

This duality has been built into Kafka for some time, in the form of compacted topics.

Table and Stream processing

OK, so that's what streams and tables are. What does this have to do with stream processing? Because, in the end, the relationship between streams and tables is at the heart of the stream processing problem.

I gave the retailer example earlier, in which the combined result of the "shipments arriving" and "goods sold" streams is an inventory table, and changes to the inventory table trigger processing such as "reorder" and "change price".

In this example, the inventory table is certainly not something created inside the stream processing framework; it may already live in a database. Capturing the stream of changes made to a table is called change capture, and databases can do this. The format of the change capture data stream is exactly the changelog format I described above. This kind of change capture is something you can easily do with Kafka Connect, a framework for data capture added in Kafka 0.9.

By building the concept of a table in this way, Kafka Streams lets you derive tables from the streams of changes you receive. In other words, it lets you process a stream of database changes just as you would process a stream of clicks.

You can think of this ability to trigger computation from database changes as something like a database's triggers and materialized views, except that the functionality is not confined to one database or to PL/SQL: it runs at the datacenter level and works against any data source.

Joins and aggregates are tables too

We have just seen how a table can be turned into a stream of updates (a changelog) and processed with Kafka Streams. But the duality of tables and streams also works the other way.

Suppose you have a stream of user clicks and you want to compute the total number of clicks for each user. Kafka Streams lets you compute this aggregation, and the per-user click count you compute is itself a table.
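
A hedged sketch of that aggregation with the 0.10.0-era DSL, reusing the builder and default Serdes from the earlier sketch (the topic names and store name are invented):

    // A stream of click events keyed by user id; the string value type is just a placeholder.
    KStream<String, String> clicks = builder.stream(Serdes.String(), Serdes.String(), "user-clicks");

    // Counting clicks per user yields a KTable: a continuously updated mapping of user -> count.
    KTable<String, Long> clicksPerUser = clicks.countByKey("ClicksPerUser");

    // The table's changelog can itself be written out as a stream of updates.
    clicksPerUser.to(Serdes.String(), Serdes.Long(), "clicks-per-user");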

In the implementation, Kafka Streams stores this stream-derived table in a local database (RocksDB by default, but you can plug in another). The output of the job is the changelog of this table. The changelog serves high availability of the computation: when a task fails and restarts somewhere else, it can pick up from where it left off instead of recomputing everything. The changelog can also be consumed and processed by other Kafka Streams processes, or piped into other systems with Kafka Connect.

This architecture of supporting local storage already appeared in Apache Samza, and I have written about it before from a system architecture point of view. The key innovation Kafka Streams adds over Apache Samza is that the concept of a table is no longer low-level infrastructure but a first-class citizen, just like streams. Streams are represented in the programming DSL provided by Kafka Streams by the KStream class, and tables by the KTable class. They share some common operations and, as the stream/table duality implies, can be converted into one another, but they are different. (The next few sentences are harder to follow; if my interpretation seems off, please see the original text.) For example, when you perform an aggregation on a KTable, Kafka Streams knows that the KTable is, underneath, a stream of updates and processes it accordingly. That is necessary because the semantics of computing a sum over a table that keeps changing are completely different from the semantics of computing a sum over a stream of immutable records. Similarly, the semantics of joining two streams (say, two clickstreams) are completely different from the semantics of joining a stream to a table (say, a clickstream to an accounts table). By modeling these two concepts in the DSL, these details are handled automatically. (I take this passage to mean that Kafka Streams takes into account that a KTable is, underneath, a stream, so it uses aggregation and join methods that differ from those used on a plain table.)
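
For example, a clickstream (a KStream) can be joined against an account table (a KTable) built from a changelog topic. The sketch below reuses the builder from earlier; the topics are invented and the string value types stand in for real click and account records:

    // The account table is read as a changelog-backed KTable (hypothetical topic, simplified types).
    KTable<String, String> accounts = builder.table(Serdes.String(), Serdes.String(), "accounts");
    KStream<String, String> clickstream = builder.stream(Serdes.String(), Serdes.String(), "clicks");

    // Each click is enriched with the current state of that user's account row.
    KStream<String, String> enriched =
        clickstream.leftJoin(accounts, (click, account) -> click + " by " + account);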

Windows and tables

Windows, time, and out-of-order events are another tricky aspect of stream processing. Surprisingly, though, a simple solution falls out of the concept of a table. People who follow stream processing closely will have heard of "event time", which the Google Dataflow team has discussed frequently and very persuasively. The question they grapple with is: how do you do windowed operations if events can arrive out of order? Out-of-order data is unavoidable in most distributed settings, since we cannot guarantee ordering across data generated in different datacenters or on different devices.

In the retailer example, one such windowed computation is counting the number of items sold in a 10-minute window. How do we know when the window is done? How do we know that all the sales events in that period have arrived and been processed? If we cannot be sure, how can we give a final count of the items sold per product? Whatever answer we give based on the count at a given moment may be too early: more events may arrive later and make the earlier answer wrong.

Kafka Streams makes it easy to deal with this problem: the semantics of a windowed aggregation such as a count is the count "so far" for the window. It keeps being updated as new data arrives, and downstream receivers can decide for themselves when the count is complete. And yes, this notion of an updatable count should sound suspiciously familiar: it is nothing but a table, where the window being updated is part of the key. Downstream operations know that this stream represents a table and process the updates as they arrive.
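
A hedged sketch of such a windowed count against the 0.10.0-era DSL, again reusing the builder from earlier (the topic and window names are invented, and the windowing API changed shape in later releases):

    import org.apache.kafka.streams.kstream.TimeWindows;
    import org.apache.kafka.streams.kstream.Windowed;

    // Sales events keyed by item id; count sales per item in 10-minute windows.
    KStream<String, String> sales = builder.stream(Serdes.String(), Serdes.String(), "sales");

    // The result is a table keyed by (item, window); each arriving event updates its window's count.
    KTable<Windowed<String>, Long> salesPerWindow =
        sales.countByKey(TimeWindows.of("SalesPer10Min", 10 * 60 * 1000L), Serdes.String());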

The same mechanism is used to compute and handle windowed aggregations over out-of-order events on top of database change streams, which I find very elegant. This relationship between tables and streams is not something we invented; it was explored in great detail in older stream processing work such as CQL, but the theory never made its way into most real-world systems: databases handle tables, stream processing systems handle streams, and most do not treat both as first-class citizens.

Tables + embeddable library = stateful services

One implication of the features above may not be obvious. I discussed how Kafka Streams lets you transparently maintain tables derived from streams in RocksDB or other local data structures. Because this processing state is physically materialized inside your application, it opens up another exciting possibility: letting your application query that state directly.

We don't expose this interface yet; for now we are focused on stabilizing the streaming API. But I think this is a fascinating architecture for certain kinds of data-intensive applications.

It means you could build, say, a REST service that embeds Kafka Streams and directly queries the local aggregation results computed by the stream processing operations over the data stream. The benefits of this kind of stateful service are discussed here. It is not the right approach everywhere; often you will simply export the results to an external database. But if every request your service handles needs access to a lot of data, keeping that data in local memory or in a fast local RocksDB instance can be very useful.

Simplification 3: Simple is beautiful

Our overriding goal in all of this is to make it easier to build and operate stream processing applications. Our belief is that stream processing should be a mainstream way of building applications, since a large part of what companies do is asynchronous and stream processing is a natural fit for it. But to make that real, we need to make Kafka Streams simpler and lighter to depend on. Part of this operational simplification is getting rid of the dependency on an external cluster, but it simplifies other things as well.

If you look at how people build stream processing applications today, you'll find that beyond the framework itself, the applications tend to be architecturally complex. Here is the architecture diagram of a typical stream processing application.

[Figure: architecture diagram of a typical stream processing application]

There are a lot of moving parts here:

    • Kafka itself
    • A stream processing framework, such as Storm or Spark Streaming or something else, which typically includes a set of master processes plus daemons on each node
    • Your actual stream processing jobs
    • A side database for lookups and aggregation
    • A database that receives the output of the stream processing jobs and serves queries from applications
    • A Hadoop cluster (itself a set of moving parts) to reprocess data
    • A request/response application serving your users' or customers' requests

Not only is this more than most people want, it is often unrealistic to stand all of it up. Even if you already have every part of this architecture, stitching it all together, monitoring it, and getting it to deliver on all its capabilities is very hard.

One of the most pleasing things about Kafka Streams is that its core concepts are few, and they run through the whole system.

We've covered the two big ones: getting rid of the separate stream processing cluster, and fully integrating tables and stateful processing into stream processing itself. With Kafka Streams, that architecture can slim down to this:

But making stream processing simple goes beyond those two points.

Kafka Streams is very small because it builds directly on Kafka's primitives. Its entire code base is under 9,000 lines; if you like, you can read all of it. That means the complexity you take on beyond Kafka's own producers and consumers is easily manageable.

This has a lot of small implications:

    • The input and output are just Kafka topics
    • The data model is Kafka's keyed-record data model throughout
    • The partitioning model is Kafka's partitioning model (this is how "data partitioning" works), and a Kafka partitioner can be used for streams too
    • The group membership mechanism used to manage partitioning, assignment, and liveness is Kafka's group membership mechanism
    • Tables and other stateful computations are backed by log-compacted Kafka topics
    • Metrics are unified across producers, consumers, and the stream processing application, so there is only one kind of metrics to collect for monitoring (that is, all three parts share the same metrics mechanism)
    • Your application's position in the stream is maintained with offsets, just like a Kafka consumer's
    • The timestamps used for windowing are the timestamps added to Kafka itself, which give you event-time based processing

In a nutshell, a Kafka Streams application looks much like any other program written directly against the Kafka producer or consumer, but it is far simpler to write.

Beyond the configuration the Kafka clients already expose, there are very few additional configuration options.

If you change your code and want to reprocess data with the new logic, you don't need a whole separate system. You just rewind your program's Kafka offsets and let it reprocess the data (you could, of course, reprocess on the Hadoop side or elsewhere, but the key point is that you can choose not to).

Whereas the initial example architecture was made up of a set of independent components that only partially work together, we hope you will feel that Kafka, Kafka Connect, and Kafka Streams were designed to work together.

What's next?

As with any preview, there are features we haven't finished yet. Here are some that will be added later.

Queryable state

Using the built-in tables to serve queries against the state of the application.

End-to-end semantics

The current Kafka Streams inherits Kafka's "at least once" message delivery semantics. The Kafka community is exploring how to implement stronger delivery semantics across Kafka Connect, Kafka, Kafka Streams, and other compute engines.

Support for languages other than Java
