Description
This article is a translation of the Kafka Streams documentation in the Confluent Platform 3.0 release.
Original address: https://docs.confluent.io/3.0.0/streams/index.html
I have read many translations of this document by others, and this is more or less my first translation myself; please point out any places that are not translated well.
This is the fourth article in the Kafka Streams series; the previous articles are:
http://blog.csdn.net/ransom0512/article/details/51971112
http://blog.csdn.net/ransom0512/article/details/51985983
http://blog.csdn.net/ransom0512/article/details/52038548
Kafka Streams simplifies application development by building on Kafka's producer and consumer clients, using their capabilities to provide parallel processing, distributed coordination, fault tolerance, and operational simplicity. In this section, we describe how Kafka Streams works.
The figure above shows the anatomy of an application that uses Kafka Streams; let us walk through the details in the following sections.
1. Processor Topology
A processor topology, or simply topology, defines the computational logic of your stream processing application, i.e., how input data is transformed into output data. A topology is a graph of stream processors (nodes) connected by streams (edges), and it contains two special kinds of processors:
Source processor (Source Processor): A source processor is a special type of stream processor that has no upstream processors. It produces input for the topology by consuming records from Kafka topics and forwarding them to the downstream stream processors.
Sink processor (Sink Processor): A sink processor is a special type of stream processor that has no downstream processors. It accepts records from upstream processors and writes them to a specified Kafka topic.
A stream processing application can define one or more such topologies, although typically only one is defined. Developers can create topologies of the above form either through the low-level Processor API or through the Kafka Streams DSL.
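As an illustration, here is a minimal sketch of a complete topology built with the DSL. It follows the 0.10.0-era API that shipped with Confluent Platform 3.0; the application id, broker address, and topic names are hypothetical:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class SimpleTopology {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "simple-topology-example");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

        KStreamBuilder builder = new KStreamBuilder();

        // Source processor: consumes records from the input topic.
        KStream<String, String> source = builder.stream("input-topic");

        // An ordinary stream processor in between: transforms each record value.
        KStream<String, String> upperCased = source.mapValues(v -> v.toUpperCase());

        // Sink processor: writes the transformed records back to Kafka.
        upperCased.to("output-topic");

        new KafkaStreams(builder, props).start();
    }
}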
A processor topology is merely a logical abstraction that contains your stream processing code. At runtime, the logical topology is instantiated and replicated inside the application for parallel processing.
2. Concurrency Model
2.1. Stream Partitions and Tasks
Kafka Streams uses the concepts of partitions and tasks as the logical units of its concurrency model, and it is closely linked to Kafka in these concepts:
Each stream partition is a totally ordered sequence of data records and maps to a Kafka topic partition. Each data record in the stream maps directly to a Kafka message from that topic. The keys of data records determine the partitioning of data in both Kafka and Kafka Streams, i.e., how data is routed to specific partitions.
An application's processor topology can be scaled by breaking it into multiple tasks. Specifically, Kafka Streams creates a fixed number of tasks based on the number of partitions of the application's input streams, and each task is assigned a list of partitions from the input streams (Kafka topics). This assignment of partitions to tasks is fixed for the lifetime of each task and never changes. A task instance processes its assigned partitions with its own instantiation of the processor topology; Kafka Streams also allocates a buffer for each assigned partition and processes messages one at a time from these buffers. As a result, stream tasks can be processed independently and in parallel without manual intervention.
Note on sub-topologies: If a Kafka Streams application specifies multiple processor topologies, each task will only instantiate one of them (that is, a task corresponds to exactly one topology, never several). In addition, a single processor topology can be decomposed into independent sub-topologies, as long as a sub-topology is not connected to the other processors in the topology; each sub-topology is then treated as a separate topology. This further spreads the workload across tasks.
A very important point is that Kafka Streams is not a resource manager but a library that runs wherever your stream processing application runs; multiple instances of the application can run on the same machine or be distributed across different nodes by a resource manager. The partitions assigned to a task never change: if one instance of the application fails, its tasks are reassigned and restarted on other instances and continue to consume data from the same partitions.
2.2. Threading Model
Kafka Streams allows users to configure the number of threads that process in parallel within an application instance. Each thread independently executes one or more tasks together with their processor topologies.
Starting more stream processing threads or more application instances merely amounts to replicating the topology and having it process a different subset of Kafka partitions, which effectively parallelizes processing. It is important to note that there is no shared state between the threads, so no coordination between threads is needed. This makes it very simple to run topology instances with high concurrency. The assignment of the partitions of Kafka topics among the threads is handled transparently by Kafka Streams, which leverages the coordination capability of the Kafka server.
In summary, scaling a Kafka Streams stream processing application is simple: you only need to launch additional instances of the application, and Kafka Streams takes care of spreading the tasks and automatically assigning partitions to them. You can start as many processing threads as there are partitions in the input Kafka topics, so that each thread handles at least one partition.
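For example, here is a minimal sketch of configuring two processing threads for an instance (the application id is hypothetical; the property names come from StreamsConfig):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-scaled-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Run two stream threads in this instance; Kafka Streams spreads the
// tasks (and hence the input partitions) across them automatically.
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);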
2.3. Example
To better understand the concurrency model of Kafka Streams, let's walk through the following example:
Suppose we have a Kafka Streams application that reads data from two topics, topic A and topic B, each with three partitions. If we now start the application on a single node with the number of threads set to 2, we end up with two Kafka Streams threads, instance1-thread1 and instance1-thread2. Because the maximum number of partitions across input topics A and B is max(3, 3) = 3, Kafka Streams splits the topology into three tasks by default and distributes the six input partitions evenly across them; each task consumes one partition from each topic, i.e., reads from two partitions at a time. Finally, the three tasks are distributed as evenly as possible across the two threads: in this example, thread 1 runs two tasks and consumes data from four partitions, while thread 2 runs one task and consumes data from two partitions.
Now suppose we want to scale out the application because the data volume has grown, and we decide to start a single-threaded instance of the same application as a separate process on another machine. The new thread instance2-thread1 will be created, and the input partitions will be redistributed.
When tasks are reassigned, the affected partitions, together with their tasks and locally stored state, are migrated to the new thread. As a result, Kafka Streams has effectively rebalanced the load of the application.
If we want to add even more instances to the application, we can scale out in this way until the number of running threads equals the number of input partitions. Beyond that point, we would have to increase the number of partitions of topic A and topic B; otherwise the extra instances would sit idle until new partitions are assigned to them. But this is a fairly rare scenario.
3. State
Kafka Streams provides so-called state stores, which stream processing applications can use to save and query data; this is an important capability for stateful stream processing applications. Every task in Kafka Streams embeds one or more state stores that can be written to and queried through APIs. These state stores can be RocksDB databases, in-memory hash maps, or other convenient data structures. Kafka Streams provides fault tolerance and automatic recovery for such local state.
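For instance, here is a minimal sketch of a stateful count using the 0.10.0-era DSL (the store and topic names are hypothetical); the counting operation is backed by a local state store that Kafka Streams creates and manages automatically:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;

KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> clicks = builder.stream(Serdes.String(), Serdes.String(), "clicks-topic");

// "clicks-per-user" names the local state store (RocksDB-backed by default)
// holding the running counts; Kafka Streams also backs it with a changelog topic.
KTable<String, Long> clicksPerUser = clicks.countByKey(Serdes.String(), "clicks-per-user");

// Publish the continuously updated counts to an output topic.
clicksPerUser.to(Serdes.String(), Serdes.Long(), "clicks-per-user-topic");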
4. Fault Tolerance
The fault-tolerance mechanism of Kafka Streams is built on capabilities natively integrated in Kafka. Kafka partitions are highly available and replicated, so when stream data is persisted to Kafka it remains available even if the application fails. Tasks in Kafka Streams achieve fault tolerance by leveraging the failure-handling capability of the Kafka consumer client: if a task runs on a machine that fails, Kafka Streams automatically restarts the task in another running instance of the application.
In addition, Kafka Streams guarantees the reliability of locally stored state. It uses an approach similar to Apache Samza, maintaining for each state store a replicated changelog in a Kafka topic in which it tracks every state update. The changelogs are partitioned per local state store instance, so each task has its own dedicated changelog topic partition. Log compaction is enabled on the changelog topics, so that old data can be safely purged and the changelogs do not grow indefinitely. If a task runs on a machine that fails and is restarted on another machine, Kafka Streams replays the changelog topic to restore the task's state to what it was before the failure. All of this failure handling is completely transparent to the user.
Tip on optimization: To minimize the time spent on state recovery and task (re)initialization, users can configure their applications to keep standby replicas of local state. When a task is migrated, Kafka Streams then tries to assign it to an instance where such a standby replica already exists, which reduces the initialization cost. See the num.standby.replicas configuration property in the development documentation.
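A minimal sketch of enabling one standby replica per task (added to the same Properties object used to configure the application):

// Maintain one standby replica of each task's local state on another instance,
// so that a migrated task can be restored without replaying the whole changelog.
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);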
5. Processing Guarantees
For message processing, Kafka Streams supports at-least-once semantics. This means that if the stream processing application fails, no data is lost or left unprocessed, but some data may be processed more than once.
In most scenarios, at-least-once is acceptable, but certain scenarios may require exactly-once semantics.
In many stream processing applications, at-least-once is perfectly acceptable: as long as message processing is idempotent, it is completely safe for data to be processed more than once. Moreover, some use cases can tolerate duplicate processing even when the operations are not idempotent. For example, when maintaining a blacklist of clicking IP addresses to mitigate a DDoS attack against your infrastructure, some over-counting is acceptable, because the hit rate of a malicious IP participating in the attack is far higher than that of a normally behaving IP address.
In general, however, for non-idempotent operations such as counting, at-least-once semantics may produce incorrect results: if the Kafka Streams application fails and restarts, it may re-process some data that was already processed shortly before the failure. We are planning to address this limitation and support exactly-once processing semantics.
6. Timestamp-Based Flow Control
Kafka Streams regulates the progress of streams by synchronizing on the timestamps of the data records across all input streams. By default, Kafka Streams provides event-time processing semantics. This is particularly important when an application processes a large backlog of historical data from multiple streams. For example, a user may want to reprocess historical data after the business logic has changed, such as after a bug fix. Fetching the data from Kafka is easy, but without proper flow control, the processing of data across topic partitions can get out of sync and produce incorrect results.
As mentioned in the concepts section, every data record in Kafka Streams is associated with a timestamp. Based on the timestamps of the records in its buffers, a stream task decides which of its assigned input partitions to process next. However, Kafka Streams does not reorder records within a single stream during processing, since reordering would break the delivery semantics of Kafka and make it difficult to recover from failures. This flow control is therefore best-effort and cannot always be enforced precisely; in fact, to strictly enforce execution ordering, one would either have to wait until the system has received all records (which can be quite impractical in actual use) or inject additional information about timestamp boundaries or heuristic estimates, such as MillWheel's watermarks.
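The record timestamps that drive this flow control can be customized by plugging in a TimestampExtractor; in the 0.10.0-era API, the interface exposes a single extract method over the consumer record. Below is a minimal sketch (the class name is hypothetical):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class SafeTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record) {
        long ts = record.timestamp();
        // Records written by pre-0.10 producers carry no timestamp (-1);
        // fall back to processing time in that case.
        return ts >= 0 ? ts : System.currentTimeMillis();
    }
}

The extractor is registered through the timestamp.extractor property, e.g. props.put(StreamsConfig.TIMESTAMP_EXTRACTOR_CLASS_CONFIG, SafeTimestampExtractor.class);.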
7. Back Pressure
Kafka Streams does not use a back-pressure mechanism, because it does not need one. A message consumed from Kafka is processed through the whole topology (or sub-topology) and written back to Kafka before the next message is processed. Thus, no messages are buffered between two stream processors. In addition, Kafka Streams uses the Kafka consumer client, which is based on a pull model, so downstream processors control the speed at which input data is read.
The same applies to a topology that contains multiple independent sub-topologies, whose messages are processed independently. For example, the following code defines two independent sub-topologies:
// Sub-topology 1 ends by writing its output to "my-topic" ...
stream1.to("my-topic");
// ... and sub-topology 2 starts by reading from that topic.
stream2 = builder.stream("my-topic");
Any data exchange between sub-topologies happens through Kafka; there is no direct data exchange between the sub-topologies themselves. For this reason, no back-pressure mechanism is needed in this case either.