Apache Kafka Official Documentation (translated)

Source: Internet
Author: User
Tags: message queue, stream, API

Apache Kafka is a distributed streaming platform. What exactly does that mean?
We think of a streaming platform as having three key capabilities:
1. It lets you publish and subscribe to streams of records. In this respect it is much like a message queue or an enterprise messaging system.
2. It lets you store streams of records in a fault-tolerant way.
3. It lets you process streams of records as they occur.

What is Kafka good at?
It is commonly used for two broad classes of applications:
1. Building reliable, real-time streaming data pipelines that move data between systems or applications.
2. Building real-time streaming applications that transform or react to streams of data.

To understand how Kafka does these things, let's dive in and explore its capabilities from the bottom up.

Start with a few concepts:
1. Kafka runs as a cluster on one or more servers.
2. The Kafka cluster stores streams of records in categories called topics.
3. Each record consists of a key, a value, and a timestamp.

Kafka has the following four core APIs:
1. The Producer API allows an application to publish a stream of records to one or more Kafka topics (a minimal sketch follows this list).
2. The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
3. The Streams API allows an application to act as a stream processor, consuming input streams from one or more topics and producing output streams to one or more topics, effectively transforming inputs into outputs.
4. The Connector API allows building and running reusable connectors that link Kafka topics to existing applications and data systems as producers or consumers. For example, a connector to a relational database might capture every change to a table.
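For concreteness, here is a minimal sketch of the Producer API using the official kafka-clients Java library; the broker address and the topic name "my-topic" are assumptions for illustration, not part of the original text:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address; replace with your cluster's bootstrap servers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one (key, value) record to the topic "my-topic".
            producer.send(new ProducerRecord<>("my-topic", "key-1", "hello kafka"));
        }
    }
}
```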


In Kafka, communication between clients and servers is done with a simple, high-performance, language-agnostic TCP protocol. The protocol is versioned and maintains backwards compatibility with older versions. We provide a Java client for Kafka, but clients are available in many languages.

Topics and Logs

First, let's dive into the core abstraction Kafka provides for a stream of records: the topic.

A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; a topic can have zero, one, or many consumers that subscribe to the data written to it.

For each topic, the Kafka cluster maintains a partitioned log that looks like this:

[Figure: anatomy of a partitioned log, with records appended to the end of each partition]
Each partition is an ordered, immutable sequence of records that is continually appended to, forming a structured commit log. The records in a partition are each assigned a sequential ID number called the offset, which uniquely identifies each record within the partition.
The Kafka cluster retains all published records, whether or not they have been consumed, for a configurable retention period. For example, if the retention policy is set to two days, then for two days after a record is published it remains available for consumption, after which it is discarded to free up space. Kafka's performance is effectively constant with respect to data size, so storing data for a long time is not a problem.
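To make the two-day example concrete, here is a hedged sketch that creates a topic with topic-level retention set to two days via the Java AdminClient; the topic name, partition count, and replication factor are illustrative assumptions:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class RetentionSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Two days of retention: 2 * 24 * 60 * 60 * 1000 ms = 172,800,000 ms.
            NewTopic topic = new NewTopic("my-topic", 3, (short) 1)
                    .configs(Map.of("retention.ms", "172800000"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```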

In fact, the only metadata retained on a per-consumer basis is the offset, or position, of that consumer in the log. This offset is controlled by the consumer: normally a consumer advances its offset linearly as it reads records, but, since the position is controlled by the consumer, it can in fact consume records in any order it likes. For example, a consumer can reset to an older offset to reprocess past data, or skip ahead to the most recent record and start consuming from "now".
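Here is a hedged sketch of a consumer controlling its own position with the Java client. It uses assign rather than subscribe so the application owns the positioning; the topic, partition, and group name are illustrative:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "replay-example");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("my-topic", 0);
            consumer.assign(Collections.singletonList(partition));

            // Reset to the oldest retained offset to reprocess past data...
            consumer.seekToBeginning(Collections.singletonList(partition));
            // ...or skip ahead and consume only from "now":
            // consumer.seekToEnd(Collections.singletonList(partition));

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```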

This combination of features means that Kafka consumers are very lightweight: they can come and go without much impact on the cluster or on other consumers. For example, you can use our command-line tools to "tail" the contents of any topic without changing what is consumed by existing consumers.

The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that fits on a single server. Each individual partition must fit on the server that hosts it, but a topic may have many partitions, so it can handle an arbitrary amount of data. Second, they act as the unit of parallelism.

Distribution
The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.

Each partition has one server that acts as the leader and zero or more servers that act as followers. The leader handles all read and write requests for the partition, while the followers passively replicate the leader. If the leader fails, one of the followers automatically becomes the new leader. Each server acts as the leader for some of its partitions and as a follower for others, so load is well balanced across the cluster.

Producers
The producer publishes data to the topic of its choice. The producer is responsible for choosing which partition within the topic each record is assigned to. This can be done in a simple round-robin fashion for load balancing, or according to a semantic partition function (say, based on some key in the record).
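A hedged sketch of both partitioning styles with the Java client; the broker address, topic, and keys are illustrative assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitioningSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // No key: the client balances records across partitions
            // (round-robin, or sticky batching in newer client versions).
            producer.send(new ProducerRecord<>("my-topic", null, "load-balanced record"));

            // With a key: the key is hashed to pick the partition, so all records
            // with the same key land in the same partition (a semantic partition function).
            producer.send(new ProducerRecord<>("my-topic", "user-42", "keyed record"));
        }
    }
}
```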

Consumers
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group.
Consumer instances can be in separate processes or on separate machines.

If all the consumer instances have the same consumer group, then the records are effectively load balanced over the consumer instances.
If all the consumer instances have different consumer groups, then each record is broadcast to all the consumer processes.
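A hedged sketch of one consumer-group member with the Java client: start several copies with the same group.id to load balance the topic's partitions among them, or change the group.id to receive the full broadcast (all names are illustrative):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupMemberSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Instances sharing this group.id split the topic's partitions between them;
        // a consumer with a different group.id receives its own copy of every record.
        props.put("group.id", "group-a");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```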

[Figure] A two-server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.

More commonly, we find that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is simply publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.

The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances, so that each instance is the exclusive consumer of a "fair share" of the partitions at any point in time. The process of maintaining membership in the group is handled dynamically by the Kafka protocol. If new instances join the group, they take over some partitions from other members of the group; if an instance dies, its partitions are distributed to the remaining instances.
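With the Java client, this dynamic hand-off can be observed through a rebalance listener. A hedged sketch, assuming a KafkaConsumer configured as in the group example above; the topic name is illustrative:

```java
import java.util.Collection;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

class RebalanceLoggingSketch {
    // Subscribes with a listener that reports partitions as the group reassigns them.
    static void subscribeWithListener(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("my-topic"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // Called when this instance must give partitions back to the group,
                // for example because a new instance joined.
                System.out.println("Revoked: " + partitions);
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Called when the group protocol hands this instance its "fair share".
                System.out.println("Assigned: " + partitions);
            }
        });
    }
}
```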

It is important to note that there cannot be more consumer instances in a consumer group than there are partitions.

Kafka only provides a total order over records within a partition, not between different partitions in the same topic. Per-partition ordering, combined with the ability to partition data by key, is sufficient for most applications.

However, if your application requires a total order over all messages, this can still be achieved in a special way: use a topic with only one partition, keeping in mind that this also means at most one consumer process per consumer group.
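A hedged sketch of creating such a fully ordered, single-partition topic with the Java AdminClient; the topic name and replication factor are illustrative:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class OrderedTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // One partition means one total order over all records in the topic;
            // it also caps each consumer group at one active consumer.
            NewTopic ordered = new NewTopic("ordered-topic", 1, (short) 3);
            admin.createTopics(Collections.singleton(ordered)).all().get();
        }
    }
}
```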

Guarantees
(1) Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if message M1 is sent by the same producer as message M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
(2) A consumer instance sees records in the order they are stored in the log.
(3) For a topic with replication factor N, up to N-1 server failures can be tolerated without losing any records committed to the log.
