Kafka Introduction

Last Update:2018-02-09 Source: Internet

Author: User


Kafka is useful for building real-time data pipelines and streaming applications.

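Apache Kafka is a distributed streaming platform. What does this mean?

A streaming platform has three key capabilities: it lets you publish and subscribe to streams of records, like a messaging system; it stores those streams durably and with fault tolerance; and it processes streams of records as they occur.

What is Kafka used for?

It is generally used for two broad classes of applications:

Building real-time streaming data pipelines that reliably move data between systems or applications.
Building real-time streaming applications that transform or react to streams of data.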

So how does Kafka implement these tasks?

First, let's look at some concepts:

Kafka runs as a cluster on one or more servers.
The Kafka cluster stores streams of records in categories called topics.
Each record is composed of a key, a value, and a timestamp.
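
The three concepts above can be pictured with a small in-memory model. This is only an illustrative sketch, not Kafka's API: the `Record` class, the `cluster` dictionary, and the `publish` helper are hypothetical names invented for this example.

```python
import time
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """A record as described above: a key, a value, and a timestamp."""
    key: str
    value: str
    timestamp: float = field(default_factory=time.time)


# Toy stand-in for a cluster: topic name -> records published to that topic.
cluster = defaultdict(list)


def publish(topic: str, key: str, value: str) -> Record:
    """Store a new record under the given topic and return it."""
    record = Record(key, value)
    cluster[topic].append(record)
    return record


publish("page-views", "user-1", "/home")
publish("page-views", "user-2", "/cart")
print(len(cluster["page-views"]))
```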

Kafka has four core APIs:

The Producer API allows an application to publish a stream of records to one or more topics.
The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
The Streams API allows an application to act as a stream processor. It consumes an input stream from one or more topics and produces an output stream to one or more topics, effectively transforming the input streams into output streams.
The Connector API allows you to build and run reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.

In Kafka, communication between clients and servers uses a simple, high-performance protocol over TCP.

Topics and Logs

The core abstraction Kafka provides for a stream of records is the topic.

A topic is a category to which records are published. Topics in Kafka are always multi-subscriber: a topic can have zero, one, or many consumers that subscribe to it.

For each topic, the Kafka cluster maintains a partitioned log.

Each partition is an ordered, immutable sequence of records that is continually appended to: a structured commit log. Each record in a partition is assigned a sequential id number called the offset, which uniquely identifies the record within that partition.
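
A partition can be pictured as an append-only list in which a record's offset is simply its position. The sketch below is a toy model under that assumption; the `Partition` class and its methods are hypothetical and do not correspond to any Kafka client API.

```python
class Partition:
    """Append-only sequence; a record's offset is its index in the log."""

    def __init__(self):
        self._log = []

    def append(self, record) -> int:
        """Append a record to the end of the log and return its offset."""
        self._log.append(record)
        return len(self._log) - 1

    def read(self, offset: int):
        """Return the record stored at the given offset."""
        return self._log[offset]

    def __len__(self) -> int:
        return len(self._log)


p = Partition()
offsets = [p.append(value) for value in ("a", "b", "c")]
print(offsets, p.read(1))
```

Because records are only ever appended, an offset handed out once always refers to the same record.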

The Kafka cluster retains all published records, whether or not they have been consumed, for a configurable retention period. For example, if the retention policy is set to 2 days, a record is available for consumption for 2 days after it is published, after which it is discarded to free up space. Kafka's performance is effectively constant with respect to data size, so storing data for a long time is not a problem.
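
The 2-day retention example can be sketched as a time-based cleanup over a log of `(timestamp, value)` pairs. This is a simplified illustration: the `expire` function and `RETENTION_SECONDS` constant are hypothetical names, and a real broker removes whole log segments rather than individual records.

```python
from collections import deque

RETENTION_SECONDS = 2 * 24 * 60 * 60  # the 2-day policy from the example


def expire(log: deque, now: float, retention: float = RETENTION_SECONDS) -> int:
    """Drop records older than the retention period from the head of the log.

    Consumption status plays no role: only the record's age matters.
    Returns the number of records dropped.
    """
    dropped = 0
    while log and now - log[0][0] > retention:
        log.popleft()
        dropped += 1
    return dropped


# Records are (publish_time_in_seconds, value), oldest first.
log = deque([(0.0, "old"), (100_000.0, "mid"), (200_000.0, "new")])
dropped = expire(log, now=200_000.0)
print(dropped, list(log))
```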

In fact, the only metadata retained for each consumer is its offset, or position, in the log. The offset is controlled by the consumer. Normally a consumer advances its offset linearly as it reads records, but because the consumer controls its own position, it can consume records in any order it likes. For example, a consumer can reset to an older offset to reprocess earlier records.
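
The idea that each consumer owns its position can be sketched as follows. The `Consumer` class with `poll` and `seek` methods is a toy model written for this article; the method names are chosen for illustration and the class is not a Kafka client.

```python
class Consumer:
    """Toy consumer that tracks its own position in a shared log."""

    def __init__(self, log):
        self.log = log
        self.position = 0  # offset of the next record to read

    def poll(self, max_records: int = 1):
        """Read up to max_records from the current position and advance."""
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch

    def seek(self, offset: int) -> None:
        """Move to an arbitrary offset, for example to reprocess old records."""
        self.position = offset


log = ["r0", "r1", "r2", "r3"]
fast, slow = Consumer(log), Consumer(log)

first = fast.poll(3)     # reads r0, r1, r2
fast.seek(1)             # rewind to an older offset
replayed = fast.poll(2)  # reads r1, r2 again
other = slow.poll(1)     # the second consumer is unaffected
print(first, replayed, other)
```

The log itself is never modified by reading, which is why one consumer rewinding has no effect on another.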

This combination of features means that Kafka consumers are very cheap: they can come and go without much impact on the cluster or on other consumers.

Log partitions serve several purposes. First, they allow the log to scale beyond a size that fits on a single server. Each individual partition must fit on the server that hosts it, but a topic may have many partitions, so it can handle an arbitrary amount of data. Second, they act as the unit of parallelism.

Distribution

The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for a share of the partitions. For fault tolerance, each partition is replicated across a configurable number of servers.

Each partition has one server acting as the "leader" and zero or more servers acting as "followers". The leader handles all read and write requests for the partition, while the followers passively replicate the leader. If the leader fails, one of the followers automatically becomes the new leader. Each server acts as the leader for some of its partitions and a follower for others, so load is well balanced within the cluster.
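
Leader and follower roles, and what happens when the leader fails, can be sketched with a toy model. The `ReplicatedPartition` class is hypothetical, and it assumes that followers copy every write immediately and that the first surviving replica is promoted; a real cluster replicates asynchronously and runs an election.

```python
class ReplicatedPartition:
    """Toy model of one partition replicated on several brokers."""

    def __init__(self, brokers):
        self.logs = {broker: [] for broker in brokers}
        self.leader = brokers[0]

    def write(self, record) -> None:
        """All writes go to the leader; followers then copy the leader's log."""
        self.logs[self.leader].append(record)
        for broker, log in self.logs.items():
            if broker != self.leader:
                log[:] = self.logs[self.leader]

    def read(self):
        """All reads are served by the leader."""
        return list(self.logs[self.leader])

    def fail(self, broker) -> None:
        """Remove a broker; if it was the leader, promote a surviving follower."""
        del self.logs[broker]
        if broker == self.leader:
            self.leader = next(iter(self.logs))  # assumes a replica survives


part = ReplicatedPartition(["broker-1", "broker-2", "broker-3"])
part.write("m1")
part.write("m2")
part.fail("broker-1")  # the leader goes down
part.fail("broker-2")  # and so does its successor
print(part.leader, part.read())
```

With three replicas, two failures still leave one complete copy of the log to serve reads.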

Producers

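Producers publish data to the topics of their choice. The producer is responsible for choosing which partition within the topic each record is assigned to. This can be done in a round-robin fashion simply to balance load, or according to a partition function based, for example, on a key in the record.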
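
Two ways a producer might pick a partition are sketched below: round-robin, as mentioned above, and hashing the record key so that records with the same key always land in the same partition. Both functions are hypothetical helpers, and `zlib.crc32` is only a stand-in for whatever hash a real client uses.

```python
import itertools
import zlib


def round_robin_partitioner(num_partitions: int):
    """Return a function that cycles through partitions to spread load evenly."""
    counter = itertools.count()
    return lambda key=None: next(counter) % num_partitions


def key_partitioner(num_partitions: int):
    """Return a function that maps equal keys to the same partition."""
    return lambda key: zlib.crc32(key.encode("utf-8")) % num_partitions


rr = round_robin_partitioner(3)
picks = [rr() for _ in range(5)]

by_key = key_partitioner(4)
same = by_key("user-1") == by_key("user-1")
print(picks, same)
```

Round-robin gives the most even load, while key-based assignment keeps all records for one key in a single partition, which matters for ordering.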

Consumers

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can run in separate processes or on separate machines.

If all consumer instances are in the same consumer group, records are effectively load-balanced over the consumer instances.

If all consumer instances are in different consumer groups, each record is broadcast to all consumers.

For example, consider a Kafka cluster of 2 servers hosting 4 partitions (P0-P3) with 2 consumer groups. Consumer group A has two consumer instances and group B has four.
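
The example of 4 partitions shared by a 2-instance group and a 4-instance group can be worked through in a few lines. The `assign` function and the instance names C1 to C6 are assumptions made for this sketch; the real assignment strategy may distribute partitions differently.

```python
def assign(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin.

    Returns a mapping of consumer -> list of partitions it owns.
    """
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]
group_a = assign(partitions, ["C1", "C2"])
group_b = assign(partitions, ["C3", "C4", "C5", "C6"])
print(group_a)
print(group_b)
```

Each group covers all four partitions on its own, so every record reaches both groups, but within a group each partition has exactly one owner.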

More commonly, a topic has a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance.

Kafka only provides a total order over records within a partition, not between different partitions in a topic.

Guarantees

At a high level, Kafka provides the following guarantees:

Messages sent by a producer to a particular topic partition are appended in the order they are sent. That is, if records M1 and M2 are sent by the same producer and M1 is sent first, then M1 has a lower offset than M2 and appears earlier in the log.
A consumer instance sees records in the order they are stored in the log.
For a topic with replication factor N, up to N-1 server failures can be tolerated without losing any records committed to the log.

Kafka as a Messaging System

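Messaging traditionally has two models: queuing (point-to-point) and publish-subscribe. In a queue, a pool of consumers reads from a server and each record goes to one of them. In publish-subscribe, each record is broadcast to all consumers. Each model has a strength and a weakness. Queuing lets you divide processing over multiple consumer instances, but it is not multi-subscriber: once one process reads a record, it is gone. Publish-subscribe lets you broadcast data to multiple processes, but it cannot scale processing, because every message goes to every subscriber.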

The consumer group concept in Kafka generalizes these two models. As with a queue, a consumer group lets you divide processing over a collection of processes (the members of the group). As with publish-subscribe, Kafka lets you broadcast messages to multiple consumer groups.

The advantage of Kafka's model is that every topic has both properties: it can scale processing and it is also multi-subscriber.

Kafka also has stronger ordering guarantees than a traditional messaging system.

A traditional queue retains records in order on the server, and if multiple consumers consume from the queue, the server hands out records in the order they are stored. However, although the server hands out records in order, they are delivered asynchronously to consumers, so they may arrive out of order on different consumers. In other words, the ordering of records is lost under parallel consumption. Messaging systems often work around this with the notion of an "exclusive consumer" that allows only one process to consume from a queue, but that means there is no parallelism in processing.

Kafka does better. By having a notion of parallelism within topics, the partition, Kafka can provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. That consumer is then the only reader of the partition and consumes its data in order. Note that a consumer group cannot usefully have more consumer instances than partitions: any extra instances are assigned nothing.
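
The combination of ordering and load balancing can be illustrated with a small simulation. Everything here is an assumption made for the sketch: the account events, the use of `zlib.crc32` to pick a partition, and the simple rule that partition `p` is owned by consumer `p`.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 2
events = [
    ("acct-1", "open"), ("acct-2", "open"), ("acct-1", "deposit"),
    ("acct-2", "close"), ("acct-1", "close"),
]

# Producer side: records with the same key go to the same partition.
partitions = defaultdict(list)
for key, value in events:
    partitions[zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS].append((key, value))

# Consumer side: three instances in one group, but only two partitions.
consumers = ["c0", "c1", "c2"]
owner = {p: consumers[p] for p in range(NUM_PARTITIONS)}

seen = defaultdict(list)
for p, records in partitions.items():
    for key, value in records:  # the single owner reads the partition in order
        seen[owner[p]].append((key, value))

idle = [c for c in consumers if c not in owner.values()]


def values_for(key):
    """Values observed for one key, in the order its consumer saw them."""
    return [v for records in seen.values() for k, v in records if k == key]


print(values_for("acct-1"), idle)
```

Each account's events are seen in the order they were sent, and the third consumer has nothing to do because there are only two partitions.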

Kafka as a Storage System

Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for in-flight messages. Data written to Kafka is written to disk and replicated for fault tolerance. Kafka allows producers to wait for acknowledgement, so that a write is not considered complete until it is fully replicated and guaranteed to persist.
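
The difference between a write that only reaches the leader and one that waits for full replication can be sketched as follows. The `Replica` class, the `send` function, and the `wait_for_all` flag are hypothetical names for this toy model, which treats replication as synchronous and an offline replica as a failed write.

```python
class Replica:
    """Toy replica that stores records only while it is online."""

    def __init__(self, online: bool = True):
        self.online = online
        self.log = []

    def store(self, record) -> bool:
        if self.online:
            self.log.append(record)
        return self.online


def send(record, leader, followers, wait_for_all: bool = True) -> bool:
    """Return True only when the write counts as complete.

    With wait_for_all=True the write is complete only after every follower
    has stored the record; otherwise the leader's copy is enough.
    """
    if not leader.store(record):
        return False
    if not wait_for_all:
        return True
    results = [follower.store(record) for follower in followers]
    return all(results)


healthy = send("m1", Replica(), [Replica(), Replica()])
degraded = send("m2", Replica(), [Replica(), Replica(online=False)])
leader_only = send("m3", Replica(), [Replica(online=False)], wait_for_all=False)
print(healthy, degraded, leader_only)
```

Waiting for every copy trades latency for durability: the producer learns about an incomplete write instead of assuming it succeeded.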

Kafka's disk structures scale well: it performs the same whether you have 50 KB or 50 TB of persistent data on the server.

You can think of Kafka as a kind of special-purpose distributed file system dedicated to high-performance, low-latency commit log storage, replication, and propagation.

Kafka for Stream Processing

It is not enough to just read, write, and store streams of data; the purpose is to enable real-time processing of streams.

In Kafka, a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.

For example, a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed from this data.
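
The retail example can be sketched as a generator that consumes sale events and emits reorder events. The item names, stock levels, `REORDER_THRESHOLD`, and the `reorder_stream` function are all hypothetical; a real application would read from and write to topics rather than Python lists.

```python
from collections import Counter

REORDER_THRESHOLD = 2  # hypothetical: reorder when stock falls to this level


def reorder_stream(sales, stock):
    """Consume (item, quantity) sale events and yield reorder events.

    A reorder is emitted once, at the moment an item's remaining stock
    crosses from above the threshold to at or below it.
    """
    remaining = Counter(stock)
    for item, quantity in sales:
        before = remaining[item]
        remaining[item] -= quantity
        if before > REORDER_THRESHOLD >= remaining[item]:
            yield (item, "reorder")


sales = [("apple", 3), ("pear", 1), ("apple", 4), ("apple", 1)]
reorders = list(reorder_stream(sales, {"apple": 8, "pear": 10}))
print(reorders)
```

The generator processes one input event at a time and produces output as soon as it is known, which is the essential shape of a stream processor.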

Reference: http://kafka.apache.org/intro

Key points of this article

1. Kafka is a distributed streaming platform.

2. What is Kafka used for?

Building real-time streaming data pipelines
Building real-time streaming applications

3. Basic Concepts

Kafka runs as a cluster. A cluster can be one or more servers.
Records are stored by category, and these categories are called topics. Put simply, data is stored in topics.
Each record is composed of a key, a value, and a timestamp.

4. Core APIs

Producer: publishes records (messages) to one or more topics.
Consumer: subscribes to one or more topics and processes their records.
Streams: a stream processor that consumes input streams from one or more topics and produces output streams to one or more topics.
Connector: builds reusable producers or consumers that connect topics to external applications or data systems.

5. Topics and logs

5.1 A topic is a category to which records are published. Topics are always multi-subscriber: a topic can have zero, one, or many consumers.

5.2 Each topic has a partitioned log. Each partition is an ordered, immutable sequence of records, and new records are continually appended to it.

5.3 Each partition is a structured commit log.

5.4 Each record in a partition is assigned a unique id number, called the offset. Each consumer controls its own offset, that is, its position in the log.

5.5 Kafka stores all published records, whether or not they have been consumed. The retention period is configurable.

5.6 Why partition? Partitions let a topic grow beyond the limits of a single server, and they act as the unit of parallelism.

6. Distribution

6.1 Log partitions are distributed over the servers in the cluster. Each server handles data and requests for its share of the partitions, and each partition is replicated to other servers. The number of replicas is configurable.

6.2 For each partition, one server acts as the "leader" and zero or more servers act as "followers". The leader handles all read and write requests for the partition, while the followers passively replicate the leader. If the leader fails, one of the followers automatically becomes the leader. (Note: a partition is replicated on one or more servers. A server may be the leader for some of the partitions it hosts and a follower for others.)

7. Producer

The producer chooses the topic to publish to and is responsible for choosing which partition within that topic each record is assigned to.

8. Consumers

8.1 Consumers label themselves with a consumer group name. Each consumer group is a logical subscriber of a topic.

8.2 Each record published to a topic is delivered to one consumer instance within each subscribing consumer group.

8.3 Each consumer group consists of multiple consumer instances, and the number of instances is scalable.

9. Guarantee

Messages sent by the same producer to the same partition of the same topic are appended in the order they are sent. If M1 and M2 are sent by the same producer to the same partition and M1 is sent first, the offset of M1 is smaller than that of M2.
Consumers see messages in the order in which they are stored.
With a replication factor of N, no committed record is lost even if N-1 servers go down.

10. Kafka as a messaging system

10.1 Each record in a topic is delivered to exactly one consumer instance in each subscribing consumer group. For example, if two consumer groups subscribe to a topic and each group has three consumer instances, each record is delivered to both groups, and within each group exactly one of the three instances consumes it.

10.2 Following from the first point, if all consumers subscribing to a topic belong to the same group, this is equivalent to the point-to-point queue model; if all the consumers belong to different consumer groups, this is equivalent to the publish-subscribe model.

10.3 Kafka ensures that messages sent by the same producer to the same partition of a topic are stored in the order in which they are sent, and consumers of that partition see the messages in the order in which they are stored.

10.4 By using partitions as the unit of parallelism within a topic, Kafka provides both ordering and load balancing. This works because partitions are assigned to the consumers in a group so that each partition is consumed by only one consumer in that group. That consumer therefore reads the partition's data in order.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
