Introduction to Apache Kafka

A streaming platform has three key capabilities:
1. Publish and subscribe to streams of records, similar to a message queue or an enterprise messaging system.
2. Store streams of records in a fault-tolerant, durable way.
3. Process streams of records as they occur.
What is Kafka good for?

It is mainly used for two broad classes of applications:

1. Building real-time streaming data pipelines that reliably move data between systems and applications.
2. Building real-time streaming applications that transform or react to streams of data.

To understand how Kafka does these things, let's explore its capabilities from the bottom up.

First, a few concepts:

Kafka runs as a cluster on one or more servers.
The Kafka cluster stores streams of records in categories called topics. Each record consists of a key, a value, and a timestamp. (Below we refer to these records as messages.)
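As a minimal sketch of the message structure just described (a key, a value, and a timestamp); the `Message` class here is invented for illustration and is not a real client class (the Java client uses `ProducerRecord`/`ConsumerRecord`):

```python
from dataclasses import dataclass
import time

# Toy model of a Kafka record: every message carries a key, a value,
# and a timestamp. Illustrative sketch only, not the real client API.
@dataclass(frozen=True)
class Message:
    key: str
    value: str
    timestamp: float

msg = Message(key="user-42", value="page_view:/home", timestamp=time.time())
```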

Kafka has four core APIs:

The Producer API allows applications to publish a stream of messages to one or more topics.
The Consumer API allows applications to subscribe to one or more topics and process the stream of records delivered to them.
The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more topics, effectively transforming input streams into output streams.
The Connector API allows you to build and run reusable producers or consumers that connect topics to existing applications or data systems.

For example, a connector to a relational database might capture every change to a table.
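To make the producer and consumer roles concrete, here is a toy in-memory stand-in for a cluster; the class and method names are invented for illustration and are not the real Kafka API, which talks to brokers over TCP:

```python
from collections import defaultdict

class ToyBroker:
    """Toy in-memory stand-in for a Kafka cluster: each topic is an
    append-only list of messages. Invented for illustration only."""
    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        # The Producer API role: append a message to a topic.
        self.topics[topic].append(message)

    def read(self, topic, offset):
        # The Consumer API role: read everything from a given offset on.
        return self.topics[topic][offset:]

broker = ToyBroker()
broker.publish("clicks", "home")
broker.publish("clicks", "cart")
print(broker.read("clicks", 0))  # ['home', 'cart']
```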

Kafka clients communicate with the servers using a simple, high-performance, language-agnostic TCP protocol, and newer versions of the protocol remain backward compatible with older ones. The project officially provides a Java client, but clients are available for many other languages.

Topics and Logs

First, the topic: a topic is a category or feed name to which messages are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to it.
For each topic, the Kafka cluster maintains a partitioned log.

Each partition is an ordered, immutable sequence of records that is continually appended to, forming a structured commit log. Each record in a partition is assigned a sequential ID number called the offset, which uniquely identifies the record within that partition.
The Kafka cluster retains all published records, whether or not they have been consumed, for a configurable retention period. For example, if the retention policy is set to two days, a record can be consumed for two days after it is published, after which it is discarded to free up space. Kafka's performance is effectively constant with respect to data size, so storing data for a long time is not a problem.
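The retention behaviour described above can be sketched with a toy partition log; the class and its methods are invented for illustration and do not reflect Kafka's internal storage:

```python
class PartitionLog:
    """Toy append-only partition log with time-based retention."""
    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self.entries = []       # list of (offset, timestamp, value)
        self.next_offset = 0

    def append(self, value, now):
        self.entries.append((self.next_offset, now, value))
        self.next_offset += 1

    def expire(self, now):
        # Discard records older than the retention period. Offsets keep
        # growing: deleting old data never renumbers surviving records.
        cutoff = now - self.retention
        self.entries = [e for e in self.entries if e[1] >= cutoff]

DAY = 24 * 3600
log = PartitionLog(retention_seconds=2 * DAY)  # two-day retention
log.append("a", now=0)
log.append("b", now=3 * DAY)   # published three days later
log.expire(now=3 * DAY)
# "a" has aged out, "b" survives, and "b" still has offset 1
```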

In fact, the only metadata retained per consumer is its offset in the log. This offset is controlled by the consumer: normally a consumer advances its offset linearly as it reads records, but because the position is controlled by the consumer, it can consume records in any order it likes. A consumer can reset to an older offset to reprocess old data, or skip ahead to the most recent record and consume only new data.
This combination of features means that Kafka consumers are lightweight and flexible: they can come and go with little impact on the cluster. For example, you can use the command-line tools to "tail" the messages of any topic without affecting the consumers that are already reading it.
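A toy consumer illustrates consumer-controlled offsets. The real clients expose comparable seek and poll operations, but this class is invented for illustration and a plain list stands in for one partition:

```python
class ToyConsumer:
    """Sketch of consumer-controlled offsets: the consumer's only state
    is its position in the log, which it is free to move."""
    def __init__(self, log):
        self.log = log       # a plain list stands in for one partition
        self.offset = 0

    def poll(self):
        if self.offset < len(self.log):
            record = self.log[self.offset]
            self.offset += 1     # normal case: advance linearly
            return record
        return None              # nothing new yet

    def seek(self, offset):
        self.offset = offset     # rewind to reprocess, or skip ahead

log = ["m0", "m1", "m2", "m3"]
c = ToyConsumer(log)
c.poll(); c.poll()         # reads m0, then m1
c.seek(0)                  # rewind: re-read old data
first_again = c.poll()     # "m0"
c.seek(len(log) - 1)       # skip ahead to the newest record
newest = c.poll()          # "m3"
```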

The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that fits on a single server. Each individual partition must fit on the server that hosts it, but a topic may have many partitions, so it can handle an arbitrary amount of data. Second, partitions act as the unit of parallelism.

Distribution

The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for its share of the partitions. Each partition can be replicated across a configurable number of servers for fault tolerance.
Each partition has one server that acts as the "leader" and zero or more servers that act as "followers". The leader handles all read and write requests for the partition, while the followers passively replicate the leader. If the leader fails, one of the followers automatically becomes the new leader. Each server acts as the leader for some of its partitions and as a follower for others, so load is well balanced across the cluster.

Producers

Producers publish data to the topics of their choice. The producer is responsible for choosing which partition within the topic each record is assigned to. This can be done in a simple round-robin fashion to balance load, or according to a semantic partition function (for example, based on a key in the record).

Consumers

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing group. Consumer instances can run in separate processes or on separate machines.
If all the consumer instances are in the same group, the records are load-balanced over the instances in that group.
If the consumer instances are all in different groups, each record is broadcast to all the consumers.
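These two delivery modes can be simulated with a short sketch. Real Kafka balances whole partitions across group members; this toy version hands out individual records round-robin within each group to keep the idea visible:

```python
from itertools import cycle

def deliver(record_stream, groups):
    """Toy model of Kafka delivery semantics: each record goes to exactly
    one member of every subscribing group (round-robin here for
    simplicity; real Kafka assigns whole partitions to members)."""
    received = {g: {m: [] for m in members} for g, members in groups.items()}
    pickers = {g: cycle(members) for g, members in groups.items()}
    for record in record_stream:
        for g in groups:
            member = next(pickers[g])   # one member per group gets it
            received[g][member].append(record)
    return received

groups = {"A": ["a1", "a2"], "B": ["b1"]}
out = deliver(["r0", "r1", "r2", "r3"], groups)
# Group A load-balances its two members; group B has a single member,
# so b1 sees every record (records are broadcast across groups).
```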

For example, imagine a two-server Kafka cluster hosting four partitions (P0-P3), with two consumer groups: group A has two consumer instances and group B has four.
More commonly, we find that topics have a small number of consumer groups, one per "logical subscriber". Each group is made up of many consumer instances for scalability and fault tolerance. This is nothing more than publish/subscribe semantics in which the subscriber is a group of consumers instead of a single process.
Kafka implements consumption by dividing the partitions of the log over the consumer instances, so that each instance has exclusive use of its "fair share" of partitions at any point in time. Group membership is maintained dynamically by the Kafka protocol: if a new instance joins the group, it takes over some partitions from the other members; if an instance dies, its partitions are redistributed to the remaining instances.
Kafka only guarantees ordering of records within a partition, not between different partitions of the same topic. Per-partition ordering, combined with the ability to partition data by key, is sufficient for most applications. However, if you require a total order over all records, you can use a topic with only one partition, which implies at most one consumer instance per group.
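A sketch of key-based partitioning shows why per-key ordering holds: all records with the same key land in the same partition, and each partition is consumed in order. Here `zlib.crc32` stands in for the hash the real default partitioner uses (murmur2 in the Java client):

```python
import zlib

def partition_for(key, num_partitions):
    """Map a record key to a partition deterministically, so every record
    with the same key lands in the same partition. zlib.crc32 is a
    stand-in hash, not Kafka's actual default partitioner."""
    return zlib.crc32(key.encode()) % num_partitions

NUM_PARTITIONS = 4
events = [("user-1", "login"), ("user-2", "login"), ("user-1", "logout")]
placed = [(partition_for(k, NUM_PARTITIONS), k, v) for k, v in events]
# Both "user-1" events map to the same partition, so their relative
# order (login before logout) is preserved for that user.
```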
