Introduction to Apache Kafka

A streaming platform has three key capabilities:
1. Publish and subscribe to streams of records, similar to a message queue or an enterprise messaging system.
2. Store streams of records in a fault-tolerant, durable way.
3. Process streams of records as they occur.
What is Kafka good for?

It is mainly used for two broad classes of applications:

1. Building real-time streaming data pipelines that reliably move data between systems and applications.
2. Building real-time streaming applications that transform or react to streams of data.

To understand how Kafka does these things, let's explore its capabilities from the bottom up.

First, a few concepts:

Kafka runs as a cluster on one or more servers.
The Kafka cluster stores streams of records in categories called topics. Each record consists of a key, a value, and a timestamp. (Below we refer to these records as messages.)
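As a minimal sketch of the message structure just described (a key, a value, and a timestamp); the `Message` class here is invented for illustration and is not a real client class (the Java client uses `ProducerRecord`/`ConsumerRecord`):

```python
from dataclasses import dataclass
import time

# Toy model of a Kafka record: every message carries a key, a value,
# and a timestamp. Illustrative sketch only, not the real client API.
@dataclass(frozen=True)
class Message:
    key: str
    value: str
    timestamp: float

msg = Message(key="user-42", value="page_view:/home", timestamp=time.time())
```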

Kafka has four core APIs:

The Producer API allows applications to publish a stream of messages to one or more topics.
The Consumer API allows applications to subscribe to one or more topics and process the stream of records delivered to them.
The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more topics, effectively transforming input streams into output streams.
The Connector API allows you to build and run reusable producers or consumers that connect topics to existing applications or data systems.

For example, a connector to a relational database might capture every change to a table.
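To make the producer and consumer roles concrete, here is a toy in-memory stand-in for a cluster; the class and method names are invented for illustration and are not the real Kafka API, which talks to brokers over TCP:

```python
from collections import defaultdict

class ToyBroker:
    """Toy in-memory stand-in for a Kafka cluster: each topic is an
    append-only list of messages. Invented for illustration only."""
    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        # The Producer API role: append a message to a topic.
        self.topics[topic].append(message)

    def read(self, topic, offset):
        # The Consumer API role: read everything from a given offset on.
        return self.topics[topic][offset:]

broker = ToyBroker()
broker.publish("clicks", "home")
broker.publish("clicks", "cart")
print(broker.read("clicks", 0))  # ['home', 'cart']
```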

Kafka clients communicate with the servers using a simple, high-performance, language-agnostic TCP protocol, and newer versions of the protocol remain backward compatible with older ones. The project officially provides a Java client, but clients are available for many other languages.

Topics and Logs

First, the topic: a topic is a category or feed name to which messages are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to it.
For each topic, the Kafka cluster maintains a partitioned log.

Each partition is an ordered, immutable sequence of records that is continually appended to, forming a structured commit log. Each record in a partition is assigned a sequential ID number called the offset, which uniquely identifies the record within that partition.
The Kafka cluster retains all published records, whether or not they have been consumed, for a configurable retention period. For example, if the retention policy is set to two days, a record can be consumed for two days after it is published, after which it is discarded to free up space. Kafka's performance is effectively constant with respect to data size, so storing data for a long time is not a problem.
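The retention behaviour described above can be sketched with a toy partition log; the class and its methods are invented for illustration and do not reflect Kafka's internal storage:

```python
class PartitionLog:
    """Toy append-only partition log with time-based retention."""
    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self.entries = []       # list of (offset, timestamp, value)
        self.next_offset = 0

    def append(self, value, now):
        self.entries.append((self.next_offset, now, value))
        self.next_offset += 1

    def expire(self, now):
        # Discard records older than the retention period. Offsets keep
        # growing: deleting old data never renumbers surviving records.
        cutoff = now - self.retention
        self.entries = [e for e in self.entries if e[1] >= cutoff]

DAY = 24 * 3600
log = PartitionLog(retention_seconds=2 * DAY)  # two-day retention
log.append("a", now=0)
log.append("b", now=3 * DAY)   # published three days later
log.expire(now=3 * DAY)
# "a" has aged out, "b" survives, and "b" still has offset 1
```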

In fact, the only metadata retained per consumer is its offset in the log. This offset is controlled by the consumer: normally a consumer advances its offset linearly as it reads records, but because the position is controlled by the consumer, it can consume records in any order it likes. A consumer can reset to an older offset to reprocess old data, or skip ahead to the most recent record and consume only new data.
This combination of features means that Kafka consumers are lightweight and flexible: they can come and go with little impact on the cluster. For example, you can use the command-line tools to "tail" the messages of any topic without affecting the consumers that are already reading it.
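A toy consumer illustrates consumer-controlled offsets. The real clients expose comparable seek and poll operations, but this class is invented for illustration and a plain list stands in for one partition:

```python
class ToyConsumer:
    """Sketch of consumer-controlled offsets: the consumer's only state
    is its position in the log, which it is free to move."""
    def __init__(self, log):
        self.log = log       # a plain list stands in for one partition
        self.offset = 0

    def poll(self):
        if self.offset < len(self.log):
            record = self.log[self.offset]
            self.offset += 1     # normal case: advance linearly
            return record
        return None              # nothing new yet

    def seek(self, offset):
        self.offset = offset     # rewind to reprocess, or skip ahead

log = ["m0", "m1", "m2", "m3"]
c = ToyConsumer(log)
c.poll(); c.poll()         # reads m0, then m1
c.seek(0)                  # rewind: re-read old data
first_again = c.poll()     # "m0"
c.seek(len(log) - 1)       # skip ahead to the newest record
newest = c.poll()          # "m3"
```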

The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that fits on a single server. Each individual partition must fit on the server that hosts it, but a topic may have many partitions, so it can handle an arbitrary amount of data. Second, partitions act as the unit of parallelism.

Distribution

The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for its share of the partitions. Each partition can be replicated across a configurable number of servers for fault tolerance.
Each partition has one server that acts as the "leader" and zero or more servers that act as "followers". The leader handles all read and write requests for the partition, while the followers passively replicate the leader. If the leader fails, one of the followers automatically becomes the new leader. Each server acts as the leader for some of its partitions and as a follower for others, so load is well balanced across the cluster.

Producers

Producers publish data to the topics of their choice. The producer is responsible for choosing which partition within the topic each record is assigned to. This can be done in a simple round-robin fashion to balance load, or according to a semantic partition function (for example, based on a key in the record).

Consumers

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing group. Consumer instances can run in separate processes or on separate machines.
If all the consumer instances are in the same group, the records are load-balanced over the instances in that group.
If the consumer instances are all in different groups, each record is broadcast to all the consumers.
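These two delivery modes can be simulated with a short sketch. Real Kafka balances whole partitions across group members; this toy version hands out individual records round-robin within each group to keep the idea visible:

```python
from itertools import cycle

def deliver(record_stream, groups):
    """Toy model of Kafka delivery semantics: each record goes to exactly
    one member of every subscribing group (round-robin here for
    simplicity; real Kafka assigns whole partitions to members)."""
    received = {g: {m: [] for m in members} for g, members in groups.items()}
    pickers = {g: cycle(members) for g, members in groups.items()}
    for record in record_stream:
        for g in groups:
            member = next(pickers[g])   # one member per group gets it
            received[g][member].append(record)
    return received

groups = {"A": ["a1", "a2"], "B": ["b1"]}
out = deliver(["r0", "r1", "r2", "r3"], groups)
# Group A load-balances its two members; group B has a single member,
# so b1 sees every record (records are broadcast across groups).
```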

For example, imagine a two-server Kafka cluster hosting four partitions (P0-P3), with two consumer groups: group A has two consumer instances and group B has four.
More commonly, we find that topics have a small number of consumer groups, one per "logical subscriber". Each group is made up of many consumer instances for scalability and fault tolerance. This is nothing more than publish/subscribe semantics in which the subscriber is a group of consumers instead of a single process.
Kafka implements consumption by dividing the partitions of the log over the consumer instances, so that each instance has exclusive use of its "fair share" of partitions at any point in time. Group membership is maintained dynamically by the Kafka protocol: if a new instance joins the group, it takes over some partitions from the other members; if an instance dies, its partitions are redistributed to the remaining instances.
Kafka only guarantees ordering of records within a partition, not between different partitions of the same topic. Per-partition ordering, combined with the ability to partition data by key, is sufficient for most applications. However, if you require a total order over all records, you can use a topic with only one partition, which implies at most one consumer instance per group.
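A sketch of key-based partitioning shows why per-key ordering holds: all records with the same key land in the same partition, and each partition is consumed in order. Here `zlib.crc32` stands in for the hash the real default partitioner uses (murmur2 in the Java client):

```python
import zlib

def partition_for(key, num_partitions):
    """Map a record key to a partition deterministically, so every record
    with the same key lands in the same partition. zlib.crc32 is a
    stand-in hash, not Kafka's actual default partitioner."""
    return zlib.crc32(key.encode()) % num_partitions

NUM_PARTITIONS = 4
events = [("user-1", "login"), ("user-2", "login"), ("user-1", "logout")]
placed = [(partition_for(k, NUM_PARTITIONS), k, v) for k, v in events]
# Both "user-1" events map to the same partition, so their relative
# order (login before logout) is preserved for that user.
```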
