Apache Kafka Introduction

Source: Internet
Author: User
Tags: server hosting
A streaming platform has three key capabilities:
1. Publish and subscribe to streams of records; in this respect it resembles a message queue or an enterprise messaging system.
2. Store streams of records in a fault-tolerant way.
3. Process streams of records as they occur.
What are the advantages of Kafka?

It is used mainly for two broad classes of application:

1. Building real-time streaming data pipelines that move data reliably between systems or applications.
2. Building real-time streaming applications that transform or react to streams of data.

To understand how Kafka does these things, let's explore Kafka's capabilities from the bottom up.

First, a few concepts:

Kafka runs as a cluster on one or more servers.
A Kafka cluster stores streams of records in categories called topics; each record consists of a key, a value, and a timestamp. (Records are also referred to below as messages.)

Four core APIs:

The Producer API allows applications to publish a stream of records to one or more topics.
The Consumer API allows applications to subscribe to one or more topics and process the stream of records delivered to them.
The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more topics, effectively transforming input streams into output streams.
The Connector API allows you to build and run reusable producers or consumers that connect Kafka topics to existing applications or data systems.

For example, a connector to a relational database might capture every change to a table.

Communication between Kafka clients and servers uses a simple, high-performance, language-agnostic TCP protocol; the protocol is versioned and stays backwards compatible with older versions. The Kafka project provides an official Java client, and clients are available in many other languages.

Topics and Logs

Let's start with topics: a topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; a topic can have zero, one, or many consumers subscribed to its data.
For each topic, Kafka maintains a partition log, as shown in the following figure:

Each partition is an ordered, immutable sequence of records that is continually appended to, forming a structured commit log. Each record in a partition is assigned a sequential ID number called the offset, which uniquely identifies the record within the partition.
The Kafka cluster retains all published records, whether or not they have been consumed, for a configurable retention period. For example, if the retention policy is set to two days, a record is available for consumption for two days after it is published, after which it is discarded to free up space. Kafka's performance is effectively constant with respect to data size, so retaining data for a long time is not a problem.
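As an illustration, the two-day retention policy in the example above could be expressed with the broker's retention setting (the value here is an example, not a recommendation):

```properties
# server.properties: keep records for two days before deletion
log.retention.hours=48
```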

In fact, the only metadata retained on a per-consumer basis is the consumer's offset in the log. The offset is controlled by the consumer: normally a consumer advances its offset linearly as it reads records, but because the position is under the consumer's control, it can consume records in any order it likes. A consumer can reset to an older offset to reprocess old data, or skip ahead to the most recent record and consume only new data.
This combination of features means Kafka consumers are cheap and flexible: they can come and go without much impact on the cluster or on other consumers. For example, you can use the command-line tools to "tail" the contents of any topic without affecting messages being consumed by anyone else.
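To make the offset model concrete, here is a minimal sketch in plain Python (not the Kafka client API; all names are invented for illustration) of a partition as an append-only log with a consumer that owns its own position:

```python
class PartitionLog:
    """Toy model of one Kafka partition: an append-only sequence of records."""
    def __init__(self):
        self.records = []

    def append(self, key, value):
        offset = len(self.records)   # offsets are sequential IDs
        self.records.append((key, value))
        return offset

    def read(self, offset):
        return self.records[offset]

class Consumer:
    """The consumer, not the broker, tracks its position in the log."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        record = self.log.read(self.offset)
        self.offset += 1             # normally advances linearly
        return record

    def seek(self, offset):
        self.offset = offset         # rewind to reprocess, or skip ahead

log = PartitionLog()
for i in range(3):
    log.append(f"k{i}", f"v{i}")

c = Consumer(log)
assert c.poll() == ("k0", "v0")
c.seek(0)                            # reset the offset to replay old data
assert c.poll() == ("k0", "v0")
```

The key design point mirrored here is that the broker stores nothing about consumer progress beyond the offset, which is why adding or removing consumers is cheap.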

The partitioning of the log serves several purposes. First, it allows the log to scale beyond a size that fits on a single server: each individual partition must fit on the servers that host it, but a topic can have many partitions, so it can handle an arbitrary amount of data. Second, partitions act as the unit of parallelism.

Distribution

The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for its share of the partitions. Each partition can be replicated across a configurable number of servers for fault tolerance.
Each partition has one server that acts as the "leader" and zero or more servers that act as "followers". The leader handles all read and write requests for the partition, while the followers passively replicate the leader. If the leader fails, one of the followers automatically becomes the new leader. Each server acts as the leader for some of its partitions and as a follower for others, so load is balanced across the cluster.

Producers

Producers publish data to the topics of their choice, and the producer is responsible for choosing which partition within the topic each record goes to. This can be done round-robin simply to balance load, or according to some semantic partition function (for example, based on a key in the record).

Consumers
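Producer-side partition selection can be sketched as follows. This is an illustrative stand-in, not the real producer: Kafka's default partitioner hashes the key with murmur2, while crc32 is used here only to show the idea of a deterministic key-to-partition mapping, with round-robin for keyless records:

```python
import itertools
import zlib

_round_robin = itertools.count()

def choose_partition(key, num_partitions):
    """Sketch of producer-side partition selection (illustrative only)."""
    if key is not None:
        # deterministic: the same key always maps to the same partition
        # (Kafka really uses murmur2; crc32 is a stand-in here)
        return zlib.crc32(key.encode()) % num_partitions
    # no key: spread records round-robin to balance load
    return next(_round_robin) % num_partitions

# Records with the same key always land in the same partition...
assert choose_partition("user-42", 4) == choose_partition("user-42", 4)

# ...while keyless records rotate over the partitions.
assert [choose_partition(None, 4) for _ in range(4)] == [0, 1, 2, 3]
```

Keyed partitioning is what makes the per-partition ordering guarantee useful: all records for a given key end up in one partition and are therefore consumed in order.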

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can run in separate processes or on separate machines.
If all the consumer instances are in the same group, records are load-balanced evenly over the instances.
If the consumer instances are all in different groups, each record is broadcast to all of them.

Two Kafka servers host four partitions (P0-P3). Two consumer groups subscribe to the topic: group A has two consumer instances and group B has four.
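That layout can be modeled as a simple partition-to-member mapping (a toy model with invented names, not real client code), showing that a record is seen once per group, but by only one member within each group:

```python
# Two groups subscribed to a 4-partition topic, as in the figure:
# group A has 2 members sharing 4 partitions, group B has 4 members, 1 each.
groups = {
    "A": {"a1": [0, 1], "a2": [2, 3]},
    "B": {"b1": [0], "b2": [1], "b3": [2], "b4": [3]},
}

def receivers_of(partition):
    """Which (group, member) pairs receive a record from this partition?"""
    return [(g, m) for g, members in groups.items()
            for m, parts in members.items() if partition in parts]

# A record in partition 2 is delivered exactly once per group:
assert receivers_of(2) == [("A", "a2"), ("B", "b3")]
```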
More commonly, however, topics have a small number of consumer groups, one per "logical subscriber", with each group made up of many consumer instances for scalability and fault tolerance. This is publish/subscribe semantics in which the subscriber is a group of consumers rather than a single process.
Kafka implements consumption by dividing the partitions in the log over the consumer instances, so that each instance is the exclusive consumer of its share of partitions at any point in time. Group membership is maintained dynamically by the Kafka protocol: if a new instance joins the group, it takes over some partitions from the other members; if an instance dies, its partitions are distributed to the remaining instances.
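The rebalancing described above can be sketched with a round-robin assignment function, similar in spirit to Kafka's round-robin assignor, though the real group protocol is far more involved; all names here are illustrative:

```python
def assign(partitions, members):
    """Sketch of round-robin partition assignment: deal partitions out
    over the sorted member names, as evenly as possible."""
    members = sorted(members)
    out = {m: [] for m in members}
    for i, p in enumerate(partitions):
        out[members[i % len(members)]].append(p)
    return out

parts = [0, 1, 2, 3]
assert assign(parts, ["c1", "c2"]) == {"c1": [0, 2], "c2": [1, 3]}

# A new member joins: partitions are rebalanced, each instance owns fewer.
assert assign(parts, ["c1", "c2", "c3"]) == {"c1": [0, 3], "c2": [1], "c3": [2]}

# A member leaves: its partitions fall back to the remaining instances.
assert assign(parts, ["c1"]) == {"c1": [0, 1, 2, 3]}
```

Note that with more members than partitions, some instances would receive no partitions at all, which is why adding consumers beyond the partition count does not increase parallelism.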
Kafka only guarantees ordering of records within a partition; ordering between different partitions of the same topic is not guaranteed. Per-partition ordering, combined with the ability to partition data by key, is sufficient for most applications. If you need a total order over all records, however, you can use a topic with a single partition, though this means at most one consumer instance per group.
