Kafka Basic Introduction

Tags: kafka, streams

Kafka Foundation

Kafka has four core APIs:
    • The Producer API lets an application publish a stream of messages to one or more topics.

    • The Consumer API lets an application subscribe to one or more topics and process the stream of messages produced to them.

    • The Streams API lets an application act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming input streams into output streams.

    • The Connector API allows building and running reusable producers or consumers that connect topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.


Communication between clients and servers uses a simple, high-performance TCP protocol that is independent of the development language. Besides the Java client, clients are available for many other programming languages.

Basic terminology used by Kafka:

Topic

Kafka maintains feeds of messages; each category of messages is called a topic.

Producer

The object that publishes messages to a topic is called a producer (Kafka topic producer).

Consumer

The object that subscribes to topics and processes the published messages is called a consumer.

Broker

Published messages are stored on a set of servers called a Kafka cluster. Each server in the cluster is a broker. Consumers can subscribe to one or more topics and pull data from the brokers to consume the published messages.

Topics and logs (topic and log)

Let's take a deeper look at the topic in Kafka.

A topic is the category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log.


Each partition is an ordered, immutable sequence of messages that is continually appended to. Each message in a partition is assigned a sequential id number called the offset, which uniquely identifies the message within that partition.
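The append-only log with sequential offsets can be sketched in a few lines. This is an illustrative in-memory model, not the real Kafka implementation (which stores segments on disk); the class and method names here are invented for the sketch:

```python
# Toy model of one Kafka partition: an append-only list where a
# message's offset is simply its position in the list.
class Partition:
    def __init__(self):
        self._log = []

    def append(self, message):
        """Append a message and return the offset it was assigned."""
        self._log.append(message)
        return len(self._log) - 1

    def read(self, offset):
        """Read the message stored at a given offset; messages are immutable."""
        return self._log[offset]

p = Partition()
assert p.append("m1") == 0   # first message gets offset 0
assert p.append("m2") == 1   # offsets increase sequentially
assert p.read(0) == "m1"     # earlier messages remain readable, unchanged
```

Because the offset is just a position in an ordered sequence, it doubles as a unique message id within the partition.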

The Kafka cluster retains all published messages until they expire, whether or not they have been consumed. In fact, the only metadata each consumer holds is its offset, its position in the log. The offset is controlled by the consumer: normally a consumer advances its offset linearly as it reads messages, but it can also reset the offset to an older value and reread messages. This design makes consumers cheap to operate: one consumer's actions do not affect how other consumers process the same log.

Back to partitions. Partitioning in Kafka serves several purposes. First, it lets a topic hold more messages than fit on a single server: a topic with many partitions can handle an arbitrary amount of data. Second, partitions act as the unit of parallelism, which is discussed later.
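The idea that the consumer, not the server, owns the offset can be sketched as follows. This is a hypothetical model (the `poll`/`seek` names are borrowed from the real consumer API, but the logic here is a pure in-memory stand-in):

```python
# Toy consumer: the server keeps only the log; the read position
# (offset) lives entirely on the consumer side.
class Consumer:
    def __init__(self, log):
        self.log = log      # messages of one partition, in offset order
        self.offset = 0     # held by the consumer, not by the server

    def poll(self):
        """Return the next message; normally the offset advances linearly."""
        msg = self.log[self.offset]
        self.offset += 1
        return msg

    def seek(self, offset):
        """Rewind (or skip ahead) to reread messages at will."""
        self.offset = offset

log = ["a", "b", "c"]
c = Consumer(log)
assert c.poll() == "a"
assert c.poll() == "b"
c.seek(0)                 # resetting the offset...
assert c.poll() == "a"    # ...replays already-consumed messages
```

Note that seeking changes only this consumer's state; the log itself, and every other consumer's offset, are untouched.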

Distribution (distributed)

The partitions of the log are distributed over the servers in the cluster, with each server handling the partitions assigned to it. Depending on the configuration, each partition can also be replicated to other servers for fault tolerance. Each partition has one leader and zero or more followers. The leader handles all reads and writes for the partition, while the followers passively replicate its data. If the leader fails, one of the followers is elected as the new leader. A server may be the leader for one partition and a follower for others, which balances load and avoids having all requests handled by only one or a few servers.
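The failover behavior can be sketched with a toy election function. This is a deliberate simplification: real Kafka elects the new leader from the in-sync replica set via the controller, whereas here we just pick the first surviving broker in the (hypothetical) replica list:

```python
# Toy leader election for one partition: given the partition's replica
# list and the set of live brokers, the first live replica leads.
def elect_leader(replicas, alive):
    for broker in replicas:
        if broker in alive:
            return broker
    raise RuntimeError("no live replica for partition")

replicas = ["b1", "b2", "b3"]   # b1 is the preferred leader
assert elect_leader(replicas, alive={"b1", "b2", "b3"}) == "b1"
# b1 goes down: a follower is promoted, and reads/writes continue
assert elect_leader(replicas, alive={"b2", "b3"}) == "b2"
```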

Producer (producers)

The producer publishes messages to a topic and is responsible for choosing which partition within the topic each message goes to. The simplest approach is to cycle through the partition list round-robin; a partition can also be chosen by some function of the message, for example by hashing a key. The developer decides which partitioning strategy to use.
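Both strategies just mentioned can be sketched together. The hash here is a toy stand-in (the real Java client's default partitioner hashes the serialized key with murmur2); the function names are invented for the example:

```python
import itertools

def hash_key(key):
    # Toy stable hash; real clients use murmur2 on the serialized key.
    return sum(key.encode())

def make_partitioner(num_partitions):
    rr = itertools.cycle(range(num_partitions))
    def choose(key=None):
        if key is None:
            return next(rr)                     # no key: round-robin
        return hash_key(key) % num_partitions   # keyed: same key -> same partition
    return choose

choose = make_partitioner(3)
assert choose("user-42") == choose("user-42")       # keyed messages stay together
assert {choose(), choose(), choose()} == {0, 1, 2}  # unkeyed spread over all partitions
```

Keying by, say, a user id guarantees that all of that user's messages land in one partition and are therefore consumed in order.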

Consumer (consumers)

Broadly speaking, messaging models fall into two types: queuing and publish-subscribe. In a queue, a pool of consumers reads messages from the server and each message is handled by exactly one of them. In publish-subscribe, messages are broadcast to all consumers. Kafka offers a single consumer abstraction that generalizes both: the consumer group.

Consumers label themselves with a consumer group name. Each message published to a topic is delivered to one consumer instance within each subscribing consumer group. If all the consumers are in the same group, this behaves like the queue model; if every consumer is in its own group, it becomes pure publish-subscribe. More commonly, you create a few consumer groups as logical subscribers, each containing a number of consumer instances for scalability and fault tolerance. For example:

A two-server Kafka cluster hosting four partitions (P0–P3) with two consumer groups: consumer group A has two consumer instances and group B has four.
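The delivery rule — one consumer per group per message — can be sketched directly. This is an illustrative model, with load inside a group simulated as simple round-robin (real Kafka balances by assigning whole partitions to consumers):

```python
import itertools

def deliver(message, groups):
    """Deliver a message to exactly one consumer in every subscribing group.

    groups maps a group name to an itertools.cycle over its consumer names.
    """
    return {g: next(consumers) for g, consumers in groups.items()}

groups = {
    "A": itertools.cycle(["a1", "a2"]),  # two instances share the work
    "B": itertools.cycle(["b1"]),        # a single-instance group sees everything
}
d1 = deliver("m1", groups)
d2 = deliver("m2", groups)
assert d1 == {"A": "a1", "B": "b1"}  # one consumer per group gets m1
assert d2 == {"A": "a2", "B": "b1"}  # group A alternates (queue); B gets every message (pub-sub)
```

With one group you get the queue model; with many single-member groups you get broadcast; the general case mixes both.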

Like a traditional messaging system, Kafka guarantees message ordering, but this deserves a few more words. A traditional queue keeps messages in order on the server, but because messages are delivered asynchronously to multiple consumers, the order in which consumers receive them is not guaranteed: parallel consumption breaks ordering. Anyone who has used a traditional messaging system knows what a headache message ordering is, and the usual workaround of allowing only one consumer defeats the purpose of parallel processing.

Kafka does better here, though it does not solve the problem completely. Kafka takes a divide-and-conquer approach: partitioning. Because each partition of a topic is consumed by exactly one consumer within a consumer group, messages within a partition are processed in order. However, Kafka guarantees ordering only within a partition, not across the partitions of a topic. So if you need totally ordered processing of all messages in a topic, the topic must have only one partition.

Kafka's Guarantee (Guarantees)
• Messages sent by a producer to a particular topic partition are appended in the order they are sent. That is, if messages M1 and M2 are sent by the same producer and M1 is sent first, then M1 has a lower offset than M2 and appears earlier in the log.

    • A consumer instance sees messages in the order they are stored in the log.

    • If a topic is configured with a replication factor of N, up to N-1 server failures can be tolerated without losing any messages that have been committed to the log.

More details on these guarantees are given in the design section of the documentation.

Kafka as a messaging system: how does Kafka's concept of streams compare with a traditional enterprise messaging system?

Traditional messaging has two modes: queuing and publish-subscribe. In queue mode, a pool of consumers reads from the server and each message goes to only one of them; in publish-subscribe mode, messages are broadcast to all consumers. Both models have strengths and weaknesses. The strength of queuing is that it lets multiple consumers divide up processing of the data, so processing can scale. But a queue has no notion of multiple subscribers, and once a message is read it is gone: if the process that read it fails, the message is lost. Publish-subscribe lets you broadcast data to multiple consumers, but there is no way to scale processing, because every subscriber receives every message.

The consumer group in Kafka generalizes both concepts. As a queue, a consumer group lets the consumers sharing one group name divide up the processing. As publish-subscribe, Kafka lets you broadcast messages to multiple consumer groups (with different names).

Every Kafka topic supports both of these modes.

Kafka also has stronger ordering guarantees than a traditional messaging system.

A traditional messaging system stores data in order on the server, and if multiple consumers read from the queue, the server hands out messages in storage order. But even though the server sends them in order, delivery to consumers is asynchronous, so messages may arrive at different consumers out of order. This means that under concurrent consumption, ordering cannot be guaranteed. Messaging systems often work around this by allowing only a single consumer, which gives up parallel processing.

Kafka does this better. Through the partitions of a topic, Kafka provides both ordering guarantees and load balancing. Each partition is consumed by only one consumer in a given consumer group, so that consumer is the partition's sole reader and consumes its data in order. Since a topic has many partitions, load is still balanced across many consumer instances. Note, however, that a consumer group cannot have more consumers than partitions; otherwise the extra consumers sit idle and never receive messages.
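The partition-to-consumer assignment, including the idle-consumer case, can be sketched with a simple round-robin assignor (the real group protocol offers several assignment strategies; this toy function is an invented stand-in):

```python
def assign(partitions, consumers):
    """Round-robin partition assignment within one consumer group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 4 partitions, 6 consumers in one group: two consumers end up idle.
a = assign(["P0", "P1", "P2", "P3"], ["c1", "c2", "c3", "c4", "c5", "c6"])
assert a["c1"] == ["P0"] and a["c4"] == ["P3"]
assert a["c5"] == [] and a["c6"] == []   # more consumers than partitions -> idle
```

Because each partition has exactly one reader in the group, per-partition ordering is preserved while the partitions as a whole are processed in parallel.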

Kafka as a storage system

Any message queue that decouples publishing messages from consuming them is, in effect, acting as a storage system for in-flight messages (published messages are stored first). What distinguishes Kafka from other systems is that it is a very good storage system.

Data written to Kafka is written to disk and replicated across the cluster for fault tolerance, and a producer can wait for an acknowledgement, so that a write is not considered complete until the message is fully replicated.

Kafka's disk structures scale well: it performs the same whether you have 50 KB or 50 TB of data on the server.

Because the client controls its own read position, you can think of Kafka as a special-purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.

Stream processing for Kafka

Just reading, writing, and storing data is not enough; Kafka's goal is real-time stream processing.

In Kafka, a stream processor continually takes data from input topics, performs some processing, and writes the results to output topics. For example, a retail application might take input streams of sales and shipments and output streams of computed quantities or adjusted prices.

Simple processing can be done directly with the producer and consumer APIs. For complex transformations, Kafka provides the more powerful Streams API, which can build applications that perform aggregations or join streams together.

This helps solve the hard problems such applications face: handling out-of-order data, reprocessing input when code changes, performing stateful computations, and more.

The Streams API builds on Kafka's core primitives: it uses the producer and consumer APIs for input and output, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among stream processor instances.
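The consume-update-state-produce loop at the heart of stream processing can be sketched as follows. This is a pure-Python illustration of the idea, not Kafka Streams itself (which keeps its state fault-tolerant via changelog topics in Kafka); the topics are modeled as plain lists of records:

```python
from collections import Counter

def process(input_topic, state):
    """Consume (item, qty) sale records, keep running totals per item,
    and emit each updated total as a record on the output topic."""
    output_topic = []
    for item, qty in input_topic:
        state[item] += qty                    # stateful aggregation
        output_topic.append((item, state[item]))
    return output_topic

state = Counter()
out = process([("apple", 2), ("pear", 1), ("apple", 3)], state)
assert out == [("apple", 2), ("pear", 1), ("apple", 5)]
assert state["apple"] == 5   # state carries over to the next batch of input
```

Because the state survives between calls, the processor can pick up where it left off as new input arrives, which is exactly the "past and future data treated the same way" idea discussed below.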

Putting it together

The combination of messaging, storage, and stream processing may seem unnatural, but it is critical for Kafka as a streaming platform.

A distributed file system such as HDFS stores static files for batch processing; such a system can efficiently store and process historical data from the past.

A traditional enterprise messaging system handles the future messages that arrive after you subscribe: it processes data as it arrives.

Kafka combines these two capabilities, and this combination is critical for Kafka as a streaming application and streaming data pipeline platform.

By combining storage with low-latency subscriptions, a streaming application can treat past and future data the same way: a single application can process the stored history of data and, when it reaches the last message, keep running and wait for future data to arrive, rather than ending.

Similarly, for streaming data pipelines, the subscription to real-time events means Kafka can be used in very low-latency pipelines, while its ability to store data reliably makes it suitable for critical data that must be delivered, and for integration with offline systems that load data only periodically or go down for extended maintenance. Stream processing facilities make it possible to transform data as it arrives.



Here are some common Apache Kafka usage scenarios.

Messaging

Kafka works well as a replacement for a traditional message broker. Messaging systems are used in many scenarios (decoupling from data producers, buffering unprocessed messages, and so on). Compared with most messaging systems, Kafka has better throughput, built-in partitioning, replication, and failover, which makes it well suited to processing messages at large scale.

In our experience, messaging uses are often comparatively low-throughput, but may require low end-to-end latency and strong durability guarantees.

In this domain Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.

Website Activity Tracking

Kafka's original use case was user activity tracking: website activity (page views, searches, and other user actions) is published to central topics, one topic per activity type. These feeds can be processed in real time, used for real-time monitoring, or loaded into Hadoop or an offline data warehouse for offline processing.

Activity tracking volume is very high, since each user page view generates messages.

Metrics

Kafka is also often used for operational monitoring data: aggregating statistics generated by distributed applications into centralized feeds.

Log Aggregation

Many people use Kafka as a replacement for a log aggregation solution.

Stream processing

Many Kafka users process data in pipelines with multiple stages, where raw input data is consumed from Kafka topics and then aggregated, enriched, or otherwise transformed into new topics. For example, a news recommendation pipeline might take article content from an "articles" topic, process it further into new cleaned-up content, and finally recommend articles to users. Such pipelines build graphs of real-time data flows out of individual topics. Starting with version 0.10.0.0, Kafka includes a lightweight but powerful stream processing library, Kafka Streams, for exactly this kind of data processing.

Besides Kafka Streams, alternatives include Apache Storm and Apache Samza.

Event Sourcing

Event sourcing is an application design style in which state changes are recorded as a time-ordered sequence of records. Kafka's support for very large amounts of stored log data makes it an excellent backend for applications built in this style.

Commit Log

Kafka can serve as an external commit log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data. Kafka's log compaction feature supports this usage well. In this usage Kafka is similar to the Apache BookKeeper project.
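Log compaction, which makes the commit-log usage practical, can be sketched simply: for each key, only the most recent value is retained, so a compacted topic stays bounded while still letting a node rebuild its full state. This toy function is an invented illustration, not Kafka's on-disk compactor:

```python
# Toy log compaction: a log of (key, value) records is reduced to
# the latest value per key, ordered by each key's last appearance.
def compact(log):
    latest = {}
    for key, value in log:
        latest[key] = value            # later records shadow earlier ones
    order = []
    for key, _ in log:                 # track the last position of each key
        if key in order:
            order.remove(key)
        order.append(key)
    return [(k, latest[k]) for k in order]

log = [("k1", "v1"), ("k2", "v1"), ("k1", "v2")]
assert compact(log) == [("k2", "v1"), ("k1", "v2")]  # k1's old value is gone
```

Replaying the compacted log from the beginning restores exactly the latest state, which is what a recovering node needs.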





This article is from the "Practical Linux Knowledge and Skills Sharing" blog; please keep this source: http://superleedo.blog.51cto.com/12164670/1893080
