Introduction to the Kafka Messaging System

Last Update:2015-05-10 Source: Internet

Author: User



1. Overview

Kafka is a messaging system that LinkedIn open-sourced in December 2010, used primarily to process active streaming data. Such data is very common in website applications: page views (PV), which pages users visited, what they searched for, and so on. It is usually recorded as logs, and statistics are then computed over those logs at regular intervals.

Traditional log analysis systems provide a scalable way to process log data offline, but they usually introduce a large delay when used for real-time processing. Existing message (queuing) systems, on the other hand, handle real-time or near-real-time applications well, but they usually do not write unconsumed data to disk, which can be a problem for offline applications such as Hadoop that consume data only once an hour or once a day. Kafka is designed to solve both problems: it serves offline and online applications equally well.

2. Design Objectives

(1) The cost of accessing data on disk is O(1). Data on disk is commonly organized as a B-tree, for which the access cost is O(log N).

(2) High throughput. It can handle hundreds of thousands of messages per second, even on commodity nodes.

(3) Explicit distribution: there are multiple producers, brokers, and consumers, and all of them are distributed.

(4) Support for loading data into Hadoop in parallel.
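Objective (1) can be illustrated with a toy append-only log in which a message's offset is simply its byte position in the file, so a read needs one seek instead of an index traversal. This is a minimal sketch, not Kafka's actual storage format: the class `AppendOnlyLog`, the 4-byte length prefix, and the file name `topic.log` are all assumptions made for illustration.

```python
import os
import tempfile


class AppendOnlyLog:
    """Toy log file: the offset of a message is its byte position (hypothetical format)."""

    def __init__(self, path):
        self.path = path
        open(path, "ab").close()  # create the file if it does not exist yet

    def append(self, payload):
        """Append one length-prefixed message and return its offset."""
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)
            f.write(len(payload).to_bytes(4, "big") + payload)
        return offset

    def read(self, offset):
        """Read the message stored at `offset`; return (payload, next_offset)."""
        with open(self.path, "rb") as f:
            f.seek(offset)  # a single seek, independent of how many messages exist
            size = int.from_bytes(f.read(4), "big")
            payload = f.read(size)
        return payload, offset + 4 + size


log = AppendOnlyLog(os.path.join(tempfile.mkdtemp(), "topic.log"))
first = log.append(b"page_view:/home")
second = log.append(b"search:kafka")
payload, next_offset = log.read(first)
```

Reading a message also yields the offset of the next one, so a sequential consumer never needs a lookup structure at all.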

3. Kafka Deployment Structure

Kafka has an explicitly distributed architecture: there can be multiple producers, brokers (Kafka servers), and consumers. Kafka acts like a cache that sits between the active data and the offline processing systems. Several basic concepts:

(1) The message is the basic unit of communication. Each producer can publish messages to a topic. If consumers subscribe to that topic, newly published messages are broadcast to them.

(2) Kafka is explicitly distributed: multiple producers, consumers, and brokers can run on a large cluster and present themselves to the outside world as one logical whole. Multiple consumers can form a group, and each message is delivered to only one consumer within a group.
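The delivery rule in (1) and (2), where every subscribed group sees each message but only one member of a group receives it, can be sketched in memory as follows. The `Topic` class and the group names are hypothetical, and the per-message round-robin choice is an assumption made to keep the example short; it is not a description of how Kafka itself assigns work inside a group.

```python
from collections import defaultdict


class Topic:
    """In-memory stand-in for a topic with consumer groups (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.groups = defaultdict(list)     # group name -> list of consumer inboxes
        self.next_member = defaultdict(int) # group name -> round-robin counter

    def subscribe(self, group):
        """Add a consumer to `group` and return its inbox."""
        inbox = []
        self.groups[group].append(inbox)
        return inbox

    def publish(self, message):
        """Deliver the message once per group, to exactly one member of each group."""
        for group, inboxes in self.groups.items():
            index = self.next_member[group] % len(inboxes)
            inboxes[index].append(message)
            self.next_member[group] += 1


topic = Topic("page_views")
hadoop = topic.subscribe("hadoop")        # a group with a single consumer
realtime_a = topic.subscribe("realtime")  # a group with two consumers
realtime_b = topic.subscribe("realtime")

for i in range(4):
    topic.publish(f"m{i}")
```

After the loop, `hadoop` holds all four messages, while `realtime_a` and `realtime_b` hold two each, so the `realtime` group as a whole has also seen every message exactly once.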

4. Kafka Key Technical Points

(1) Zero-copy

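Two things can make a messaging system such as Kafka inefficient: 1) too many network requests and 2) too many byte copies. To reduce the number of requests, Kafka groups messages into batches, and each request sends a batch of messages to the corresponding consumer. To reduce byte copies, it uses the sendfile system call. To understand how sendfile helps, first consider the traditional way of sending a file over a socket, in which the data is copied four times: from disk into the kernel page cache, from the page cache into a user-space buffer, from the user-space buffer into a kernel socket buffer, and from the socket buffer to the network interface.

With the sendfile system call, the kernel moves the data from the page cache toward the network interface directly, so the two copies through user space disappear.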
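The two paths can be compared with a short sketch. It uses Python's `socket.sendfile()`, which relies on `os.sendfile()` where the platform provides it and otherwise falls back to ordinary `send()` calls. The helper names `send_with_copies` and `send_zero_copy`, the sample payload, and the use of a local socket pair in place of a real network connection are assumptions for illustration.

```python
import os
import socket
import tempfile


def send_with_copies(sock, path, chunk_size=4096):
    """Traditional path: read() into a user-space buffer, then send() it."""
    sent = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # kernel -> user-space copy
            if not chunk:
                break
            sock.sendall(chunk)         # user-space -> kernel copy
            sent += len(chunk)
    return sent


def send_zero_copy(sock, path):
    """Let the kernel move file bytes to the socket without a user-space buffer."""
    with open(path, "rb") as f:
        return sock.sendfile(f)


# Write a small "message batch" to a temporary file.
payload = b"message-batch " * 64
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(payload)

left, right = socket.socketpair()
sent_copy = send_with_copies(left, path)
sent_zero = send_zero_copy(left, path)
left.close()

# Drain the receiving end until the sender's close is seen.
received = b""
while True:
    data = right.recv(4096)
    if not data:
        break
    received += data
right.close()
os.remove(path)
```

Both functions deliver the same bytes; the difference is only in how many times the data crosses the boundary between kernel and user space on the sending side.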

(2) Exactly-once message transfer

How is the processing state of each consumer recorded? Kafka saves only the offset up to which each consumer has processed data. This has two benefits: 1) very little data needs to be saved, and 2) when a consumer fails, the restarted consumer simply resumes processing from the most recently saved offset.
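The offset bookkeeping described above can be sketched with a small in-memory model. The `Broker` class and its `fetch` and `commit` methods are hypothetical names chosen for this example and are not Kafka's client API; the point is only that one integer per consumer is enough to resume after a failure.

```python
class Broker:
    """In-memory stand-in that keeps a log and one saved offset per consumer."""

    def __init__(self):
        self.log = []        # messages in arrival order
        self.committed = {}  # consumer name -> offset of the next message to read

    def append(self, message):
        self.log.append(message)

    def fetch(self, consumer, max_messages=2):
        """Return (start_offset, messages), starting at the consumer's saved offset."""
        start = self.committed.get(consumer, 0)
        return start, self.log[start:start + max_messages]

    def commit(self, consumer, offset):
        """Record that everything before `offset` has been processed."""
        self.committed[consumer] = offset


broker = Broker()
for i in range(5):
    broker.append(f"event-{i}")

# First batch: processed and committed.
start, batch = broker.fetch("stats")
broker.commit("stats", start + len(batch))

# Second batch: the consumer "crashes" before committing.
_, lost_batch = broker.fetch("stats")

# After a restart, the consumer resumes from the last saved offset.
resume_offset, resumed_batch = broker.fetch("stats")
```

In this sketch the uncommitted batch is fetched again after the restart, so a consumer that must not process a message twice would need to save the offset together with its results.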

(3) Push/pull

Producers push data to Kafka, and consumers pull data from Kafka.

(4) Load balancing and fault tolerance

There is no load balancing mechanism between producers and brokers.
ZooKeeper is used for load balancing between brokers and consumers. All brokers and consumers register themselves in ZooKeeper, which stores some of their metadata. If any broker or consumer changes, all other brokers and consumers are notified.
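The register-and-notify behavior can be sketched with a plain in-memory registry. This is a stand-in for the idea only: the `Registry` class, its methods, and the node names are hypothetical and do not correspond to the ZooKeeper client API.

```python
class Registry:
    """In-memory stand-in for the coordination role described above."""

    def __init__(self):
        self.members = {}   # node id -> metadata
        self.watchers = {}  # node id -> callback(event, changed_node)

    def register(self, node_id, metadata, on_change):
        self.members[node_id] = metadata
        self._notify("joined", node_id)     # tell the nodes that are already present
        self.watchers[node_id] = on_change

    def unregister(self, node_id):
        del self.members[node_id]
        del self.watchers[node_id]
        self._notify("left", node_id)       # tell the nodes that remain

    def _notify(self, event, changed):
        for node_id, callback in self.watchers.items():
            if node_id != changed:
                callback(event, changed)


events = []


def make_watcher(name):
    """Build a callback that records which node saw which change."""
    return lambda event, changed: events.append((name, event, changed))


registry = Registry()
registry.register("broker-1", {"host": "10.0.0.1"}, make_watcher("broker-1"))
registry.register("consumer-a", {"group": "stats"}, make_watcher("consumer-a"))
registry.register("consumer-b", {"group": "stats"}, make_watcher("consumer-b"))
registry.unregister("broker-1")
```

Each recorded event is a point at which a real consumer or broker would recompute how work is divided, which is how the notifications lead to load balancing.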

Resources

[1] Kafka home: http://sna-projects.com/kafka/design.php

[2] Zero-copy principle: https://www.ibm.com/developerworks/linux/library/j-zerocopy/

[3] Kafka and Hadoop: http://sna-projects.com/sna/media/kafka_hadoop.pdf


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
