Introduction to the Kafka Messaging System

Source: Internet
Author: User
Tags: sendfile

1. Overview

Kafka is a messaging system that LinkedIn open-sourced in December 2010. It is used primarily to process active streaming data. Active streaming data is very common in web applications and includes page views (PV), which pages users visited, what content they searched for, and so on. This data is usually recorded as logs and then processed statistically at regular intervals.

Traditional log analysis systems provide a scalable way to process log information offline, but they usually introduce a large delay when used for real-time processing. Existing messaging (queuing) systems handle real-time or near-real-time applications well, but unconsumed data is usually not written to disk, which can be a problem for offline applications such as Hadoop that process only part of the data once an hour or once a day. Kafka is designed to solve these problems, and it can serve both offline and online applications well.

2. Design Objectives

(1) Data access on disk costs O(1). Data is commonly stored on disk in a B-tree, where access costs O(log N).

(2) High throughput. It can handle hundreds of thousands of messages per second even on ordinary nodes.

(3) Explicit distribution: there are multiple producers, brokers, and consumers, and all of them are distributed.

(4) Support for loading data into Hadoop in parallel.
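
Objective (1) can be illustrated with a minimal sketch, assuming an append-only log in which each message is addressed by its byte offset. The class name AppendOnlyLog and the length-prefixed record layout are illustrative assumptions, not Kafka's actual on-disk format. The point is that reading a message is a single seek to a known position, no matter how much data the log already holds.

```python
import os
import struct
import tempfile


class AppendOnlyLog:
    """Hypothetical append-only log: a message is addressed by its byte offset."""

    HEADER = struct.Struct(">I")  # 4-byte big-endian length prefix (assumed layout)

    def __init__(self, path):
        self.path = path
        open(path, "ab").close()  # create the file if it does not exist

    def append(self, payload: bytes) -> int:
        """Write one message at the end of the file and return its offset."""
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)
            f.write(self.HEADER.pack(len(payload)) + payload)
        return offset

    def read(self, offset: int):
        """Seek directly to the offset; return (payload, offset of next message)."""
        with open(self.path, "rb") as f:
            f.seek(offset)
            (length,) = self.HEADER.unpack(f.read(self.HEADER.size))
            payload = f.read(length)
        return payload, offset + self.HEADER.size + length


def demo_log():
    with tempfile.TemporaryDirectory() as d:
        log = AppendOnlyLog(os.path.join(d, "topic.log"))
        first = log.append(b"page_view:/home")
        second = log.append(b"search:kafka")
        payload, next_offset = log.read(first)
        return first, second, payload, next_offset
```

Calling demo_log() appends two messages and reads the first one back by its offset; the returned next offset is where the second message starts, so a reader can walk the log sequentially without any index lookup.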

3. Kafka Deployment Structure


Kafka has an explicitly distributed architecture: there can be multiple producers, brokers (Kafka), and consumers. Kafka acts like a cache that sits between active data and offline processing systems. Several basic concepts:

(1) The message is the basic unit of communication. Each producer can publish messages to a topic. If consumers subscribe to that topic, newly published messages are broadcast to those consumers.

(2) Kafka is explicitly distributed: multiple producers, consumers, and brokers can run on a large cluster and present themselves to the outside world as one logical whole. Multiple consumers can form a group, and each message is delivered to only one consumer within a group.
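
The two concepts above can be sketched with a small in-memory model. Everything here (the TinyBroker class, the topic and group names, and the round-robin choice of which group member receives a message) is an assumption made for illustration and is not Kafka's implementation. The sketch shows only the delivery rule: every subscribed group sees each message, but only one consumer inside a group receives it.

```python
from collections import defaultdict


class TinyBroker:
    """Hypothetical in-memory broker illustrating topic and group semantics."""

    def __init__(self):
        # topic -> group name -> list of consumer inboxes (plain lists)
        self.groups = defaultdict(lambda: defaultdict(list))
        # (topic, group) -> number of messages delivered so far
        self.sent = defaultdict(int)

    def subscribe(self, topic, group, inbox):
        self.groups[topic][group].append(inbox)

    def publish(self, topic, message):
        for group, inboxes in self.groups[topic].items():
            # Every group gets the message, but only one member receives it.
            # Round-robin is an assumed policy chosen for simplicity.
            index = self.sent[(topic, group)] % len(inboxes)
            inboxes[index].append(message)
            self.sent[(topic, group)] += 1


def demo_groups():
    broker = TinyBroker()
    a1, a2, b1 = [], [], []
    broker.subscribe("clicks", "analytics", a1)
    broker.subscribe("clicks", "analytics", a2)
    broker.subscribe("clicks", "archive", b1)
    for msg in ["m1", "m2", "m3"]:
        broker.publish("clicks", msg)
    return a1, a2, b1
```

In demo_groups(), the two consumers in the "analytics" group split the three messages between them, while the single consumer in the "archive" group receives all three.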

4. Kafka Key Technical Points

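(1) Zero-copy

In Kafka, two things can cause inefficiency: 1) too many network requests and 2) too many byte copies. To improve efficiency, Kafka groups messages into batches, and each request sends a batch of messages to the corresponding consumer. In addition, the sendfile system call is used to reduce byte copies. To understand how sendfile works, first consider the traditional way of sending a file through a socket: the data is copied from disk into a kernel buffer, from the kernel buffer into a user-space buffer, from the user-space buffer into the kernel socket buffer, and finally from the socket buffer to the network card.

With the sendfile system call, the round trip through user space is skipped: the data moves from the kernel buffer toward the network card without passing through the application.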
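
The difference between the two paths can be sketched in Python, which is used here only for illustration (Kafka itself is not written in Python). The function names send_with_copy and send_zero_copy are hypothetical. The standard library method socket.sendfile() uses os.sendfile() on platforms that provide it and falls back to ordinary send() calls otherwise, so the zero-copy behavior depends on the operating system.

```python
import os
import socket
import tempfile


def send_with_copy(sock, path, chunk_size=64 * 1024):
    """Traditional path: read() copies data into user space, sendall() copies it back."""
    sent = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            sock.sendall(chunk)
            sent += len(chunk)
    return sent


def send_zero_copy(sock, path):
    """Let the kernel move the file to the socket where sendfile is available."""
    with open(path, "rb") as f:
        return sock.sendfile(f)


def demo_sendfile():
    payload = b"kafka message batch"
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    left, right = socket.socketpair()
    try:
        n_copy = send_with_copy(left, tmp.name)
        n_zero = send_zero_copy(left, tmp.name)
        left.close()  # signal end of stream to the reader
        received = b""
        while data := right.recv(4096):
            received += data
    finally:
        left.close()
        right.close()
        os.unlink(tmp.name)
    return n_copy, n_zero, received
```

Both functions deliver the same bytes to the receiver; the difference is only in how many times the data is copied on the sending side.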

(2) Exactly-once message transfer

How is the processing state of each consumer recorded? Kafka saves only the offset up to which each consumer has processed data. This has two benefits: 1) very little data needs to be saved, and 2) when a consumer fails, the restarted consumer simply resumes processing from the most recent offset.
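
A minimal sketch of this idea follows, assuming the log is a plain Python list and the offset is a message index rather than a byte position. The OffsetConsumer class and its poll method are hypothetical names, not a Kafka client API. The sketch shows only that a single integer is enough to resume after a restart.

```python
class OffsetConsumer:
    """Hypothetical consumer whose only saved state is one integer offset."""

    def __init__(self, log, committed_offset=0):
        self.log = log                  # list standing in for the broker's log
        self.offset = committed_offset  # index of the next message to process

    def poll(self, max_messages):
        """Pull up to max_messages starting at the saved offset, then advance it."""
        batch = self.log[self.offset:self.offset + max_messages]
        self.offset += len(batch)
        return batch


def demo_restart():
    log = ["m0", "m1", "m2", "m3", "m4"]
    consumer = OffsetConsumer(log)
    before_crash = consumer.poll(3)
    saved_offset = consumer.offset  # the only state that has to survive the failure

    # Simulated restart: a new consumer object resumes from the saved offset.
    restarted = OffsetConsumer(log, committed_offset=saved_offset)
    after_restart = restarted.poll(10)
    return before_crash, saved_offset, after_restart
```

After the simulated restart, the new consumer continues with the fourth message and does not reread the first three.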

(3) Push/pull

Producers push data to Kafka, and consumers pull data from Kafka.

(4) Load balancing and fault tolerance

There is no load-balancing mechanism between producers and brokers.
ZooKeeper is used for load balancing between brokers and consumers. All brokers and consumers register in ZooKeeper, which stores some of their metadata. If a broker or consumer changes, all other brokers and consumers are notified.
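
The register-and-notify behavior described above can be sketched with an in-memory stand-in. The Registry class below is hypothetical and does not use the ZooKeeper API; it only models the idea that every registered node is told about membership changes of the others, which is the information a node would need in order to rebalance.

```python
from collections import defaultdict


class Registry:
    """Hypothetical in-memory stand-in for the coordination service."""

    def __init__(self):
        self.members = {}   # node id -> metadata
        self.watchers = {}  # node id -> callback taking the member list

    def register(self, node_id, metadata, on_change):
        self.members[node_id] = metadata
        self.watchers[node_id] = on_change
        self._notify(changed=node_id)

    def unregister(self, node_id):
        self.members.pop(node_id, None)
        self.watchers.pop(node_id, None)
        self._notify(changed=node_id)

    def _notify(self, changed):
        # Tell every other node what the current membership looks like.
        snapshot = sorted(self.members)
        for node_id, callback in self.watchers.items():
            if node_id != changed:
                callback(snapshot)


def demo_registry():
    seen = defaultdict(list)  # node id -> membership snapshots it was sent
    registry = Registry()
    for name in ["broker-1", "consumer-1", "consumer-2"]:
        registry.register(
            name, {"kind": name.split("-")[0]},
            lambda snapshot, n=name: seen[n].append(snapshot),
        )
    registry.unregister("broker-1")  # simulate a broker going away
    return seen
```

In demo_registry(), consumer-1 is notified once when consumer-2 joins and again when broker-1 leaves. A real deployment would react to such notifications by redistributing work, which this sketch leaves out.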

Resources

[1] Kafka home: http://sna-projects.com/kafka/design.php

[2] Zero-copy principle: https://www.ibm.com/developerworks/linux/library/j-zerocopy/

[3] Kafka and Hadoop: http://sna-projects.com/sna/media/kafka_hadoop.pdf
