Apache Kafka tutorial notes

This article is based on Kafka 0.8.

1. Introduction

At Internet companies, logs are everywhere: web logs, JS logs, search logs, monitoring logs, and so on. For offline analysis of these logs (with Hadoop), wget & rsync can meet the functional requirements, although the manual maintenance cost is high. For real-time analysis of the same logs (such as real-time recommendation and monitoring systems), however, it is often necessary to introduce a more specialized system.

Traditional enterprise messaging systems (such as WebSphere MQ) are not well suited to large-scale log processing, for two reasons:
1) They focus heavily on reliability, which adds complexity to both the implementation and the API, while in log processing losing a few records is usually acceptable.
2) Their design, including the API, scaling model, and message buffering, does not fit a high-throughput log processing system.

To address these problems, various companies have developed their own log collection systems, such as Facebook's Scribe, Yahoo's Data Highway, Cloudera's Flume, Apache's Chukwa, Baidu's BigPipe, and Alibaba's RocketMQ.

Kafka is a high-throughput distributed messaging system developed and open-sourced by LinkedIn. It has the following features:
1) Supports high-throughput applications.
2) Scale out: machines can be added without downtime.
3) Persistence: data is persisted to disk and replicated to prevent loss.
4) Supports both online and offline scenarios.

2. Architecture

Kafka is written in Scala and has clients in multiple languages (C++, Java, Python, and Go). Its architecture is as follows [2]:

Producer: message publisher
Broker: message-handling node; each Kafka server is a broker
Consumer: message subscriber

Kafka organizes messages in several layers:
1) Topic: a category of messages; for example, page-view logs and click logs can each be published as a topic. A Kafka cluster can host multiple topics at the same time.
2) Partition: a physical grouping within a topic. A topic can be divided into multiple partitions; each partition is an ordered queue, and each message within a partition is assigned a sequential id called the offset (see the topic-creation sketch after this list).
3) Message: the smallest unit of publishing and subscription.
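
To make the topic/partition layering concrete, here is a minimal sketch that creates a topic split into several partitions. It uses the AdminClient from the modern Java client library, which postdates the 0.8 release these notes describe; the topic name "page-view-logs", the partition/replica counts, and the broker address "localhost:9092" are placeholder assumptions, not values from the text.

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.Collections;
    import java.util.Properties;

    public class CreateTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder broker address; point this at any broker in the cluster.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // One topic ("page-view-logs") divided into 3 ordered partitions,
                // each replicated on 2 brokers.
                NewTopic topic = new NewTopic("page-view-logs", 3, (short) 2);
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }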

Specific process:
1. Based on a configurable partitioning method (round-robin, key hash, etc.), the producer publishes messages to a partition of the specified topic (see the producer sketch after this list).
2. After receiving a message from a producer, the Kafka cluster persists it to disk and retains it for a configurable period, regardless of whether it has been consumed.
3. The consumer pulls data from the Kafka cluster and itself controls the offset from which it reads.
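
As a rough illustration of step 1, the sketch below publishes a single message with the modern Java producer client (again, newer than the 0.8 API these notes are based on). The topic, key, value, and broker address are invented placeholders. With a key present, the default partitioner hashes the key to choose a partition; without one, records are spread across partitions.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;
    import java.util.Properties;

    public class ProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // The key "user-42" is hashed to pick a partition within "page-view-logs".
                producer.send(new ProducerRecord<>("page-view-logs", "user-42", "viewed /index.html"));
                producer.flush(); // block until the broker has received the batch
            }
        }
    }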

3. Design

Throughput
High throughput is one of Kafka's core design goals, so Kafka makes the following design choices:
1) On-disk persistence: messages are not cached in memory but written straight to disk, taking full advantage of the disk's sequential read/write performance.
2) Zero-copy: reduces the number of I/O steps when data is sent to consumers.
3) Batch data transmission (see the configuration sketch after this list).
4) Data compression.
5) Topics are divided into multiple partitions, improving parallelism.
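
Points 3 and 4 map directly onto settings of the modern Java producer client. The sketch below shows one plausible configuration; the batch size, linger time, compression codec, and broker address are illustrative assumptions rather than recommendations from the text.

    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;
    import java.util.Properties;

    public class ThroughputConfigExample {
        public static Properties throughputProps() {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // Batch data transmission: collect up to 64 KB per partition, or wait up
            // to 10 ms, before sending a request to the broker.
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);
            props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
            // Data compression: each batch is compressed as a whole ("gzip", "snappy",
            // "lz4", and "zstd" are the codecs supported by current clients).
            props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
            return props;
        }
    }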

Load balance & HA
1) The producer sends each message to a partition chosen by a user-specified algorithm.
2) A topic has multiple partitions, each partition has its own replicas, and the replicas are distributed across different broker nodes.
3) Among the replicas of each partition, one is elected as the lead partition; the lead partition handles all reads and writes, and ZooKeeper takes care of failover (the sketch after this list shows how to inspect leaders and replicas).
4) ZooKeeper also manages the dynamic addition and removal of brokers and consumers.
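
The leader/replica layout described in points 2 and 3 can be observed from a client. This is a minimal sketch using the modern AdminClient (which talks to the brokers directly rather than reading from ZooKeeper as 0.8-era tools did) to print each partition's leader and replica set; the topic name and broker address are placeholders.

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;
    import java.util.Collections;
    import java.util.Properties;

    public class DescribeTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription desc = admin
                        .describeTopics(Collections.singleton("page-view-logs"))
                        .all().get().get("page-view-logs");
                for (TopicPartitionInfo p : desc.partitions()) {
                    // Each partition has one leader (serving all reads and writes)
                    // and a set of replicas spread across brokers.
                    System.out.printf("partition %d: leader=%s replicas=%s%n",
                            p.partition(), p.leader(), p.replicas());
                }
            }
        }
    }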

Pull-based system
Since the Kafka broker persists data and is under no memory pressure, a consumer that pulls data is a natural fit and brings the following benefits:
1) It simplifies the design of Kafka itself.
2) The consumer controls the pull rate according to its own consumption capacity.
3) The consumer chooses its own consumption mode, such as batch consumption, repeated consumption, or consuming from the end (see the consumer sketch after this list).
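
A minimal sketch of pull-mode consumption with manual offset control, using the modern Java consumer (the 0.8-era high-level consumer API looked quite different). The topic, partition number, group id, and broker address are placeholders. seek() rewinds for repeated consumption; the commented-out seekToEnd() would instead start from the end.

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class PullConsumerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "log-analyzers");           // placeholder
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            // The consumer, not the broker, decides which offsets it has processed.
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition tp = new TopicPartition("page-view-logs", 0);
                consumer.assign(Collections.singleton(tp));
                consumer.seek(tp, 0L); // rewind: re-consume the partition from the start
                // consumer.seekToEnd(Collections.singleton(tp)); // or start from the end
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
                }
            }
        }
    }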

Scale Out
When a broker node needs to be added, the new broker registers itself with ZooKeeper, and producers and consumers notice the change through the watchers they have registered on ZooKeeper and adjust in time.
