Kafka Learning Road (ii)--Improve


The Message Sending Process

Because Kafka is inherently distributed, a Kafka cluster typically consists of multiple brokers. To balance load, a topic is divided into multiple partitions, and each broker stores one or more of them. Multiple producers and consumers can produce and fetch messages at the same time.

Process:

1. The producer publishes messages to the partitions of a specified topic according to a specified partitioning method (round-robin, hash, etc.); see the producer sketch after this list.

2. When the Kafka cluster receives a message from a producer, it persists the message to disk and retains it for a configurable length of time, without regard to whether the message has been consumed.

3. Consumers pull data from the Kafka cluster and control the offsets of the messages they fetch.
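A minimal sketch of step 1 using Kafka's standard Java producer client. The broker address, topic name, and key are placeholders; with a key, the default partitioner hashes it to choose a partition, and with a null key records are spread across partitions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("user-42") determines the partition under the default hash partitioner.
            producer.send(new ProducerRecord<>("my-topic", "user-42", "hello"));
        } // close() flushes any outstanding sends
    }
}
```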

Principle:

The producer encodes the message content with its own serialization method and then sends a publish request to the broker. To improve efficiency, a single publish request can carry a batch of messages.

The consumer subscribes to a topic and creates one or more message streams for it. Messages published to the topic are distributed evenly across these streams.

Each message stream provides an iterator interface over the continuously arriving messages.

The consumer iterates over every message in the stream and processes the message's payload.

The iterator never terminates: if there is currently no message, it blocks until a new message is published to the topic.
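In the current Java client this stream-iterator model appears as a poll loop: poll() blocks up to the given timeout, and the loop runs forever, mirroring the never-ending iterator. The broker address, group, and topic below are placeholders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "demo-group");              // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));
            while (true) { // the "never-ending iterator": keep polling
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}
```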

Kafka Storage

The Kafka storage layout is simple. Each partition of a topic corresponds to a logical log. Physically, a log is a set of segment files of roughly equal size. Each time a producer publishes a message to a partition, the broker appends it to the last segment file. When the number of published messages reaches a configured value, or after a configured interval, the segment file is actually flushed to disk; only once the write completes is the message exposed to consumers.

Unlike traditional messaging systems, messages stored in Kafka carry no explicit message ID; a message is addressed by its logical offset in the log. This avoids the overhead of maintaining an auxiliary, dense random-access index structure mapping message IDs to actual message addresses. Message offsets are increasing but not consecutive: to compute the offset of the next message, add the length of the current message to its logical offset.
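A minimal illustration of the byte-offset arithmetic this describes. (It matches the older Kafka versions this article was written about, where an offset was a byte position; modern Kafka assigns consecutive logical offsets per partition.)

```java
public class OffsetSketch {
    public static void main(String[] args) {
        // An offset is a byte position in the log, so the next message's
        // offset is the current offset plus the current message's length.
        long currentOffset = 1_048_576L; // hypothetical byte offset of the current message
        int messageLength = 512;         // hypothetical on-disk length of that message
        long nextOffset = currentOffset + messageLength; // = 1_049_088
        System.out.println(nextOffset);
    }
}
```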

Consumers always fetch messages from a particular partition sequentially; if a consumer knows the offset of a particular message, this implies it has consumed all the messages before it. The consumer sends an asynchronous pull request to the broker and prepares a byte buffer for the data to be consumed. Each pull request contains the offset of the message to consume from. Kafka uses the sendfile API to deliver bytes efficiently to consumers from the broker's log segment files.



Broker

Unlike other messaging systems, the Kafka broker is stateless: consumers must maintain the record of what they have consumed themselves; the broker does not track it.

The innovation of this design lies in:

· The broker applies a time-based SLA as its retention policy: when a message has been in the broker longer than a configured period, it is automatically deleted.

· A consumer can deliberately rewind to an old offset to consume data again. While this violates common queue conventions, it is common in many real businesses; a seek sketch follows this list.
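A sketch of that rewind using the Java consumer's seek API. The topic, partition, and offset are placeholders; assign() is used instead of subscribe() because the rewind targets one specific partition.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RewindSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("my-topic", 0); // placeholder topic/partition
            consumer.assign(List.of(tp));
            consumer.seek(tp, 12_345L); // rewind to an arbitrary older offset (placeholder)
            // Subsequent poll() calls re-deliver messages from offset 12345 onward.
        }
    }
}
```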



Relationship with ZooKeeper

Kafka uses ZooKeeper to manage and coordinate brokers. Each Kafka broker coordinates with the other brokers through ZooKeeper.

When a new broker joins the Kafka system, or an existing broker fails, the ZooKeeper service notifies producers and consumers, which accordingly begin coordinating their work with the remaining brokers.

The roles ZooKeeper plays in Kafka: Kafka stores its metadata in ZooKeeper, but the data sent to a topic itself is never sent to ZooKeeper.

· Kafka uses ZooKeeper to manage cluster configuration and the dynamic joining and leaving of brokers, producers, and consumers. Brokers register in ZooKeeper and keep the related metadata (topic and partition information, etc.) updated.

· Clients register watchers in ZooKeeper; when brokers change, the clients can still be load balanced automatically. The clients here are Kafka's message producers (Producer) and message consumers (Consumer).

· The broker side uses ZooKeeper to register broker information and to monitor the liveness of partition leaders.

· The consumer side uses ZooKeeper to register consumer information, including the list of partitions the consumer consumes; it is also used to discover the broker list, establish socket connections to the partition leaders, and fetch messages.

· ZooKeeper establishes no relationship with producers; it only maintains relationships with brokers and consumers to achieve load balancing, i.e. load balancing among the consumers in the same consumer group (a producer is transient: it can send and go away without having to wait).

The design of Kafka

1. Throughput

High throughput is one of the core goals Kafka needs to achieve, and to this end Kafka makes the following design choices (a producer configuration sketch follows the list):

1. Data persistence on disk: messages are not held in an in-process memory cache but written directly to disk, taking full advantage of the disk's sequential read/write performance.

2. Zero-copy: reduces the number of I/O steps.

3. Batched sending of data.

4. Data compression

5. A topic is divided into multiple partitions to improve parallelism.
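A sketch of how points 3 and 4 surface in the Java producer configuration. The property values are illustrative, not recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ThroughputSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", "65536");        // batch up to 64 KB per partition before sending
        props.put("linger.ms", "10");            // wait up to 10 ms to fill a batch
        props.put("compression.type", "snappy"); // compress each batch on the wire and on disk

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("my-topic", "msg-" + i));
            }
        } // close() flushes any remaining batched records
    }
}
```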

2. Load Balancing

1. The producer sends messages to a specified partition based on a user-specified algorithm (see the partitioner sketch after this list).

2. A topic consists of multiple partitions; each partition has its own replicas, distributed across different broker nodes.

3. From a partition's replicas a leader partition must be elected; the leader is responsible for reads and writes, and ZooKeeper handles failover.

4. Brokers and consumers dynamically joining and leaving the cluster are managed through ZooKeeper.
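A sketch of point 1: a user-specified partitioning algorithm plugged into the Java client through its Partitioner interface. The class name and hashing rule are made up for illustration:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Hypothetical user-specified algorithm: route each message by a hash of its key.
public class UserHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (key == null) {
            return 0; // simplistic fallback for keyless messages
        }
        // Mask the sign bit so the result is never negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```

The producer is pointed at it with props.put("partitioner.class", UserHashPartitioner.class.getName()).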

3. Pull System

Because Kafka brokers persist the data, a broker is under no memory pressure, so consumers are well suited to pull-based consumption, which has the following benefits (a consumer-side sketch follows the list):

1. It simplifies Kafka's design.

2. Consumers control the rate at which they pull messages according to their own consumption capacity.

3. Consumers can choose a consumption pattern according to their own situation, such as batch consumption, repeated consumption, or consuming from the end of the log.
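A sketch of what those pull-side controls look like on the Java consumer; the property values are illustrative:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PullModeSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder
        props.put("group.id", "demo-group");               // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("max.poll.records", "500");     // cap the batch returned by each poll()
        props.put("fetch.min.bytes", "1024");     // let the broker accumulate at least 1 KB per fetch
        props.put("auto.offset.reset", "latest"); // start from the end when no offset is committed

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));
            // poll() at whatever pace the application can handle; the broker keeps
            // the data on disk, so a slow consumer simply falls behind, nothing breaks.
        }
    }
}
```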

4. Scalability

When broker nodes need to be added, the new broker registers itself with ZooKeeper, and producers and consumers perceive the change through the watchers they registered in ZooKeeper and adjust in a timely manner.


Kafka Application Scenarios:

1. Message Queuing

Kafka offers better throughput, built-in partitioning, redundancy, and fault tolerance than most messaging systems, which makes it a good solution for large-scale message-processing applications. Messaging applications generally need only moderate throughput, but they often require low end-to-end latency and the kind of robust durability guarantees Kafka provides. In this field Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.

2. Behavioral Tracking

Another Kafka scenario is tracking users' browsing, searching, and other behaviors and recording them in real time into the corresponding topics in publish-subscribe fashion. Subscribers can then receive these events and process them in real time, monitor them, or load them into Hadoop or an offline data warehouse.

3. Meta-Information monitoring

Used as a monitoring module for operational records: collecting and recording operational information, which can be understood as operations-and-maintenance monitoring data.

4. Log Collection

Kafka can be used in place of a log-aggregation (log aggregation) tool. Log aggregation typically collects log files from servers and puts them in a centralized location (a file server or HDFS) for processing. Kafka, however, abstracts away the details of files and models logs or events more cleanly as a stream of messages. This gives Kafka-based processing lower latency and makes it easier to support multiple data sources and distributed consumption. Compared with log-centric systems such as Scribe or Flume, Kafka offers equally efficient performance, stronger durability guarantees thanks to replication, and lower end-to-end latency.

5. Stream Processing

This scenario may be the easiest to understand. Collected stream data is saved and handed to Storm or another stream-computing framework for processing. Many users take data from an original topic, then aggregate, enrich, or otherwise transform it into a new topic for further processing. For example, an article-recommendation pipeline might crawl article content from RSS feeds and throw it into a topic called "article"; later processing might clean that content, for example normalizing the data or removing duplicates, before finally returning content-match results to the user. Beyond the single original topic, this produces a series of topics forming a real-time data-processing pipeline. Storm and Samza are well-known frameworks for implementing this type of data transformation.
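Kafka itself later shipped a Streams API for exactly this topic-to-topic transformation. A minimal sketch of the "article" cleanup step described above; the output topic name and the cleanup rules are illustrative:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class ArticleCleanerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "article-cleaner");   // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> articles = builder.stream("article");
        articles
            .filter((url, body) -> body != null && !body.isBlank()) // illustrative cleanup rule
            .mapValues(String::trim)
            .to("article-clean"); // downstream topic for the next processing stage

        new KafkaStreams(builder.build(), props).start();
    }
}
```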

6. Event Source

Event sourcing is an application-design approach in which state changes are recorded as a time-ordered sequence of records. Kafka's ability to store large amounts of log data makes it an excellent backend for applications designed this way, such as news-feed aggregation (news feed).

7. Persistent log (commit log)

Kafka can serve as an external commit log for a distributed system. Such a log replicates data between nodes and provides a resynchronization mechanism for restoring the data of failed nodes. Kafka's log-compaction feature supports this usage. In this role Kafka is similar to the Apache BookKeeper project.



Key points of Kafka design:

1. Use the Linux filesystem's cache directly to cache data efficiently.

2. Use Linux zero-copy to improve transfer performance. A traditional data transfer requires 4 context switches; with the sendfile system call, the data is exchanged directly in kernel space, and the number of context switches drops to 2. According to test results, this can improve data-transfer performance by about 60%. Detailed technical background on zero-copy: https://www.ibm.com/developerworks/linux/library/j-zerocopy/ (a JVM sketch of sendfile appears after this list).

3. The cost of data access on disk is O(1). Kafka manages messages by topic: each topic contains multiple partitions, each partition corresponds to a logical log, and each log consists of multiple segments. Each segment stores multiple messages whose IDs are determined by their logical positions, so a message ID maps directly to the message's storage location, avoiding an extra mapping from ID to location. Each partition keeps an index in memory recording the offset of the first message in each segment. Messages sent to a topic by a publisher are distributed evenly across the partitions (randomly or via a user-specified callback function), and the broker appends each message to the last segment of its partition. When the number of messages in a segment reaches the configured value, or the messages have been held longer than the publishing threshold, the segment is flushed to disk; only messages flushed to disk can be subscribed to. Once a segment reaches a certain size, the broker stops writing to it and creates a new segment (an index-lookup sketch appears after this list).

4. Explicit distribution: producers, brokers, and consumers all come in multiples and are all distributed. There is no load-balancing mechanism between producers and brokers; ZooKeeper provides load balancing between brokers and consumers. All brokers and consumers register themselves in ZooKeeper, and ZooKeeper saves some of their metadata; when a broker or consumer changes, all the other brokers and consumers are notified.
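On the JVM, sendfile is exposed as FileChannel.transferTo, which is how the broker moves log-segment bytes to a consumer's socket without copying them through user space. A minimal standalone sketch of point 2; the file name, host, and port are placeholders:

```java
import java.io.FileInputStream;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySketch {
    public static void main(String[] args) throws Exception {
        try (FileChannel log = new FileInputStream("segment-00000.log").getChannel(); // placeholder file
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) { // placeholder
            long position = 0;
            long remaining = log.size();
            while (remaining > 0) {
                // transferTo maps to sendfile(2) on Linux: bytes go disk -> page cache -> NIC
                // without being copied into this process's user-space buffers.
                long sent = log.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```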
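And a sketch of the in-memory index lookup behind point 3: given the sorted base offsets of the segments, a floor lookup finds the segment file that holds a requested message offset. The data structures here are illustrative, not Kafka's actual classes:

```java
import java.util.Map;
import java.util.TreeMap;

public class SegmentIndexSketch {
    public static void main(String[] args) {
        // Illustrative per-partition index: base offset of each segment -> segment file name.
        TreeMap<Long, String> segments = new TreeMap<>();
        segments.put(0L,       "00000000000000000000.log");
        segments.put(100_000L, "00000000000000100000.log");
        segments.put(200_000L, "00000000000000200000.log");

        long wanted = 153_042L; // hypothetical offset a consumer asked for
        // floorEntry returns the greatest base offset <= wanted, i.e. the owning segment.
        Map.Entry<Long, String> entry = segments.floorEntry(wanted);
        System.out.println("offset " + wanted + " lives in " + entry.getValue()
                + " at relative position " + (wanted - entry.getKey()));
    }
}
```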

Reprint: please credit http://blog.csdn.net/tanggao1314/article/details/51932329
