Kafka Design and Principles in Detail

Source: Internet
Author: User
I. Introduction to Kafka
This article consolidates the Kafka articles I wrote earlier and can serve as comprehensive training and study material for learning Kafka.

Please credit the source when reprinting.

1.1 Background

In the era of big data we face several challenges. Business, social, search, browsing and other "information factories" continuously produce huge volumes of information: how do we collect it, how do we analyze it, and how do we do both in a timely manner?

These challenges form a business model: producers produce information, consumers consume it (process and analyze it), and between the two sits a messaging system that acts as a bridge connecting them. At a micro level, the same requirement can be understood as how messages are passed between different systems.

1.2 The Birth of Kafka

Kafka was open-sourced by LinkedIn.
Kafka is a framework for solving these problems; it provides a seamless connection between producers and consumers.
Kafka is a high-throughput distributed messaging system.

1.3 Kafka Now


Apache Kafka is a distributed messaging system based on publish-subscribe that is fast, scalable, and durable. It is now an open source project under Apache and is widely used by commercial companies as part of the Hadoop ecosystem. Its greatest strength is processing large amounts of data in real time to serve a variety of scenarios, such as Hadoop-based batch processing systems, low-latency real-time systems, and Storm/Spark streaming engines.

II. Kafka Technology Overview

2.1 Kafka Features

High throughput, low latency: Kafka can process hundreds of thousands of messages per second with a latency of only a few milliseconds.
Scalability: a Kafka cluster can be expanded while it is running (hot expansion).
Durability and reliability: messages are persisted to local disk, and data replication is supported to prevent data loss.
Fault tolerance: nodes in the cluster are allowed to fail (with n replicas, up to n-1 nodes can fail).
High concurrency: thousands of clients can read and write at the same time.

2.2 Some Important Kafka Design Ideas

The following is an overview of Kafka's main design ideas, so that readers can understand Kafka's characteristics in a short time. Each is described in more detail later in the article.

Consumer group: consumers can be organized into groups. Each message can be consumed by only one consumer within a group; if a message needs to be consumed by several consumers, those consumers must be in different groups.
Message state: in Kafka, the state of message consumption is kept by the consumer. The broker does not care which message has been consumed by whom; only an offset value (pointing to the next message to be consumed in the partition) is recorded. This means that if the consumer handles the offset badly, a message on the broker may be consumed more than once.
Message persistence: Kafka persists messages to the local file system while remaining extremely efficient.
Message retention: Kafka retains messages for a long time so that consumers can consume them multiple times; many of the details are configurable.
Batch sending: Kafka supports sending messages in batches (message sets) to improve push efficiency.
Push and pull: producers push messages to the broker and consumers pull messages from the broker; production and consumption are asynchronous with respect to each other.
Relationship between brokers in a cluster: there is no master-slave relationship. All brokers have equal standing in the cluster, and broker nodes can be added or removed freely.
Load balancing: Kafka provides a metadata API to manage load across brokers (this applies to Kafka 0.8.x; 0.7.x relies mainly on ZooKeeper for load balancing).
Synchronous and asynchronous sending: the producer can push asynchronously, which greatly improves Kafka's throughput (a parameter controls whether sending is synchronous or asynchronous).
Partitioning: the Kafka broker supports message partitioning. The producer decides which partition a message is sent to, and the order of messages within a partition is the order in which the producer sent them. A topic can have multiple partitions, and the number is configurable. Partitioning matters a great deal, as later sections show.
Offline data loading: because it supports scalable data persistence, Kafka is also well suited to loading data into Hadoop or a data warehouse.
Plugin support: an active community has developed many plugins that extend Kafka, for example plugins for Storm, Hadoop, and Flume.

2.3 Kafka Application Scenarios

Log collection: a company can use Kafka to collect logs from its various services and expose them through Kafka as a unified interface to consumers such as Hadoop, HBase, and Solr.
Messaging system: decoupling producers from consumers, buffering messages, and so on.
User activity tracking: Kafka is often used to record the activities of web or app users, such as browsing, searching, and clicking. Individual servers publish this activity to Kafka topics, and subscribers consume those topics for real-time monitoring and analysis, or load the data into Hadoop or a data warehouse for offline analysis and mining.
Operational metrics: Kafka is also often used to record operational monitoring data. This includes collecting data from distributed applications and producing centralized feedback on operations, such as alerts and reports.
Stream processing: for example, as an event source for Spark Streaming and Storm.

2.4 Kafka Architecture Components

In Kafka, the object of publishing and subscribing is the topic. We can create one topic for each type of data. A client that publishes messages to a topic is called a producer, and a client that subscribes to messages from a topic is called a consumer. Producers and consumers can write to and read from multiple topics at the same time. A Kafka cluster consists of one or more broker servers, which are responsible for persisting and replicating the messages.

Topic: the category (directory) under which messages are stored.
Producer: the party that publishes messages to a topic.
Consumer: the party that subscribes to a topic and consumes its messages.
Broker: each Kafka service instance is a broker.

2.5 Kafka Topics and Partitions

Messages are sent to a topic, which is essentially a directory. A topic consists of a number of partition logs, organized as shown in the following figure:

We can see that the messages within each partition are ordered. Produced messages are continually appended to the partition log, and each message is assigned a unique offset value.
The Kafka cluster retains all messages, whether or not they have been consumed. We can set a retention period, and only expired data is automatically deleted to free disk space. For example, if the retention period is set to 2 days, every message is kept in the cluster for 2 days after it is written, and only data older than 2 days is purged.
Kafka needs to maintain only one piece of metadata per consumer: the offset of the consumer's position in the partition, which advances by 1 for each message consumed. In fact, this state is controlled entirely by the consumer, which can track and reset the offset and therefore read messages from any position.
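The offset model described above can be sketched in a few lines of Python. This is a toy in-memory illustration, not Kafka's implementation; the class names PartitionLog and Consumer and their methods are hypothetical and chosen only for this example.

```python
class PartitionLog:
    """Toy in-memory stand-in for one partition's append-only log."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append a message and return the offset assigned to it."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at offset; the log is not modified."""
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Tracks its own position; the log knows nothing about its consumers."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        """Read from the current offset and advance it by the number of messages read."""
        batch = self.log.read(self.offset)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        """Reset the position; the consumer alone decides where to read from."""
        self.offset = offset


log = PartitionLog()
for text in ["a", "b", "c"]:
    log.append(text)

consumer = Consumer(log)
first = consumer.poll()     # reads everything appended so far
consumer.seek(1)            # rewind to offset 1
replayed = consumer.poll()  # re-reads from offset 1 onward
```

Rewinding with seek(1) re-reads messages that were already consumed, which is possible only because reading never removes anything from the log.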
Storing the message log as partitions serves several purposes. First, it makes the cluster easy to scale: each partition can be sized to fit the machine that hosts it, and since a topic can consist of many partitions, the cluster as a whole can handle data of any size. Second, it increases concurrency, because reads and writes are performed per partition.

III. Kafka Core Components

3.1 Replication, Partitions, and Leaders

As we have seen, data in Kafka is persistent and fault tolerant. Kafka lets the user set the number of replicas for each topic, which determines how many brokers store each piece of written data. If the number of replicas is set to 3, each piece of data is stored on 3 different machines, so up to 2 machines can fail. A replication factor of at least 2 is generally recommended, so that machines can be added, removed, or restarted without affecting data consumption. If durability requirements are higher, set the number of replicas to 3 or more.
A topic in Kafka is stored as partitions. Each topic can set its own number of partitions, which determines how many logs make up the topic. When a producer produces data, it publishes messages to the partitions of the topic according to a rule (which is customizable). The replication described above works at the partition level, but only one replica of each partition is elected leader and handles all reads and writes.
Several factors need to be considered when choosing the number of partitions. A partition can be consumed by only one consumer in a group (while one consumer can consume several partitions at once), so if there are fewer partitions than consumers, some consumers receive no data. The number of partitions should therefore be no smaller than the number of consumers running concurrently. It is also recommended that the number of partitions exceed the number of brokers in the cluster, so that leader partitions are spread evenly across brokers and the cluster load is balanced. At Cloudera, every topic has hundreds of partitions. Note that Kafka needs to allocate some memory for each partition to buffer message data, so a larger number of partitions requires a larger heap for Kafka.

3.2 Producers

Producers send messages directly to the broker hosting the leader partition, without passing through any intermediate routing layer. To make this possible, every broker in a Kafka cluster can answer producer requests with metadata about a topic: which machines are alive, where the topic's leader partitions are, and which leader partitions can currently be accessed directly.
The producer client itself controls which partition a message is pushed to. This can be done by random assignment (a kind of random load balancing) or by a specified partitioning algorithm. Kafka provides an interface for user-defined partitioning: the user can specify a partition key for each message and apply a hash-based partitioning algorithm to that key. For example, if the user ID is used as the partition key, all messages with the same user ID are pushed to the same partition.
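Key-based partitioning can be illustrated with a minimal Python sketch. The function name choose_partition is hypothetical, and CRC32 is used here only because it is a stable hash available in the standard library; it is not claimed to be the hash function Kafka itself uses.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a message key to a partition index using a stable hash.

    zlib.crc32 is used instead of the built-in hash(), because hash() of a
    string is randomized per Python process and would not give a stable mapping.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# The same key always lands on the same partition.
p1 = choose_partition("user-42", 8)
p2 = choose_partition("user-42", 8)
```

Because the mapping depends on the partition count, changing the number of partitions changes where existing keys land, which is one reason to choose the count carefully up front.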
Pushing data in batches greatly improves processing efficiency. A Kafka producer can accumulate messages in memory and send them as a single batch request once a threshold is reached. The batch threshold is controlled by producer parameters and can be expressed as a message count (such as 500), a time interval (such as 100 ms), or a data size (such as 64 KB). Larger batches reduce the number of network requests and disk I/O operations, but the specific settings are a tradeoff between efficiency and latency.
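A rough sketch of such a batching buffer follows, assuming the three thresholds mentioned above. The class BatchAccumulator and its parameter names are hypothetical, not Kafka producer settings, and for simplicity the time limit is checked only when a message is added (a real producer would also flush from a timer).

```python
import time


class BatchAccumulator:
    """Buffers messages and releases them as a batch when any limit is reached."""

    def __init__(self, max_messages=500, max_bytes=64 * 1024,
                 max_wait_seconds=0.1, clock=time.monotonic):
        self.max_messages = max_messages
        self.max_bytes = max_bytes
        self.max_wait_seconds = max_wait_seconds
        self._clock = clock
        self._buffer = []
        self._buffered_bytes = 0
        self._first_append_time = None

    def add(self, payload):
        """Buffer one message; return the full batch if a limit was hit, else None."""
        if not self._buffer:
            self._first_append_time = self._clock()
        self._buffer.append(payload)
        self._buffered_bytes += len(payload)
        if self._should_flush():
            return self.flush()
        return None

    def _should_flush(self):
        waited = self._clock() - self._first_append_time
        return (len(self._buffer) >= self.max_messages
                or self._buffered_bytes >= self.max_bytes
                or waited >= self.max_wait_seconds)

    def flush(self):
        """Return whatever is buffered and reset the accumulator."""
        batch, self._buffer = self._buffer, []
        self._buffered_bytes = 0
        self._first_append_time = None
        return batch


# With a limit of 3 messages, the third add() releases a batch.
acc = BatchAccumulator(max_messages=3, max_wait_seconds=60.0)
results = [acc.add(b"m%d" % i) for i in range(4)]
leftover = acc.flush()
```

Raising the limits means fewer, larger requests; lowering them means messages leave the producer sooner.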
Producers can send messages to Kafka asynchronously and in parallel, but normally the producer receives a future response after sending, which yields either the message's offset or an error encountered during sending. A very important parameter here is "acks", which determines how many replicas must have received the message before the leader partition acknowledges it to the producer. If acks is set to 0, the producer does not wait for any response from the broker, so it cannot know whether the message was sent successfully. This can lead to data loss, but it also gives the highest throughput.
If acks is set to 1, the producer receives a confirmation from the broker once the leader partition has received the message. This is more reliable, because the client waits until the broker acknowledges the message. If acks is set to -1, the producer receives the confirmation only after all replica partitions have received the message; this setting gives the strongest reliability guarantee.
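The three acks settings can be summarized in a small Python model. This is a deliberately simplified sketch: the functions acks_needed and send_outcome are hypothetical, and "all replicas" is modeled as a plain replica count rather than the broker's actual bookkeeping.

```python
def acks_needed(acks, replica_count):
    """Return how many replica acknowledgments the producer waits for."""
    if acks == 0:
        return 0              # fire and forget
    if acks == 1:
        return 1              # the leader only
    if acks == -1:
        return replica_count  # every replica, in this simplified model
    raise ValueError("unsupported acks value: %r" % (acks,))


def send_outcome(acks, replicas_acknowledged, replica_count):
    """Return True or False for a confirmed outcome, or None when it is unknown."""
    needed = acks_needed(acks, replica_count)
    if needed == 0:
        return None  # the producer never learns whether the write succeeded
    return replicas_acknowledged >= needed


unknown = send_outcome(0, 0, 3)
leader_only = send_outcome(1, 1, 3)
partial = send_outcome(-1, 2, 3)
full = send_outcome(-1, 3, 3)
```

The None result for acks=0 is the point of the model: throughput is highest precisely because the producer gives up knowing the outcome.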
A Kafka message consists of a fixed-length header and a variable-length byte array. Because the payload is a byte array, Kafka can carry any user-defined serialization format as well as existing formats such as Apache Avro and Protobuf. Kafka does not restrict the size of a single message by design, but we recommend keeping messages under 1 MB; the average message size is usually between 1 and 10 KB.

3.3 Consumers

Kafka provides two consumer APIs: the high-level API and the Simple API. The Simple API is a low-level API that maintains a connection to a single broker. It is completely stateless, so every request must specify an offset, which makes it the most flexible option.
In Kafka, the offset of the message currently being read is maintained by the consumer, so the consumer decides how to read the data in Kafka. For example, a consumer can re-consume data it has already consumed by resetting the offset. Whether or not data has been consumed, Kafka keeps it for a configurable period and deletes it only once it expires.
The high-level API hides the details of accessing the brokers in a cluster and lets a client consume a topic transparently. It maintains the consumption state itself, so each call returns the next message.
The high-level API also supports consuming a topic as a group. If all consumers share the same group name, Kafka behaves like a queue: the consumers share the load, each consuming the data from its assigned partitions. If the consumers have different group names, Kafka behaves like a broadcast service, delivering every message in the topic to each consumer.
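The queue and broadcast behaviors can be illustrated with a small Python model of partition ownership. The round-robin rule and the function names assign_partitions and deliver are assumptions made for this sketch; Kafka's actual assignment logic is not reproduced here.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, one owner per partition."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


def deliver(partitions, groups):
    """Each group independently receives every partition of the topic."""
    return {group: assign_partitions(partitions, members)
            for group, members in groups.items()}


partitions = [0, 1, 2, 3]

# Same group name: the consumers split the partitions (queue behavior).
queue_style = deliver(partitions, {"g1": ["c1", "c2"]})

# Different group names: every consumer sees all partitions (broadcast behavior).
broadcast_style = deliver(partitions, {"g1": ["c1"], "g2": ["c2"]})

# More consumers than partitions: the extra consumer gets nothing.
crowded = assign_partitions([0, 1], ["c1", "c2", "c3"])
```

The last case shows why a topic should not have fewer partitions than the number of consumers in a group.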
IV. Core Characteristics of Kafka

4.1 Compression

As we already know, Kafka supports sending messages in sets (batches). On top of this, Kafka also supports compressing a message set: the producer can compress it in gzip or snappy format, and data compressed by the producer must be decompressed by the consumer. The benefit of compression is a smaller volume of data to transfer and less pressure on the network. In big data processing the bottleneck is usually the network rather than the CPU (compression and decompression do consume some CPU).
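The benefit of compressing a whole message set rather than individual messages can be seen with Python's standard gzip module. The sample messages and the newline-joined framing below are invented for the illustration and are not Kafka's wire format.

```python
import gzip

# 200 small, similar messages, as is typical of activity logs (made-up sample data).
messages = [
    ('{"user": %d, "action": "page_view", "url": "/home"}' % i).encode("utf-8")
    for i in range(200)
]

# Compressing each message on its own pays the gzip overhead 200 times.
individually = sum(len(gzip.compress(m)) for m in messages)

# Compressing the set as one unit lets the codec exploit repetition across messages.
batch = b"\n".join(messages)
compressed_batch = gzip.compress(batch)
as_batch = len(compressed_batch)

# The consumer side reverses the process.
restored = gzip.decompress(compressed_batch).split(b"\n")
```

Comparing as_batch with individually shows how much smaller the batched form is for repetitive data.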
To distinguish compressed from uncompressed messages, Kafka adds an attribute byte to the message header that describes compression. The last two bits of this byte encode the compression codec; if they are 0, the message is not compressed.

4.2 Message Reliability

In a messaging system it is very important to guarantee message reliability during production and consumption. In actual message delivery, the following three situations can occur:

A message fails to be sent.
A message is sent multiple times.
Ideally, exactly-once: a message is sent successfully and only once.

Many systems claim exactly-once semantics, but they overlook the possibility that producers or consumers fail during production or consumption. For example, a producer may send a message successfully but the message is lost in transit; or the message reaches the broker and is fetched by the consumer, but the consumer fails while processing it.
On the producer side, Kafka handles this as follows: when a message is sent, the producer waits for confirmation that the broker received it (the wait time is controlled by a parameter). If the message is lost in transit or one of the brokers goes down, the producer resends it. (As we know, Kafka has a backup mechanism, and a parameter controls whether to wait until all backup nodes have received the message.)
On the consumer side: as described in the partition section above, the broker records an offset value for each partition that points to the next message the consumer will consume. If the consumer receives a message but crashes while processing it, it can use this offset to find the previous message again and reprocess it. The consumer also has permission to control this offset and can process the messages persisted on the broker in any way it chooses.

4.3 Backup Mechanism

The backup (replication) mechanism is a new feature in Kafka 0.8, and it greatly improves the reliability and stability of a Kafka cluster. With it, Kafka can tolerate node failures without affecting the operation of the cluster as a whole. A cluster with n backups tolerates the failure of n-1 nodes. Among all the backup nodes, one acts as the lead node; it holds the list of the other backup nodes and keeps them in sync. The following diagram illustrates Kafka's backup mechanism:

4.4 Kafka Efficiency-Related Design

4.4.1 Message Persistence


Kafka relies heavily on the file system to store and cache messages. Most people assume that disks are slow, which leads to doubt that a persistent structure can offer competitive performance. In fact, disks are both much slower and much faster than people expect, depending on how they are used.
A key fact about disk performance is that the throughput of hard drives has been diverging from seek latency: linear writes are far faster than random writes. For example, on an array of six 7200 rpm SATA disks in RAID-5, linear writes run at about 600 MB/sec, but random writes at only about 100 KB/sec, a difference of nearly 6,000 times. Linear reads and writes are the most predictable access pattern, so operating systems optimize them with read-ahead and write-behind: prefetching data in large blocks and combining several small logical writes into one large physical write. Further discussion can be found in the ACM Queue article, which finds that sequential disk access can in some cases be faster than random memory access.
To compensate for this performance gap, modern operating systems use free memory as a disk cache, at the cost of a small performance penalty when the memory is reclaimed. All disk reads and writes go through this unified cache.
In addition, since Kafka runs on the JVM, anyone familiar with Java memory management knows two things:

The memory overhead of objects is very high, often doubling the size of the stored data or worse.
As the amount of data in the heap grows, Java garbage collection becomes increasingly expensive.

Given these facts, using the file system and relying on the page cache is better than maintaining an in-memory cache or other structure. We at least double the available cache by having automatic access to all free memory, and probably double it again by storing a compact byte structure rather than individual objects. As a result, a 32 GB machine can offer a cache of up to 28-30 GB without GC penalties. Moreover, this cache stays warm even if the service restarts, whereas an in-process cache has to be rebuilt in memory (a 10 GB cache may take 10 minutes) or else start completely cold (which means very poor initial performance). It also greatly simplifies the code, since all the logic for keeping the cache and the file system coherent now lives in the operating system, which tends to do it more efficiently and more correctly than one-off in-process attempts. If your disk usage favors sequential reads, read-ahead effectively pre-populates this cache with useful data on every disk read.
This suggests a very simple design: rather than keeping as much as possible in memory and flushing to the file system only when necessary, we invert the approach. All data is written immediately to a persistent log on the file system, without an explicit flush to disk. In effect, this just means the data is transferred into the kernel's page cache and flushed later. A configuration option can then let the user control when data is flushed to the physical disk.

4.4.2 Constant-Time Performance Guarantee

The persistent data structure used in messaging systems is often a per-consumer queue with an associated B-tree or other general-purpose random-access structure for maintaining metadata about messages. B-trees are a versatile structure that can support both transactional and non-transactional semantics. But they come at a high cost: B-tree operations are O(log N). Normally O(log N) is considered essentially equivalent to constant time, but this is not true for disk operations. A disk seek takes about 10 ms, and each disk can do only one seek at a time, so parallelism is limited.
Intuitively, a persistent queue can be built on simple reads and appends to files, as is commonly done in logging solutions. Although this structure does not support the rich semantics of a B-tree, it has the advantage that all operations are constant time, and reads and writes do not block each other. This has obvious performance benefits: performance is completely decoupled from data size, and a server can take full advantage of cheap hard drives to provide an efficient messaging service.
Having access to virtually unlimited disk space without any performance penalty also means that we can offer features not usually found in messaging systems. For example, messages are not deleted as soon as they are consumed; we can retain them for a relatively long period (say, one week).

4.4.3 Further Improving Efficiency

We have put a lot of effort into efficiency. One of the main use cases is handling web activity data, which is very high volume: each page view can generate many writes. Furthermore, we assume that every published message is read by at least one consumer, so we strive to make consumption as cheap as possible.
With disk efficiency solved as described above, two common causes of inefficiency remain in this type of system:

Too many small I/O operations.
Too many byte copies.

To avoid large numbers of small I/O operations, the Kafka protocol is built around message sets. A producer can send a message set in a single network request instead of sending one message at a time. The server appends messages to the log in chunks, and the consumer fetches large linear chunks of data at a time. The message set, MessageSet, is itself a very thin API that wraps a byte array or a file. Message processing therefore has no separate serialization and deserialization step, and message fields are deserialized lazily, only when needed.
The other source of inefficiency is byte copying. To address it, Kafka defines a standardized binary message format shared by producers, brokers, and consumers. Kafka's message log on the broker is simply a directory of files, and these log files hold message sets written to disk in this same format.
Maintaining this common format allows the most important operation to be optimized: network transfer of persistent log chunks. Modern UNIX operating systems offer a highly efficient way to transfer data from the page cache to a socket. In Linux this is done with the sendfile system call (Java exposes it through the FileChannel.transferTo API).

To understand the impact of sendfile, it helps to understand the usual path for transferring data from a file to a socket:

The operating system reads data from the disk into the page cache in kernel space.
The application reads the data from kernel space into a user-space buffer.
The application writes the data back into kernel space, into a socket buffer.
The operating system copies the data from the socket buffer to the NIC buffer, from which it is sent over the network.

This is clearly inefficient: there are four copies and two system calls. Using sendfile avoids two of the copies: the operating system sends data directly from the page cache to the network, so in this optimized path only the final copy to the NIC buffer is needed.
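For illustration, the same operating-system facility can be reached from Python through socket.sendfile(), which uses os.sendfile() where the platform supports it and falls back to ordinary send() otherwise. The sketch below is not Kafka code (Kafka uses Java's FileChannel.transferTo); the helper name send_file_over_socket, the temporary file, and the local socket pair standing in for a consumer connection are assumptions made for the example.

```python
import os
import socket
import tempfile


def send_file_over_socket(path, sock):
    """Send a whole file through sock and return the number of bytes sent."""
    with open(path, "rb") as f:
        # No read() into a Python buffer: the kernel moves the bytes
        # when os.sendfile() is available on this platform.
        return sock.sendfile(f)


# A small file standing in for a log segment.
payload = b"kafka log segment bytes\n" * 100
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

# A connected pair of local sockets stands in for broker and consumer.
sender, receiver = socket.socketpair()
try:
    sent = send_file_over_socket(path, sender)
    sender.close()  # signals end of stream to the receiver
    received = b""
    while True:
        chunk = receiver.recv(65536)
        if not chunk:
            break
        received += chunk
finally:
    sender.close()
    receiver.close()
    os.remove(path)
```

The application code never touches the file contents on the sending side, which is the property the zero-copy path relies on.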
We expect multiple consumers on a topic to be a common use case. With the zero-copy optimization above, data is copied into the page cache exactly once and reused on each consumption, instead of being held in memory and copied out to user space on every read. This allows messages to be consumed at a rate approaching the limit of the network connection. With the combination of page cache and sendfile, a Kafka cluster serves almost everything from cache, and even with many consumers downstream there is little pressure on the cluster as a whole.
For more on sendfile and zero-copy, see: zero-copy

V. Kafka Cluster Deployment

5.1

To improve performance, it is recommended to deploy the Kafka cluster on dedicated servers, kept separate from the Hadoop cluster as far as possible. Kafka relies on disk reads and writes and on a large page cache, and sharing nodes with Hadoop degrades the performance of the page cache.
The size of the Kafka cluster should be determined from the hardware configuration, the number of concurrent producers, the number of replicas, and the data retention period.
Disk throughput is particularly important, because the bottleneck in Kafka is usually the disk.
Kafka depends on ZooKeeper. It is recommended to deploy the ZooKeeper cluster on dedicated servers with an odd number of nodes, typically 3, 5, or 7. Note that the larger the ZooKeeper cluster, the slower its reads and writes, because ZooKeeper must synchronize data between nodes. A 3-node ZooKeeper cluster tolerates the failure of 1 node, and a 5-node cluster tolerates the failure of 2.

5.2 Cluster Size

Many factors determine the storage capacity a Kafka cluster needs. The most accurate way to measure it is to simulate the load, and Kafka itself ships with load-testing tools.
If you do not want to size the cluster by running simulations, the next best approach is to calculate from disk space requirements. Below is an estimate based on network and disk throughput requirements.
Assume the following:

W: MB written per second
R: number of replicas
C: number of consumers

In general, the bottleneck of a Kafka cluster is network and disk throughput, so we evaluate the cluster's network and disk requirements first.
Every message is written once per replica, so the overall write rate is W*R. Reads come from two sources: replicas reading from the leader to stay in sync, and consumers outside the cluster. The intra-cluster read rate is (R-1)*W and the external consumer read rate is C*W, so:

Write: W*R
Read: (R-1)*W + C*W
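The two formulas above translate directly into a small Python helper. The function name cluster_throughput and the sample numbers (50 MB/s, 3 replicas, 2 consumers) are invented for illustration.

```python
def cluster_throughput(w_mb_per_sec, replicas, consumers):
    """Return (write, read) throughput in MB/s for the whole cluster.

    write = W * R                 every replica writes each message once
    read  = (R - 1) * W + C * W   follower replication plus external consumers
    """
    write = w_mb_per_sec * replicas
    read = (replicas - 1) * w_mb_per_sec + consumers * w_mb_per_sec
    return write, read


# Example: producers write 50 MB/s, 3 replicas, 2 consumers.
write_rate, read_rate = cluster_throughput(50, 3, 2)
```

With these inputs the cluster must sustain 150 MB/s of writes and 200 MB/s of reads in total, spread across its brokers.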

Note that reads can be served partly from cache to reduce I/O. If a cluster has M MB of memory and the write rate is W MB/sec, then M/(W*R) seconds of writes can be held in cache. For example, with 32 GB of memory and writes arriving at 50 MB/s, at least 10 minutes of data can be cached (32 × 1024 / 50 ≈ 655 seconds, taking 50 MB/s as the total write rate W*R).

5.3 Kafka Performance Test

Performance testing

5.4 Kafka Data Structures in ZooKeeper

Kafka data structures in ZooKeeper

VI. Kafka Main Configuration

6.1 Broker Config

Property | Default value | Description
broker.id | | Required parameter; the broker's unique identifier.
log.dirs | /tmp/kafka-logs | The directories where Kafka data is stored. Several directories can be specified, separated by commas. When a new partition is created, it is placed in the directory that currently holds the fewest partitions.
port | 9092 | The port on which the broker server accepts client connections.
zookeeper.connect | null | The ZooKeeper connection string, in the format hostname1:port1,hostname2:port2,hostname3:port3. One or more servers can be listed; listing all of them is recommended for reliability. This setting can also specify a ZooKeeper path under which all data for this Kafka cluster is stored. To keep it separate from other applications, it is recommended to specify such a cluster directory, in the format hostname1:port1,hostname2:port2,hostname3:port3/chroot/path. The consumer's corresponding parameter must match this value.
message.max.bytes | 1000000 | The maximum message size the server will accept. This should be consistent with the consumer's maximum.message.size; otherwise consumers cannot consume messages that producers were allowed to publish because they are too large.
num.io.threads | 8 | The number of I/O threads the server uses to execute read and write requests; it should be at least the number of disks on the server.
queued.max.requests | 500 | The size of the queue of requests waiting for the I/O threads. If the number of pending requests exceeds this value, the network threads stop accepting new requests.
socket.send.buffer.bytes | 100 * 1024 | The SO_SNDBUF buffer size the server prefers for socket connections.
socket.receive.buffer.bytes | 100 * 1024 | The SO_RCVBUF buffer size the server prefers for socket connections.
socket.request.max.bytes | 100 * 1024 * 1024 | The maximum request size the server accepts, to guard against out-of-memory errors. It should be smaller than the Java heap size.
num.partitions | 1 | The default number of partitions, used when a topic is created without specifying a partition count. Changing it to 5 is recommended.
log.segment.bytes | 1024 * 1024 * 1024 | The segment file size. When a segment exceeds this size, a new segment is created automatically. Can be overridden by topic-level parameters.
log.roll.{ms,hours} | 24 * 7 hours | The time after which a new segment file is created. Can be overridden by topic-level parameters.
log.retention.{ms,minutes,hours}
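As an illustration of the zookeeper.connect format described in the table, the following Python sketch splits such a string into its server list and optional chroot path. The function name parse_zookeeper_connect and the host names zk1 to zk3 are hypothetical; the parser assumes every entry carries an explicit numeric port.

```python
def parse_zookeeper_connect(value):
    """Split 'host1:port1,host2:port2/chroot/path' into (servers, chroot).

    servers is a list of (host, port) tuples; chroot is '' when no path is given.
    """
    hosts_part, slash, chroot_part = value.partition("/")
    chroot = "/" + chroot_part if slash else ""
    servers = []
    for entry in hosts_part.split(","):
        host, _, port = entry.strip().partition(":")
        servers.append((host, int(port)))
    return servers, chroot


servers, chroot = parse_zookeeper_connect("zk1:2181,zk2:2181,zk3:2181/kafka/cluster-a")
plain_servers, plain_chroot = parse_zookeeper_connect("zk1:2181")
```

The chroot suffix applies to the whole connection string, not to the last host only, which is why it is split off before the host list is parsed.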
