Table of contents
- 1. Introduction
- 2. Related Work
- 3. Kafka architecture and design principles
Kafka references
http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
http://incubator.apache.org/kafka
http://prezi.com/sj433kkfzckd/kafka-bringing-reliable-stream-processing-to-a-cold-dark-world/
http://sna-projects.com/blog/2011/08/kafka/
http://sna-projects.com/sna/media/kafka_hadoop.pdf
- https://github.com/kafka-dev/kafka/tree/master/clients, various Kafka clients
- Chinese translation of the design document, http://www.oschina.net/translate/kafka-design
Kafka: a distributed messaging system for log processing
1. Introduction
We have built a novel messaging system for log processing called Kafka [18] that combines the benefits of traditional log aggregators and messaging systems.
On the one hand, Kafka is distributed and scalable, and offers high throughput.
On the other hand, Kafka provides an API similar to a messaging system and allows applications to consume log events in real time.
It can be understood as a distributed producer-consumer architecture.
2. Related Work
Since there are already so many log aggregators and messaging systems, why do we still need Kafka?
Comparison with traditional messaging systems
1. MQ and JMS products offer strong delivery guarantees. Log aggregation does not need these; losing a few log events does not matter, and such features greatly increase system complexity.
2. They were designed before the big-data era, so throughput was not a primary design goal; for example, batch delivery is not supported.
3. Distributed support is weak.
4. They assume messages are consumed almost immediately, so consumption must keep up; otherwise performance suffers when the queue of unconsumed messages grows too long.
Traditional messaging systems tend not to be a good fit for log processing.
First, there is a mismatch in features offered by enterprise systems.
For example, IBM WebSphere MQ [7] has transactional support that allows an application to insert messages into multiple queues atomically. The JMS [14] specification allows each individual message to be acknowledged after consumption, potentially out of order.
Second, many systems do not focus as strongly on throughput as their primary design constraint.
Third, those systems are weak in distributed support.
Finally, many messaging systems assume near immediate consumption of messages, so the queue of unconsumed messages is always fairly small.
Comparison with existing log aggregators: the pull model
A number of specialized log aggregators have been built over the last few years.
Facebook uses a system called Scribe. Each frontend machine can send log data to a set of Scribe machines over sockets. Each Scribe machine aggregates the log entries and periodically dumps them to HDFS [9] or an NFS device.
Yahoo's data highway project has a similar dataflow. A set of machines aggregate events from the clients and roll out "minute" files, which are then added to HDFS.
Flume is a relatively new log aggregator developed by Cloudera. It supports extensible "pipes" and "sinks", and makes streaming log data very flexible. It also has more integrated distributed support. However, most of those systems are built for consuming the log data offline, and often expose implementation details unnecessarily (e.g. "minute files") to the consumer.
Most of them use a "push" model in which the broker forwards data to consumers. At LinkedIn, we find the "pull" model more suitable for our applications since each consumer can retrieve the messages at the maximum rate it can sustain and avoid being flooded by messages pushed faster than it can handle.
Why pull instead of push? Only the consumer knows how fast it can consume, so it is unreasonable for the broker to blindly push data without regard to the consumer's capacity.
It is not that earlier systems could not think of this. The point is that earlier systems targeted offline consumers that dump data straight into HDFS without online analysis, so the consumer is not at risk of being flooded; in that case, push is simpler.
3. Kafka architecture and design principles
We first introduce the basic concepts in Kafka.
A stream of messages of a particular type is defined by a topic.
A producer can publish messages to a topic.
The published messages are then stored at a set of servers called brokers.
A consumer can subscribe to one or more topics from the brokers, and consume the subscribed messages by pulling data from the brokers.
To balance load, a topic is divided into multiple partitions and each broker stores one or more of those partitions.
Partitioning within each topic is what ensures load balance.
This kind of partitioning is reasonable because topics differ in popularity; simply placing different topics on different brokers could lead to load imbalance.
By default, messages are assigned to partitions randomly, and you can plug in a more suitable partitioning policy.
3.1 Efficiency on a single partition
Simple storage
Kafka has a very simple storage structure.
1. The unit of storage is the partition, and each partition is actually a set of segment files; a set of files is used to prevent any single file from growing too large.
Logically, you can think of a partition as one log file, with new messages appended to the end of the file.
As with any file system, a message becomes visible to consumers only after it has been flushed.
2. Logical offsets are used instead of message IDs, reducing the overhead of maintaining index structures.
Kafka has a very simple storage layout.
1. Each partition of a topic corresponds to a logical log.
Physically, a log is implemented as a set of segment files of approximately the same size (e.g., 1 GB).
Every time a producer publishes a message to a partition, the broker simply appends the message to the last segment file.
A message is only exposed to the consumers after it is flushed.
2. A message stored in Kafka doesn't have an explicit message id. Instead, each message is addressed by its logical offset in the log. This avoids the overhead of maintaining auxiliary, seek-intensive random-access index structures that map the message IDs to the actual message locations.
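To make this layout concrete, here is a rough Java sketch (not the broker's actual code): segment files are keyed by the logical offset of their first message, and serving a fetch at a given offset only requires finding the segment with the largest base offset not exceeding it. The class, method names, and ".kafka" file extension are made up for illustration.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch of the layout described above: a partition is a set of
// segment files, each keyed by the logical offset of its first message.
class PartitionLog {
    private final NavigableMap<Long, String> segments = new TreeMap<>();

    void addSegment(long baseOffset) {
        // e.g. "00000000001073741824.kafka" for the segment starting at that offset
        segments.put(baseOffset, String.format("%020d.kafka", baseOffset));
    }

    // A fetch request carries a logical offset; the broker only needs the segment
    // whose base offset is the largest one not exceeding the requested offset.
    String findSegment(long offset) {
        Map.Entry<Long, String> entry = segments.floorEntry(offset);
        if (entry == null) throw new IllegalArgumentException("offset before first segment");
        return entry.getValue();
    }

    public static void main(String[] args) {
        PartitionLog log = new PartitionLog();
        log.addSegment(0L);
        log.addSegment(1_073_741_824L);                       // roughly 1 GB per segment
        System.out.println(log.findSegment(1_500_000_000L));  // resolves to the second segment
    }
}
```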
Efficient transfer
1. Sending a batch of messages in one request improves throughput.
2. The file system page cache is used instead of an application-level memory cache.
3. sendfile is used to bypass application-layer buffers and transfer data directly from file to socket (provided the application logic does not need to inspect the content being sent).
We are very careful about transferring data in and out of Kafka.
1. A producer can submit a set of messages in a single send request. A consumer also retrieves multiple messages up to a certain size, typically hundreds of kilobytes.
2. Another unconventional choice that we made is to avoid explicitly caching messages in memory at the Kafka layer. Instead, we rely on the underlying file system page cache.
What are the advantages of using the file system page cache? See the Kafka design document.
First, using the page cache directly is simple and efficient; there is no need to build and manage a separate in-process buffer.
In addition, the page cache is managed by the operating system, so it remains warm even if the broker process restarts or crashes; nothing is lost with the process.
Finally, and most importantly, Kafka's access pattern in this scenario is sequential reads and writes, which the operating system's caching heuristics handle very well.
This has the main benefit of avoiding double buffering --- messages are only cached in the page cache.
This has the additional benefit of retaining warm cache even when a broker process is restarted.
Since both the producer and the consumer access the segment files sequentially, with the consumer often lagging the producer by a small amount, normal operating system caching heuristics are very effective.
3. We optimize the network access for consumers.
On Linux and other UNIX operating systems, there exists a sendfile API [5] that can directly transfer bytes from a file channel to a socket channel.
This saves copying steps. Normally, sending a file to a socket involves: (1) reading the data from the storage media into the page cache, (2) copying the data in the page cache to an application buffer, (3) copying the application buffer to another kernel buffer, and (4) sending that kernel buffer to the socket. sendfile eliminates steps (2) and (3).
Since Kafka does not keep an application-level memory buffer anyway, copying data directly from the page cache to the kernel socket buffer is more efficient.
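On the JVM this zero-copy path is available through FileChannel.transferTo, which delegates to sendfile on Linux. A minimal sketch, assuming we just want to push a chunk of a segment file to a consumer's socket (an illustration, not Kafka's actual transfer code):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch only: send `count` bytes of a segment file, starting at `position`,
// straight to a consumer's socket without copying through user space.
final class ZeroCopySend {
    static long sendChunk(Path segmentFile, SocketChannel socket,
                          long position, long count) throws IOException {
        try (FileChannel channel = FileChannel.open(segmentFile, StandardOpenOption.READ)) {
            long sent = 0;
            while (sent < count) {
                long n = channel.transferTo(position + sent, count - sent, socket);
                if (n <= 0) break;          // socket buffer full or end of file
                sent += n;
            }
            return sent;
        }
    }
}
```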
Stateless broker
Unlike most other messaging systems, in Kafka, the information about how much each consumer has consumed is not maintained by the broker, but by the consumer itself.
Since consumers pull, the broker does not need to know how much each consumer has read; with push it would have to.
The problem is that the broker does not know when a consumer will come back to pull, so for deleting messages it uses a very simple method: a time-based retention SLA, deleting a message after a fixed period, for example 7 days.
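A minimal sketch of such a time-based retention sweep, assuming one directory per partition and segment files named as in the earlier sketch (the ".kafka" extension and helper names are invented for the example):

```java
import java.io.File;
import java.time.Duration;

// Sketch of the simple time-based SLA described above: drop any segment file
// whose last modification is older than the retention window (e.g. 7 days).
final class RetentionSweep {
    static void sweep(File partitionDir, Duration retention) {
        long cutoff = System.currentTimeMillis() - retention.toMillis();
        File[] segments = partitionDir.listFiles((dir, name) -> name.endsWith(".kafka"));
        if (segments == null) return;
        for (File segment : segments) {
            if (segment.lastModified() < cutoff) {
                segment.delete();           // messages older than the SLA are discarded
            }
        }
    }
}
```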
There is an important side benefit of this design. A consumer can deliberately rewind back to an old offset and re-consume data.
This feature is very convenient, for example for testing. With most queues a message can be read only once, so repeating a test on the same data is troublesome; with Kafka you simply reset the offset.
The same applies when a consumer fails before its output has been written successfully: it can just re-read from the last recorded offset.
The drawback is that Kafka does not expose a convenient interface for manipulating the offset, so this otherwise nice feature is not that easy to use in practice.
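To make the rewind idea concrete, here is a hypothetical sketch; the pull call stands in for whatever fetch request the client issues and is not a real Kafka API, and the offset arithmetic follows the paper's description that the next offset is the current offset plus the message length.

```java
// Hypothetical consumer loop illustrating the rewind described above.
// `pull` is a stand-in for the client's fetch call, not a real Kafka API.
final class RewindSketch {
    interface Broker {
        byte[][] pull(String topic, int partition, long offset, int maxBytes);
    }

    static void reconsume(Broker broker, String topic, int partition, long oldOffset) {
        long offset = oldOffset;            // deliberately rewind to an old offset
        while (true) {
            byte[][] messages = broker.pull(topic, partition, offset, 256 * 1024);
            if (messages.length == 0) break;
            for (byte[] message : messages) {
                process(message);
                offset += message.length;   // logical offsets advance by message size
            }
        }
    }

    static void process(byte[] message) { /* application logic goes here */ }
}
```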
3.2 Distributed Coordination
We now describe how the producers and the consumers behave in a distributed setting.
Each producer can publish a message to either a randomly selected partition or a partition semantically determined by a partitioning key and a partitioning function. We will focus on how the consumers interact with the brokers.
For the producer this is easy: it either picks a partition at random or routes to one by hashing a key.
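A minimal sketch of these two routing choices on the producer side (the method names are illustrative, not Kafka's actual partitioner API):

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch of the two routing options described above.
final class PartitionRouting {
    // Random: spread load evenly when no per-key ordering is needed.
    static int randomPartition(int numPartitions) {
        return ThreadLocalRandom.current().nextInt(numPartitions);
    }

    // Semantic: hash a partitioning key so that all messages with the same key
    // (e.g. the same user id) land in the same partition.
    static int keyedPartition(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```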
The consumer side is more complicated. A topic has many partitions, and for efficiency multiple consumers must consume in parallel; the question is how to coordinate the consumers.
Kafka has the concept of consumer groups. Each consumer group consists of one or more consumers that jointly consume a set of subscribed topics, i.e., each message is delivered to only one of the consumers within the group.
You can think of a consumer group as a single logical consumer that needs each message to be consumed exactly once; the group exists only to allow concurrent consumption.
Across different groups, each group consumes a message once, so no coordination is needed between groups.
The problem is that consumers within the same group must coordinate so that each message is consumed only once, and the goal is to minimize this coordination overhead.
1. To simplify the design, there is no concurrency within a partition; only concurrency across partitions is supported.
Our first decision is to make a partition within a topic the smallest unit of parallelism. This means that at any given time, all messages from one partition are consumed only by a single consumer within each consumer group.
A partition has only one consumer at a time, which avoids the locking and state-maintenance overhead of coordinating multiple readers.
Could we dedicate a consumer to each partition? That would be too wasteful, since the number of partitions is usually much larger than the number of consumers.
Therefore a consumer has to cover multiple partitions, which leads to a problem: when the number of partitions or consumers changes, a rebalance is needed to re-allocate the ownership relationships. Consumers only need to coordinate at these moments, so the coordination overhead stays relatively low.
The biggest problem with this design is that a single slow partition (or a few) can slow down the entire processing, because a partition can have only one consumer and the other consumers cannot help even when they are idle.
Therefore you must ensure that the production rate of each partition roughly matches its consumption rate, otherwise problems arise.
For example, the number of partitions must be chosen carefully: if it is not evenly divisible by the number of consumers, the assignment becomes uneven.
Personally I think this is not an ideal design and a better choice should exist...
2. Using ZooKeeper instead of a central master
The second decision that we made is to not have a central "master" node, but instead let consumers coordinate among themselves in a decentralized fashion.
Kafka uses zookeeper for the following tasks:
(1) detecting the addition and the removal of brokers and consumers,
(2) triggering a rebalance process in each consumer when the above events happen, and
(3) maintaining the consumption relationship and keeping track of the consumed offset of each partition.
Specifically, when each broker or consumer starts up, it stores its information in a broker or consumer registry in ZooKeeper.
The broker registry (ephemeral) contains the broker's host name and port, and the set of topics and partitions stored on it.
The consumer registry (ephemeral) includes the consumer group to which a consumer belongs and the set of topics that it subscribes to.
The ownership registry (ephemeral) has one path for every subscribed partition, and the path value is the ID of the consumer currently consuming from this partition (we use the terminology that the consumer owns this partition).
The offset registry (persistent) stores, for each subscribed partition, the offset of the last consumed message in the partition (for each consumer group).
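A hedged sketch of how a consumer could register itself and commit offsets with the plain ZooKeeper Java client; the znode paths and payload formats here are illustrative assumptions, not necessarily the exact layout Kafka uses.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import java.nio.charset.StandardCharsets;

// Sketch of the registries described above; paths are illustrative only.
final class ConsumerRegistration {
    static void register(ZooKeeper zk, String group, String consumerId,
                         String topics) throws KeeperException, InterruptedException {
        // Consumer registry: ephemeral, so the entry disappears if the consumer dies.
        zk.create("/consumers/" + group + "/ids/" + consumerId,
                  topics.getBytes(StandardCharsets.UTF_8),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    static void commitOffset(ZooKeeper zk, String group, String topicPartition,
                             long offset) throws KeeperException, InterruptedException {
        // Offset registry: persistent, so it survives consumer restarts.
        String path = "/consumers/" + group + "/offsets/" + topicPartition;
        byte[] data = Long.toString(offset).getBytes(StandardCharsets.UTF_8);
        if (zk.exists(path, false) == null) {
            zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } else {
            zk.setData(path, data, -1);     // -1: ignore the znode version check
        }
    }
}
```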
When brokers or consumers change, the corresponding ephemeral registries follow the change automatically, which keeps things simple.
At the same time, a rebalance is triggered on the consumers, and entries in the ownership registry are added, removed, or modified according to the rebalance result.
Only the offset registry is persistent. No matter how the consumers change, recording each group's offset on each partition is all that is needed to coordinate consumption within a group.
3. Consumer rebalance
Algorithm 1: rebalance process for consumer Ci in group G   # run by each consumer in the group
For each topic T that Ci subscribes to {   # handled per topic; different topics have different numbers of partitions
  remove partitions owned by Ci from the ownership registry   # first clear the existing ownership
  read the broker and the consumer registries from ZooKeeper
  compute PT = partitions available in all brokers under topic T   # the partition list of T
  compute CT = all consumers in G that subscribe to topic T   # the consumer list for T
  sort PT and CT   # sort both lists
  let j be the index position of Ci in CT and let N = |PT| / |CT|   # Ci's position in the sorted consumer list
  assign partitions from j*N to (j+1)*N - 1 in PT to consumer Ci
  for each assigned partition p {
    set the owner of p to Ci in the ownership registry   # update ownership
    let Op = the offset of partition p stored in the offset registry   # read the last committed offset
    invoke a thread to pull data in partition p from offset Op   # one thread per partition for concurrency
  }
}
The key to the algorithm is the range j*N to (j+1)*N - 1.
It is actually very simple: with 10 partitions and 2 consumers, each consumer should handle N = 10/2 = 5 partitions.
Which 5 partitions go to which consumer is determined by the consumer's position j in the sorted consumer list: consumer 0 gets partitions 0 to 4 and consumer 1 gets partitions 5 to 9.
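A small Java sketch of this range assignment, following Algorithm 1 literally (the names and the handling of leftover partitions are illustrative): every consumer can run the same deterministic computation and arrive at the same result without talking to the others.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the deterministic range assignment used in Algorithm 1.
final class RangeAssignment {
    static List<Integer> partitionsFor(String consumerId, List<String> consumers,
                                       List<Integer> partitions) {
        List<String> ct = new ArrayList<>(consumers);
        List<Integer> pt = new ArrayList<>(partitions);
        Collections.sort(ct);               // sort CT
        Collections.sort(pt);               // sort PT
        int j = ct.indexOf(consumerId);     // this consumer's index in CT
        int n = pt.size() / ct.size();      // N = |PT| / |CT|
        List<Integer> owned = new ArrayList<>();
        for (int i = j * n; i <= (j + 1) * n - 1 && i < pt.size(); i++) {
            owned.add(pt.get(i));           // partitions j*N .. (j+1)*N - 1
        }
        return owned;
    }

    public static void main(String[] args) {
        // 10 partitions, 2 consumers: c0 gets 0-4, c1 gets 5-9.
        List<String> consumers = List.of("c0", "c1");
        List<Integer> partitions = List.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
        System.out.println(partitionsFor("c1", consumers, partitions));
    }
}
```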
Based on this, Kafka's automatic rebalancing keeps the partitions evenly spread across the consumers handling them, and if a consumer fails, the remaining consumers take over its partitions through another rebalance.
However, although the "make a partition within a topic the smallest unit of parallelism" strategy reduces complexity, it also coarsens the balancing granularity: a single partition with a lot of data cannot be split across consumers, because a partition can have only one consumer. Therefore the producer side must keep the load on each partition balanced.
The key to the design: since the offset consumed by the group is recorded per partition, the consumer reading a partition can be switched at any time. A rebalance is therefore just a simple re-allocation, with nothing else to take care of.
However, during a rebalance some data may be read more than once.
The reason is that, to tolerate consumer instability, the offset is committed only after the data has been processed, so a consumer crash does not lose data.
But if a rebalance happens before the offset is committed, the consumer that takes over will read the same data again...
Partition ownership competition due to notification timing
When there are multiple consumers within a group, each of them will be notified of a broker or a consumer change.
However, the notification may come at slightly different times at the consumers.
So, it is possible that one consumer tries to take ownership of a partition still owned by another consumer. When this happens, the first consumer simply releases all the partitions that it currently owns, waits a bit and retries the rebalance process. In practice, the rebalance process often stabilizes after only a few retries.
3.3 Delivery Guarantees
In general, Kafka only guarantees at-least-once delivery. Exactly-once delivery typically requires two-phase commit and is not necessary for our applications.
This is not the focus of Kafka, so it avoids a commit mechanism such as two-phase commit. Most of the time delivery is effectively exactly once, but if a consumer crashes after reading data and before updating the offset in ZooKeeper, the consumer that takes over may read some duplicate data.
Kafka guarantees that messages from a single partition are delivered to a consumer in order. However, there is no guarantee on the ordering of messages coming from different partitions.
To avoid log corruption, Kafka stores a CRC for each message in the log. The CRC is used to detect network errors and corrupted data.
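As a rough illustration (not Kafka's actual on-disk message format), a per-message checksum with Java's built-in CRC32 might look like this:

```java
import java.util.zip.CRC32;

// Sketch: store a CRC alongside each message payload and verify it on read,
// so corrupted messages can be detected and skipped.
final class MessageCrc {
    static long checksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return crc.getValue();
    }

    static boolean isValid(byte[] payload, long storedCrc) {
        return checksum(payload) == storedCrc;
    }
}
```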
If a broker goes down, any message stored on it not yet consumed becomes unavailable. If the storage system on a broker is permanently damaged, any unconsumed message is lost forever.
So if a broker crashes, data loss may occur.
In the future, we plan to add built-in replication in Kafka to redundantly store each message on multiple brokers.
4. Kafka usage at LinkedIn
We have one Kafka cluster co-located with each datacenter where our user-facing services run.
First, a Kafka cluster runs in each datacenter hosting the live services, collecting data locally.
The frontend services generate various kinds of log data and publish it to the local Kafka brokers in batches.
We rely on a hardware load-balancer to distribute the publish requests to the set of Kafka brokers evenly.
It is important to distribute the publish requests evenly, because an imbalance at this stage cannot be compensated for later.
The online consumers of Kafka run in services within the same datacenter.
Against this cluster, online consumers perform real-time analysis.
We also deploy a cluster of Kafka in a separate datacenter for offline analysis, located geographically close to our Hadoop cluster and other data warehouse infrastructure.
That is, a Kafka cluster dedicated to offline analysis is built near the Hadoop cluster and the data warehouse.
This instance of Kafka runs a set of embedded consumers to pull data from the Kafka instances in the live datacenters.
The consumer itself can be another Kafka cluster, which is a quite creative usage...
We then run data load jobs to pull data from this replica cluster of Kafka into hadoop and our data warehouse, where we run various reporting jobs and analytical process on the data.