1. Why should there be Kafka? [from HRQ]
Kafka is a messaging system originally developed at LinkedIn as the basis for LinkedIn's activity stream and its operational data processing pipeline. It is now used primarily as a data pipeline and messaging system.
Reasons Kafka came about:
• Traditional statistical analysis of log files works well for offline processing (such as reporting and batch jobs), but its latency is far too high for real-time processing, and it carries a high degree of operational complexity.
• Existing message queuing systems are well suited to real-time or near-real-time use, but their data persistence is weak, which makes them a poor fit for offline systems such as Hadoop.
Kafka is designed to combine the strengths of both: what traditional log files do well (offline processing) and what existing message queuing systems do well (online processing). Its goal is to be a single queuing platform that supports both offline and online use cases.
Appendix Description - What are activity stream data and operational data?
Activity stream data: the data that nearly every site uses to report on its own usage. It includes things such as page views, information about the content being viewed, and search queries. This data is typically handled by writing each activity to a log file and then periodically running statistical analysis over those files.
Operational data: server performance data (CPU and I/O utilization, request latency, service logs, and so on). A wide variety of statistical methods are applied to operational data.
2. Main design elements
a) Kafka is designed with persistent messages as the common use case.
b) The main design constraint is throughput rather than features.
c) State about which data has been consumed is kept by the data consumer, not stored on the server.
d) Kafka is an explicitly distributed system: it assumes that producers, brokers, and consumers are spread across multiple machines.
3. Kafka system features
Kafka system workflow
The workflow of a typical messaging system is as follows: a message is published by a producer on a topic, meaning the message is physically sent to a server acting as the broker (possibly a different machine). One or more consumers subscribe to the topic, and every message the producer publishes is delivered to all subscribers.
Kafka changes this on the consumer side: the single consumer is generalized to a consumer group, so both queue and topic semantics are supported. (This is the same idea as a "consumer cluster" in traditional message queuing systems.) A minimal sketch of the workflow follows.
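For concreteness, here is a minimal sketch of this workflow using the modern Kafka Java client (which post-dates the original article); the broker address localhost:9092, the topic page-views, and the group id reporting are illustrative assumptions, not anything prescribed by the article.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class WorkflowSketch {
    public static void main(String[] args) {
        // A producer publishes a message on a topic; the broker persists it.
        Properties pp = new Properties();
        pp.put("bootstrap.servers", "localhost:9092");
        pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(pp)) {
            producer.send(new ProducerRecord<>("page-views", "member-42", "GET /profile"));
        }

        // Consumers sharing a group.id form one consumer group: within a group each message
        // goes to exactly one member (queue semantics), while every distinct group receives
        // its own full copy of the topic (topic semantics).
        Properties cp = new Properties();
        cp.put("bootstrap.servers", "localhost:9092");
        cp.put("group.id", "reporting");
        cp.put("auto.offset.reset", "earliest");
        cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        r.partition(), r.offset(), r.value());
            }
        }
    }
}
```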
Figure 1: Topologies formed by the systems after deployment at LinkedIn
a) Persistence (that is, the storage and caching of messages)
For persistence, Kafka relies only on the operating system's file cache. Kafka's position is that there is no need to cache any data in process memory: the OS file cache is already effective and powerful (in some cases sequential disk access can be faster than random memory access), and using the filesystem and relying on the page cache is better than maintaining an in-memory cache or any other structure. So Kafka persists messages purely through the OS file cache (sequential log reads and writes). Unlike typical messaging systems, which persist via "memory cache + OS file cache", Kafka's persistence relies entirely on the filesystem.
Kafka's persistence design
Instead of keeping as much data as possible in memory and flushing it to the filesystem only when necessary, Kafka does the exact opposite: all data is immediately written to a persistent log on the filesystem, but without forcing a flush. In effect the data is transferred into the OS kernel's page cache, and the OS later flushes it to disk. A sketch of this write path follows the list of benefits below.
Benefits of persisting via the OS file cache:
• The cache stays warm across a service restart. An in-process memory cache, by contrast, must be rebuilt after the process restarts, or the process has to start with a completely empty cache.
• The code is greatly simplified, because all the logic for keeping the cache consistent with the filesystem now lives in the OS, which does this more efficiently and more correctly than a one-off cache built inside the process.
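The following is a minimal sketch (not Kafka's actual log code) of the write path described above, assuming a hypothetical segment file name: records are appended sequentially through a FileChannel and no flush is forced, so the OS page cache decides when the bytes reach the disk.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PageCacheLogSketch {
    public static void main(String[] args) throws IOException {
        try (FileChannel log = FileChannel.open(Path.of("segment-00000.log"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ByteBuffer record = ByteBuffer.wrap(
                    "page_view\tmember-42\t/profile\n".getBytes(StandardCharsets.UTF_8));
            log.write(record);  // the bytes land in the kernel page cache, not a process-level cache
            // Deliberately no log.force(true): flushing to the physical disk is left to the OS,
            // which is the behaviour described in the section above.
        }
    }
}
```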
b) Maximizing efficiency
When optimizing the system, Kafka focuses on message consumption rather than message production. Two common causes of inefficiency are excessive network requests and large amounts of byte copying. Kafka improves efficiency in two ways: it organizes messages into message sets for batched storage and delivery, and it uses zero-copy transfers to reduce serialization and copying overhead.
• The API is designed around this "message set" abstraction. Instead of sending a single message at a time, messages are grouped naturally and sent (that is, network requests are made) in groups.
• Zero-copy scheme: when sending data, bytes are transferred from the disk file through the page cache directly to the socket, instead of passing through the process's own memory (see the appendix and the sketch below). Kafka uses advanced I/O facilities such as sendfile (corresponding to FileChannel.transferTo/transferFrom in Java) to reduce copy overhead.
Appendix Description - Data path for transferring data from a file to a socket:
1) The operating system reads the data from disk into the page cache in kernel space
2) The application reads the data from kernel space into a buffer in user space (Kafka skips this step)
3) The application writes the data back into kernel space, into the socket buffer (Kafka skips this step)
4) The operating system copies the data from the socket buffer to the buffer of the network interface card (NIC), from which it is sent over the network
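A minimal sketch of the zero-copy path referenced above: FileChannel.transferTo maps to sendfile on Linux, so the bytes go from the page cache to the socket buffer and on to the NIC without the two user-space steps that Kafka skips. The host, port, and file name are illustrative assumptions.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySketch {
    public static void main(String[] args) throws IOException {
        try (FileChannel file = FileChannel.open(Path.of("segment-00000.log"), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("consumer-host", 9999))) {
            long position = 0;
            long remaining = file.size();
            // transferTo hands the copy to the kernel (sendfile), bypassing user-space buffers.
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```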
c) End-to-end compression
Even without support from Kafka, users could always compress messages themselves before transferring them, but the compression ratio would be poor: efficient compression requires compressing many messages together rather than compressing each message individually.
Kafka provides end-to-end compression support through recursive message sets: a batch of messages is compressed by the producer before it is sent, stored on the server in its compressed form, and decompressed only by the final consumer of the messages.
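As an illustration, the modern Java producer exposes this as a configuration option: the client compresses whole record batches, the broker stores them compressed, and only the consumer decompresses them. This sketch assumes a local broker and a hypothetical topic name.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CompressionSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("compression.type", "gzip"); // whole batches are compressed, not single messages
        props.put("linger.ms", "50");          // wait briefly so batches (and compression ratios) grow
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("activity-events", "event-" + i));
            }
        }
    }
}
```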
d) Consumer state tracking (metadata maintenance)
Kafka does two unusual things with this metadata:
1) In most messaging systems, the metadata that tracks consumer state is maintained by the server. In Kafka, however, it is maintained by the consumer (client) rather than by the broker (server). The consumer saves this state in the same datastore that holds its message-processing results, rather than in ZooKeeper.
Appendix Description - Problems when the broker (server) maintains the metadata:
If the client never actually receives a message that has already been sent out (and recorded as consumed), that message is lost. To solve this, typical messaging systems add an acknowledgement mechanism (similar to a two-way handshake), but this creates new problems:
• If the client has received a message but the server never receives the acknowledgement, the same message may be delivered twice;
• Performance drops, because the broker has to maintain several states for every individual message.
Appendix Description - Benefits of the consumer storing its state together with its message-processing results:
• Both can be updated in the same transaction, which eliminates the distributed-consistency problem and keeps the message offset and processing state in sync;
• The consumer can deliberately rewind to an earlier offset and consume previously processed data again (see the sketch below).
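A minimal sketch of this consumer-side state tracking with the modern Java client: automatic offset commits are disabled, the offset is stored by the application alongside its processing results, and seek() rewinds to a previously saved offset. The helpers loadSavedOffset, process, and saveResultAndOffset are hypothetical stand-ins for the application's own datastore logic.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class OffsetTrackingSketch {
    public static void main(String[] args) {
        Properties cp = new Properties();
        cp.put("bootstrap.servers", "localhost:9092");
        cp.put("group.id", "reporting");
        cp.put("enable.auto.commit", "false");  // the broker is not asked to track consumption
        cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        TopicPartition tp = new TopicPartition("page-views", 0);
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp)) {
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, loadSavedOffset(tp));  // rewind/resume from our own datastore
            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                // Result and next offset are written in one local transaction, keeping them in sync.
                saveResultAndOffset(process(r.value()), r.offset() + 1);
            }
        }
    }

    // Hypothetical helpers standing in for the datastore that also holds processing results.
    static long loadSavedOffset(TopicPartition tp) { return 0L; }
    static String process(String value) { return value.toUpperCase(); }
    static void saveResultAndOffset(String result, long nextOffset) { /* one atomic write */ }
}
```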
2) The broker divides the data stream into a set of separate partitions. The semantics of these partitions are defined by the producer, which specifies which partition each message belongs to. The advantage is that the broker does not need to keep metadata for every single message: consumer state is saved per (consumer, topic, partition), which greatly reduces the burden of tracking per-message state.
Appendix Description - What semantic partitioning means:
For example, suppose we want to count, per member, the total number of visitors to that member's personal space.
All space-visit event messages for a given member should be sent to the same partition, so that all updates for that member appear in the same event stream and are handled by the same consumer thread. Kafka achieves this with a semantic partitioning function on the producer: the message stream is partitioned by one of the key values in the message, and the different partitions are sent to their respective brokers, as sketched below.
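A minimal sketch of semantic partitioning on the producer side, assuming a hypothetical topic space-visits: using the member id as the record key makes the default partitioner hash the key, so every visit event for the same member lands in the same partition and is processed by the same consumer thread.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SemanticPartitionSketch {
    public static void main(String[] args) {
        Properties pp = new Properties();
        pp.put("bootstrap.servers", "localhost:9092");
        pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(pp)) {
            String memberId = "member-42";  // the partitioning key: one member, one partition
            producer.send(new ProducerRecord<>("space-visits", memberId, "visitor=member-7"));
            producer.send(new ProducerRecord<>("space-visits", memberId, "visitor=member-9"));
        }
    }
}
```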
e) Push or pull: which is better?
• The two approaches:
Pull: the consumer pulls data down from the broker.
Push: the broker pushes data to the consumer.
• Comparing the two:
Pull is the more common approach. It suits the case where the consumer's processing lags slightly behind, and it lets the consumer catch up later at its own pace; this avoids overloading the consumer and lets each consumer consume data at its maximum sustainable rate.
Push is used less often; systems that use it include Scribe and Flume. In a push system, consumers tend to be overwhelmed whenever the rate of consumption falls below the rate of production.
• Kafka's design choice (the traditional approach)
Kafka delivers messages with a client-driven pull model, which greatly reduces the load on the server: the producer pushes data to the broker, and the consumer then pulls data down from the broker (see the sketch below).
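A minimal sketch of the pull loop, in the same style as the earlier sketches: the consumer decides when to ask the broker for more data and how much, so a slow consumer simply falls behind and catches up later instead of being overloaded. The max.poll.records setting bounds each pulled batch; handle() is a hypothetical processing step.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PullLoopSketch {
    public static void main(String[] args) {
        Properties cp = new Properties();
        cp.put("bootstrap.servers", "localhost:9092");
        cp.put("group.id", "reporting");
        cp.put("max.poll.records", "500");  // upper bound on how much is pulled per request
        cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                // The consumer pulls at its own pace; nothing is pushed at it.
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
                    handle(r.value());  // processing speed here sets the consumption rate
                }
            }
        }
    }

    static void handle(String value) { System.out.println(value); }
}
```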
f) Producers
• Automatic load balancing: tries to even out the number of messages each broker receives. Kafka uses ZooKeeper for load balancing, allowing producers to discover new brokers dynamically and to balance load by request count.
• Semantic partitioning of data streams: data is partitioned by certain key values rather than divided randomly, which preserves its association for the consumer. A message stream is identified by node / topic / partition.
• Asynchronous send: buffering and sending asynchronously produces smoother, more uniform traffic, and therefore better network utilization and higher throughput (see the sketch below).
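A minimal sketch of the asynchronous send path with the modern Java producer: send() buffers the record and returns immediately, records accumulate in a client-side buffer governed by batch.size and linger.ms, and an optional callback reports the outcome. The broker address and topic name are illustrative assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AsyncSendSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", "32768");  // accumulate up to 32 KB per partition before sending
        props.put("linger.ms", "20");      // or wait up to 20 ms, whichever comes first
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100; i++) {
                // send() is asynchronous: it buffers the record and returns without blocking.
                producer.send(new ProducerRecord<>("activity-events", "event-" + i),
                        (metadata, exception) -> {
                            if (exception != null) {
                                exception.printStackTrace();
                            }
                        });
            }
            producer.flush();  // drain the client-side buffer before closing
        }
    }
}
```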
g) Support for Hadoop and other bulk data loads
The scalable persistence scheme lets Kafka support bulk data loading: snapshot data can be periodically loaded into offline systems for batch processing. LinkedIn uses this feature to load data into its data warehouse and Hadoop clusters. This makes Kafka a viable solution when the same log data must serve both an offline analysis system such as Hadoop and consumers with real-time processing requirements. Kafka's purpose is to unify online and offline message processing: Hadoop's parallel loading mechanism handles the offline side, while the broker cluster serves real-time consumption.