First, some important design ideas of Kafka:
1. Consumer group: consumers can be organized into groups; within a group, each message is consumed by only one consumer. If a message needs to be consumed by multiple consumers, those consumers must belong to different groups.
2. Message status: in Kafka, message state is maintained on the consumer side. The broker does not track which message has been consumed by whom; it records only an offset value (pointing to the next message in the partition to be consumed). This means that if a consumer mishandles its offset, a message on the broker may be consumed multiple times.
3. Message persistence: in Kafka, messages are persisted to the local file system.
4. Message expiration: Kafka retains messages for a long time so that consumers can consume them multiple times; many of the details are configurable.
5. Batch sending: Kafka supports sending messages in batches (message sets) to improve transmission efficiency.
6. Relationship between brokers in a Kafka cluster: there is no master-slave relationship; every broker in the cluster has equal status, and broker nodes can be added or removed at will.
7. Partitioning mechanism (partition): Kafka partitions messages on the broker side. The producer can decide which partition a message is sent to; within a partition, messages are kept in the order the producer sent them. A topic can have multiple partitions, and the exact number is configurable. Partitioning is very significant, as later sections will gradually show.
Second, Kafka architecture components
Topic: the category under which messages are stored is the topic.
Producer: the party that publishes messages to a topic.
Consumer: the party that subscribes to a topic and consumes its messages.
Broker: each Kafka service instance is a broker.
Third, Kafka topic & partition
Each Kafka topic can be configured with multiple partitions.
Fourth, Kafka core components
1. Replications, partitions, and leaders
Data in Kafka is persistent and fault-tolerant: each Kafka topic can be configured with multiple replicas, stored on different brokers.
Topics in Kafka are stored in the form of partitions; each topic can set its own partition count, and the number of partitions determines the number of logs that make up the topic. When producing data, the producer publishes messages to the topic's partitions according to certain rules, and these rules can be customized. The replicas mentioned above all exist at the partition level, but only one replica of each partition is elected leader for reading and writing.
Factors to consider when setting the partition count: a partition can only be consumed by one consumer (while a single consumer can consume multiple partitions at the same time), so if the partition count is less than the number of consumers, some consumers will be unable to consume any data. The recommended partition count is therefore at least the number of concurrently running consumers. It is also recommended that the partition count exceed the number of brokers in the cluster, so that leader partitions can be distributed evenly across brokers and the cluster load stays balanced. At Cloudera, every topic has hundreds of partitions. Note that Kafka needs to allocate some memory for each partition to cache message data; the larger the partition count, the larger the heap space Kafka needs.
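As a minimal sketch of these sizing rules in practice, the snippet below creates a topic whose partition count exceeds both the consumer count and the broker count. It assumes the modern Java AdminClient and a cluster reachable at localhost:9092; the topic name and counts are illustrative only:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder address; point this at your own cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions (more than the concurrent consumers and the brokers,
            // per the guidance above); replication factor 3 for fault tolerance.
            NewTopic topic = new NewTopic("example-topic", 6, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```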
2. Producers
The producer sends messages directly to the leader partition on the broker, without passing through any intermediate routing layer. To make this possible, every broker in the Kafka cluster can answer a producer's request for metadata, returning information such as which brokers are alive and where each topic's leader partitions are, so that the producer can reach the leader partitions directly.
The producer client controls which partition each message is pushed to. The method can be random distribution (a form of random load balancing) or a specified partitioning algorithm. Kafka provides an interface for users to implement custom partitioners: a user can assign a partition key to each message and use that key in a hash partitioning algorithm. For example, with UserID as the partition key, all messages with the same UserID will be pushed to the same partition.
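A sketch of such a custom partitioner is shown below, using the modern Java client's Partitioner interface (the 0.8-era API this article describes expressed it differently); the class name and hashing choice are illustrative:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Hash partitioner sketch: messages carrying the same UserID key
// always land in the same partition, as described above.
public class UserIdPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0; // no key: fall back to partition 0 (illustrative choice)
        }
        // murmur2 hash of the key bytes, mapped onto the partition range
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```

The class would be registered via the producer's partitioner.class configuration setting.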
Pushing data in batches can greatly improve processing efficiency. A Kafka producer can accumulate a certain number of messages in memory and then send them to the broker as a single batch request. The batching behavior is controlled by producer parameters, which can be expressed as a message count (such as 500), a time interval (such as 100 ms), or a data size (such as 64 KB). Increasing the batch size reduces the number of network requests and disk I/O operations, although the concrete settings require a trade-off between efficiency and timeliness.
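In the modern Java producer these thresholds are expressed in bytes and milliseconds (batch.size and linger.ms); the message-count form existed in the older client. A minimal configuration sketch (the helper method is illustrative):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class BatchingConfig {
    static Properties withBatching(Properties props) {
        // Send a batch once ~64 KB has accumulated for a partition...
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);
        // ...or after at most 100 ms, whichever comes first.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 100);
        return props;
    }
}
```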
Producers can send messages to Kafka asynchronously and in parallel, but the producer usually receives a future as the response to a send, from which it can obtain the assigned offset or any error encountered during the send. There is a very important parameter, acks, which determines how many acknowledgments the producer requires from the leader partition for a request. If acks is set to 0, the producer does not wait for any response from the broker, so it cannot tell whether a message was sent successfully and data may be lost; on the other hand, acks = 0 yields the maximum system throughput.
If acks is set to 1, the producer gets an acknowledgment once the leader partition has received the message. This is more reliable, because the client waits until the broker confirms receipt. If acks is set to -1, the producer gets an acknowledgment only after all replica partitions have received the message; this setting provides the strongest reliability guarantee.
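The sketch below, a minimal and illustrative producer using the Java client (broker address and topic name are placeholders), shows the acks setting and the future returned by send:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // "0" = don't wait, "1" = leader ack, "all" (equivalent to -1) = all replicas ack.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() returns a future; get() blocks for the offset
            // or throws any error encountered during the send.
            RecordMetadata meta =
                producer.send(new ProducerRecord<>("example-topic", "key", "value")).get();
            System.out.println("stored at offset " + meta.offset());
        }
    }
}
```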
A Kafka message consists of a fixed-length header and a variable-length byte array. Because the payload is just a byte array, Kafka can support any user-defined serialization format as well as existing formats such as Apache Avro and Protobuf. Kafka does not limit the size of a single message, but it is recommended that a message not exceed 1 MB; typical messages are between 1 and 10 KB.
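For instance, with the Java client's ByteArraySerializer (a sketch; the payload shown is a plain UTF-8 string, but Avro- or Protobuf-encoded bytes would be sent the same way):

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class ByteArrayPayloadExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            // Kafka only sees bytes; the serialization format is up to the user.
            byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
            producer.send(new ProducerRecord<>("example-topic", payload));
        }
    }
}
```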
3. Consumers
Kafka provides two consumer APIs: a high-level API and a simple (low-level) API. The simple API maintains a connection to a single broker and is completely stateless: every request must specify an offset value, which also makes this API the most flexible.
In Kafka, the offset of the message currently being read is maintained by the consumer, so the consumer can decide for itself how to read data from Kafka. For example, a consumer can re-consume data it has already processed simply by resetting the offset. Whether or not messages have been consumed, Kafka retains them for a configurable period and deletes them only after they expire.
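A sketch of such a replay with the modern Java consumer (the original low-level API expressed this differently; topic, partition, and broker address are placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("example-topic", 0);
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 0L); // reset the offset: re-consume from the beginning
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.println(r.offset() + ": " + r.value());
            }
        }
    }
}
```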
The high-level API encapsulates access to a group of brokers in the cluster and can be used to consume a topic transparently. It maintains the consumed-message state itself, so each read returns the next message in sequence.
The high-level API also supports consuming a topic as a group. If the consumers share the same group name, Kafka behaves like a queue-based messaging service, with each consumer evenly consuming the data in its assigned partitions. If the consumers have different group names, Kafka behaves like a broadcast service, broadcasting every message in the topic to each consumer.
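A minimal sketch with the modern Java consumer (group name, topic, and broker address are placeholders) illustrating how the choice of group.id selects queue or broadcast semantics:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Same group.id across instances -> queue semantics (partitions are
        // divided among them). A unique group.id per instance -> broadcast
        // semantics (every instance sees every message).
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("example-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            System.out.println("received " + records.count() + " records");
        }
    }
}
```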
Fifth, Kafka core features
1. Compression
As noted above, Kafka supports sending messages as sets (batches); on top of this, Kafka also supports compressing message sets. The producer side can compress them in gzip or snappy format; after compression on the producer side, the consumer side must decompress. The advantage of compression is that it reduces the amount of data transmitted and relieves pressure on the network; in big-data processing, the bottleneck is often the network rather than the CPU (although compression and decompression do consume some CPU resources).
How, then, can a consumer tell whether a message is compressed or not? Kafka adds a compression-attribute byte to the message header; the last two bits of this byte indicate the message's compression codec, and if those two bits are 0, the message is uncompressed.
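In the Java producer, compression is a single setting, sketched below as an addition to the producer configuration shown earlier (the helper method is illustrative):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class CompressionConfig {
    static Properties withCompression(Properties props) {
        // Compress whole message sets on the producer side; the consumer
        // decompresses transparently based on the codec bits in the header.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip"); // or "snappy"
        return props;
    }
}
```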
2. Message Reliability
From the producer side: Kafka handles this by having the producer wait for the broker's acknowledgment of successful receipt after sending a message (the wait time can be controlled through parameters). If the message is lost in transit or one of the brokers goes down, the producer resends it (as we know, Kafka has a replication mechanism, and parameters control whether to wait for all replica nodes to receive the message).
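Sketched with the modern Java producer's settings (an illustrative helper; the retry count is arbitrary):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ReliableProducerConfig {
    static Properties withRetries(Properties props) {
        // Wait for every replica to acknowledge before a send counts as successful.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Resend automatically if a broker fails or the message is lost in transit.
        props.put(ProducerConfig.RETRIES_CONFIG, 3);
        return props;
    }
}
```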
From the consumer side: the broker records an offset value for each partition, pointing to the next message the consumer will consume. If the consumer receives a message but crashes while processing it, it can locate that message again through this offset value. The consumer also has control over the offset and can process the messages persisted on the broker side in any way it chooses.
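One common way to exploit this for at-least-once processing, sketched with the modern Java consumer (topic, group, and the processing step are placeholders), is to commit offsets manually only after processing succeeds:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "reliable-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Commit offsets only after processing, so a crash mid-processing
        // means the message is re-read rather than lost.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("example-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    process(r);        // hypothetical processing step
                }
                consumer.commitSync(); // advance the stored offset
            }
        }
    }

    static void process(ConsumerRecord<String, String> r) {
        System.out.println(r.offset() + ": " + r.value());
    }
}
```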