Kafka is a messaging system originally developed at LinkedIn and contributed to the Apache Foundation, where it is now a top-level project. Kafka was initially used as the basis for LinkedIn's activity stream and operational data processing pipeline. It features scalability, high throughput, and durability, along with strong support for partitioning, replication, and fault tolerance.
Key design decisions of Kafka
1. At the design stage, Kafka treats message persistence as the common case.
2. Kafka's primary design constraint is throughput rather than features.
3. Kafka keeps the state about which data has been consumed on the consumer side rather than on the server.
4. Kafka is an explicitly distributed system: it assumes that producers, brokers, and consumers are spread across multiple machines.
In contrast, traditional message queues do not support these scenarios well (for example, large backlogs of unprocessed data cannot be effectively persisted). Kafka provides two guarantees for data availability:
(1) Messages sent by a producer to a Topic partition are appended in the order they are sent, and consumers receive them in that same order.
(2) If a Topic is configured with a replication factor of N, up to N-1 servers can crash without losing any committed messages.
Several key terms in Kafka
Topic: Kafka classifies messages into different categories; each category of messages is called a Topic.
Producer: An object that publishes messages to a Topic is called a producer.
Consumer: An object that subscribes to Topics and processes the published messages is called a consumer.
Broker: Published messages are stored in a group of servers called a Kafka cluster. Each server in the cluster is a Broker. Consumers can subscribe to one or more Topics and pull data from the Brokers to consume the published messages.
Topic in Kafka
A Topic is the category or feed name to which messages are published. For each Topic, the Kafka cluster maintains a partitioned log (figure: Kafka Cluster, partitioned Topic log).
Each partition is an ordered, immutable sequence of messages that is continually appended to. Each message in a partition is assigned a sequential number called the offset, which is unique within that partition.
The Kafka cluster retains all messages until they expire, regardless of whether they have been consumed.
In fact, the only metadata held by a consumer is its offset, that is, the consumer's position in the log. The offset is controlled by the consumer: normally it advances linearly as the consumer reads messages, but the consumer can reset it to an older value and re-read messages.
This design makes consumers cheap: one consumer's operations do not affect how other consumers process the same log.
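The offset semantics above can be sketched as a toy model in Python. This is an illustration of the concept only, not Kafka's implementation: the partition is an append-only list, and the consumer's only state is its position.

```python
# Toy model of a partition log with consumer-held offsets
# (conceptual sketch; not Kafka's actual implementation).

class Partition:
    def __init__(self):
        self.log = []  # append-only message list; list index == offset

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1  # offset assigned to the new message

class Consumer:
    def __init__(self, partition):
        self.partition = partition
        self.offset = 0  # the only metadata the consumer keeps

    def poll(self):
        """Read the next message, advancing the offset linearly."""
        if self.offset >= len(self.partition.log):
            return None
        msg = self.partition.log[self.offset]
        self.offset += 1
        return msg

    def seek(self, offset):
        """Reset to an older offset to re-read messages."""
        self.offset = offset

p = Partition()
for m in ["m0", "m1", "m2"]:
    p.append(m)

c = Consumer(p)
first = [c.poll(), c.poll(), c.poll()]  # reads m0, m1, m2 in order
c.seek(1)                               # rewind
replay = [c.poll(), c.poll()]           # re-reads m1, m2
```

Since each consumer holds its own offset, a second consumer of the same partition would be entirely unaffected by this rewind.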
Let's talk about partitions. The partition design in Kafka serves several purposes:
1. The log can hold and process more messages, without being limited by a single server: a Topic with multiple partitions can scale to more data.
2. Partitions act as units of parallelism.
The partitions of a Topic's log are distributed across the servers in the cluster, and each server handles the partitions it holds. Depending on the configuration, each partition can be replicated to other servers for fault tolerance.
Each partition has one leader and zero or more replicas. The leader handles all read and write requests for the partition, while the replicas passively copy its data. If the leader goes down, one of the replicas is elected as the new leader.
One server may be the leader of one partition and a replica of another. This balances the load and prevents all requests from being handled by only one or a few servers.
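This balancing can be sketched with a simple round-robin replica assignment. The sketch below is a simplified illustration, not the exact algorithm Kafka uses (Kafka also randomizes the starting broker); it just shows how leadership ends up spread evenly across brokers.

```python
# Simplified round-robin replica assignment across brokers
# (illustrative sketch; Kafka's real assignment also randomizes
# the starting broker for better balance across topics).

def assign_replicas(brokers, num_partitions, replication_factor):
    assignment = {}
    n = len(brokers)
    for p in range(num_partitions):
        # the first replica in the list acts as the partition leader
        replicas = [brokers[(p + r) % n] for r in range(replication_factor)]
        assignment[p] = replicas
    return assignment

# 3 brokers, 3 partitions, replication factor 2:
layout = assign_replicas([1, 2, 3], 3, 2)
leaders = [replicas[0] for replicas in layout.values()]
# Each broker leads exactly one partition and follows another,
# so no single server handles all requests.
```

With three brokers and three partitions, each broker is the leader of one partition and a replica of another, which matches the load-balancing behavior described above.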
For details about the replication design, refer to the following translation of the official documentation:
Kafka cluster replication Design
Kafka cluster deployment
Kafka has three main deployment modes:
Standalone broker Mode
Standalone multi-broker mode (pseudo-distributed)
Multi-machine and multi-broker mode (cluster)
As with Hadoop, the first two modes are mostly used for development and testing; the third is used in actual production. The following describes how to deploy a Kafka cluster on three nodes.
Install by decompressing the archive directly:
tar xzvf kafka_2.10-0.8.1.1.tgz
mkdir /var/kafka && mkdir /var/zookeeper
For key parameter settings, refer to http://debugo.com/kafka-params/
vim kafka_2.10-0.8.1.1/config/server.properties
# In the default configuration, I modified only three settings. The three hosts debugo01, debugo02, and debugo03 use broker IDs 1, 2, and 3 respectively (the file below is from debugo03).
broker.id=3
log.dirs=/var/kafka
zookeeper.connect=debugo01:2181,debugo02:2181,debugo03:2181
Configure ZooKeeper: modify dataDir and add the cluster parameters.
vim kafka_2.10-0.8.1.1/config/zookeeper.properties
initLimit=5
syncLimit=2
server.1=debugo01:2888:3888
server.2=debugo02:2888:3888
server.3=debugo03:2888:3888
dataDir=/var/zookeeper
# Write the myid file (1, 2, 3 respectively) on the three hosts:
echo "1" > /var/zookeeper/myid
Start zookeeper and kafka Server on debugo01, debugo02, and debugo03 respectively.
bin/zookeeper-server-start.sh config/zookeeper.properties
# Start the Kafka server
bin/kafka-server-start.sh config/server.properties
At this point you can see in the logs that each new broker has registered itself in a znode.
#####debugo01#####
[2014-12-07 20:54:20,506] INFO Awaiting socket connections on debugo01:9092. (kafka.network.Acceptor)
[2014-12-07 20:54:20,521] INFO [Socket Server on Broker 1], Started (kafka.network.SocketServer)
[2014-12-07 20:54:20,649] INFO Will not load MX4J, mx4j-tools.jar is not in the classpath (kafka.utils.Mx4jLoader$)
[2014-12-07 20:54:20,725] INFO 1 successfully elected as leader (kafka.server.ZookeeperLeaderElector)
[2014-12-07 20:54:20,876] INFO Registered broker 1 at path /brokers/ids/1 with address debugo01:9092. (kafka.utils.ZkUtils$)
[2014-12-07 20:54:20,907] INFO [Kafka Server 1], started (kafka.server.KafkaServer)
[2014-12-07 20:54:20,993] INFO New leader is 1 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
#####debugo02#####
[2014-12-07 20:54:35,896] INFO Awaiting socket connections on 0.0.0.0:9092. (kafka.network.Acceptor)
[2014-12-07 20:54:35,913] INFO [Socket Server on Broker 2], Started (kafka.network.SocketServer)
[2014-12-07 20:54:36,073] INFO Will not load MX4J, mx4j-tools.jar is not in the classpath (kafka.utils.Mx4jLoader$)
[2014-12-07 20:54:36,179] INFO conflict in /controller data: {"version":1,"brokerid":2,"timestamp":"1417956876081"} stored data: {"version":1,"brokerid":1,"timestamp":"1417956860689"} (kafka.utils.ZkUtils$)
[2014-12-07 20:54:36,398] INFO Registered broker 2 at path /brokers/ids/2 with address debugo02:9092. (kafka.utils.ZkUtils$)
[2014-12-07 20:54:36,420] INFO [Kafka Server 2], started (kafka.server.KafkaServer)
#####debugo03#####
[2014-12-07 20:54:43,535] INFO Awaiting socket connections on 0.0.0.0:9092. (kafka.network.Acceptor)
[2014-12-07 20:54:43,549] INFO [Socket Server on Broker 3], Started (kafka.network.SocketServer)
[2014-12-07 20:54:43,728] INFO Will not load MX4J, mx4j-tools.jar is not in the classpath (kafka.utils.Mx4jLoader$)
[2014-12-07 20:54:43,783] INFO conflict in /controller data: {"version":1,"brokerid":3,"timestamp":"1417956883737"} stored data: {"version":1,"brokerid":1,"timestamp":"1417956860689"} (kafka.utils.ZkUtils$)
[2014-12-07 20:54:43,999] INFO Registered broker 3 at path /brokers/ids/3 with address debugo03:9092. (kafka.utils.ZkUtils$)
[2014-12-07 20:54:44,018] INFO [Kafka Server 3], started (kafka.server.KafkaServer)
Partition and replication of topics
1. Create topic debugo01. This topic has 3 partitions and a replication factor of 1 (no replicas), and it spans all brokers. The following management commands can be executed on any Kafka node.
bin/kafka-topics.sh --create --zookeeper debugo01,debugo02,debugo03 --replication-factor 1 --partitions 3 --topic debugo01
Created topic "debugo01".
2. Create topic debugo02. This topic has 1 partition and a replication factor of 3 (one replica on each host), and it also spans all brokers.
bin/kafka-topics.sh --create --zookeeper debugo01,debugo02,debugo03 --replication-factor 3 --partitions 1 --topic debugo02
3. List topic information
[root@debugo01 kafka_2.10-0.8.1.1]# bin/kafka-topics.sh --list --zookeeper localhost:2181
debugo01
debugo02
4. List topic descriptions
[root@debugo01 kafka_2.10-0.8.1.1]# bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic debugo01
Topic:debugo01  PartitionCount:3  ReplicationFactor:1  Configs:
  Topic: debugo01  Partition: 0  Leader: 1  Replicas: 1  Isr: 1
  Topic: debugo01  Partition: 1  Leader: 2  Replicas: 2  Isr: 2
  Topic: debugo01  Partition: 2  Leader: 3  Replicas: 3  Isr: 3
5. Check the log directory. For topic debugo01, host debugo01 holds partition 0 and host debugo02 holds partition 1. Topic debugo02 has three replicas, each being a copy of partition 0.
[root@debugo01 kafka]# ll
total 24
drwxr-xr-x 2 root 4096 Dec 7 debugo01-0
drwxr-xr-x 2 root 4096 Dec 7 debugo02-0
[root@debugo02 kafka]# ll
total 24
drwxr-xr-x 2 root 4096 Dec 7 debugo01-1
drwxr-xr-x 2 root 4096 Dec 7 debugo02-0
# Each partition directory contains the generated index and log files
[root@debugo01 debugo01-0]
6. Next, create topic debugo03 with replication-factor 2 and 3 partitions. debugo01 (broker id 1) will hold partition 0 and partition 1, as the describe output shows.
The replica leader of partition 0 is broker id 3, and the partition has two replicas: 3 and 1.
bin/kafka-topics.sh --create --zookeeper debugo01,debugo02,debugo03 --replication-factor 2 --partitions 3 --topic debugo03
Created topic "debugo03".
[root@debugo01 kafka_2.10-0.8.1.1]# bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic debugo03
Topic:debugo03  PartitionCount:3  ReplicationFactor:2  Configs:
  Topic: debugo03  Partition: 0  Leader: 3  Replicas: 3,1  Isr: 3,1
  Topic: debugo03  Partition: 1  Leader: 1  Replicas: 1,2  Isr: 1,2
  Topic: debugo03  Partition: 2  Leader: 2  Replicas: 2,3  Isr: 2,3
[root@debugo01 kafka_2.10-0.8.1.1]# ll /var/kafka/debugo03*
/var/kafka/debugo03-0:
total 0
-rw-r--r-- 1 root root 10485760 Dec 7 21:34 00000000000000000000.index
-rw-r--r-- 1 root root        0 Dec 7 21:34 00000000000000000000.log
/var/kafka/debugo03-1:
total 0
-rw-r--r-- 1 root root 10485760 Dec 7 21:34 00000000000000000000.index
-rw-r--r-- 1 root root        0 Dec 7 21:34 00000000000000000000.log
Message generation and consumption
Start a producer and a consumer in two terminals for a test.
bin/kafka-console-producer.sh --broker-list debugo01:9092 --topic debugo03
hello kafka
hello debugo
bin/kafka-console-consumer.sh --zookeeper debugo01:2181 --from-beginning --topic debugo03
hello kafka
hello debugo
The following uses the perf tool to test the performance of several topics. You need to first download kafka-perf_2.10-0.8.1.1.jar and copy it into kafka/libs.
500,000 messages of 1000 bytes each, with a batch size of 1000, topic debugo01, and 4 threads (adjust the relevant parameters when the message size is large, otherwise it is easy to run out of memory). The run completed in only about 13 seconds; with multi-partition support, Kafka's throughput is very strong.
bin/kafka-producer-perf-test.sh --messages 500000 --message-size 1000 --batch-size 1000 --topics debugo01 --threads 4 --broker-list debugo01:9092,debugo02:9092,debugo03:9092
start.time, end.time, compression, message.size, batch.size, total.data.sent.in.MB, MB.sec, total.data.sent.in.nMsg, nMsg.sec
2014-12-07 22:07:56:038, 2014-12-07 22:08:09:413, 0, 1000, 1000, 476.84, 35.6514, 500000, 37383.1776
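The MB.sec and nMsg.sec columns can be checked by hand from the raw numbers in the output: 500,000 messages of 1000 bytes each, sent between 22:07:56.038 and 22:08:09.413 (about 13.375 seconds).

```python
# Recomputing the perf tool's reported throughput from the raw numbers.
messages = 500_000
message_size = 1000   # bytes
duration_s = 13.375   # 22:07:56.038 -> 22:08:09.413

total_mb = messages * message_size / (1024 * 1024)
mb_per_sec = total_mb / duration_s
msgs_per_sec = messages / duration_s

print(round(total_mb, 2))    # 476.84, matching total.data.sent.in.MB
print(round(mb_per_sec, 2))  # 35.65, matching MB.sec
print(round(msgs_per_sec))   # 37383, matching nMsg.sec
```

The recomputed values match the tool's reported 476.84 MB, 35.6514 MB/s, and 37383.1776 messages/s.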
The same test on debugo02 takes 39 seconds, because its single partition is combined with replication (replication-factor = 3). Accordingly, increasing the number of partitions and the number of broker-side threads will greatly improve performance.
bin/kafka-producer-perf-test.sh --messages 500000 --message-size 1000 --batch-size 1000 --topics debugo02 --threads 4 --broker-list debugo01:9092,debugo02:9092,debugo03:9092
start.time, end.time, compression, message.size, batch.size, total.data.sent.in.MB, MB.sec, total.data.sent.in.nMsg, nMsg.sec
2014-12-07 22:13:28:840, 2014-12-07 22:14:07:819, 0, 1000, 1000, 476.84, 12.2332, 500000, 12827.4199
The same test on debugo03 takes 30 seconds.
bin/kafka-producer-perf-test.sh --messages 500000 --message-size 1000 --batch-size 1000 --topics debugo03 --threads 4 --broker-list debugo01:9092,debugo02:9092,debugo03:9092
start.time, end.time, compression, message.size, batch.size, total.data.sent.in.MB, MB.sec, total.data.sent.in.nMsg, nMsg.sec
2014-12-07 22:16:04:895, 2014-12-07 22:16:34:715, 0, 1000, 1000, 476.84, 15.9905, 500000, 16767.2703
Similarly, test the consumer's performance.
bin/kafka-consumer-perf-test.sh --zookeeper debugo01,debugo02,debugo03 --messages 500000 --topic debugo01 --threads 3
start.time, end.time, fetch.size, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec
2014-12-07 22:19:04:527, 2014-12-07 22:19:17:184, 1048576, 476.8372, 62.2747, 500000, 65299.7257
bin/kafka-consumer-perf-test.sh --zookeeper debugo01,debugo02,debugo03 --messages 500000 --topic debugo02 --threads 3
start.time, end.time, fetch.size, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec
[2014-12-07 22:19:59,938] WARN [perf-consumer-78853_debugo01-1417961999315-4a5941ef], No broker partitions consumed by consumer thread perf-consumer-78853_debugo01-1417961999315-4a5941ef-1 for topic debugo02 (kafka.consumer.ZookeeperConsumerConnector)
[2014-12-07 22:19:59,938] WARN [perf-consumer-78853_debugo01-1417961999315-4a5941ef], No broker partitions consumed by consumer thread perf-consumer-78853_debugo01-1417961999315-4a5941ef-2 for topic debugo02 (kafka.consumer.ZookeeperConsumerConnector)
2014-12-07 22:20:01:008, 2014-12-07 22:20:08:971, 1048576, 476.8372, 160.9305, 500000, 168747.8907
bin/kafka-consumer-perf-test.sh --zookeeper debugo01,debugo02,debugo03 --messages 500000 --topic debugo03 --threads 3
start.time, end.time, fetch.size, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec
2014-12-07 22:21:27:421, 2014-12-07 22:21:39:918, 1048576, 476.8372, 63.6037, 500002, 66693.6108
Reference
http://blog.csdn.net/smallnest/article/details/38491483
http://www.350351.com/jiagoucunchu/xiaoxixitong/46720.html
http://kafka.apache.org/documentation.html
http://backend.blog.163.com/blog/static/202294126201431723734212/
http://www.inter12.org/archives/842
Original article: Kafka principles and cluster testing.