Installing a Kafka Cluster on CentOS
Kafka is a distributed MQ system developed and open-sourced by LinkedIn, and now a top-level Apache project. On its homepage, Kafka is described as a high-throughput distributed MQ that can distribute messages across different nodes. In a blog post, the authors briefly explained why they developed Kafka instead of adopting an existing MQ system, giving two reasons: performance and scalability. Kafka's core is only about 7,000 lines of Scala. Reportedly, Kafka can produce about 250,000 messages per second (50 MB/s) and consume about 550,000 messages per second (110 MB/s).
Preparation
Kafka: kafka_2.10-0.8.2.0
Zookeeper version: 3.4.6
Zookeeper cluster: hadoop104, hadoop107, hadoop108
For how to build a Zookeeper cluster, see installing ZooKeeper cluster on CentOS.
Physical Environment
Two hosts are used:
192.168.40.104 hadoop104 (run 3 brokers)
192.168.40.105 hadoop105 (run 2 brokers)
Building the cluster involves three steps: single-node single-Broker, single-node multi-Broker, and multi-node multi-Broker.
Single-node single Broker
This section uses creating a Broker on hadoop104 as an example.
Download kafka
Download from http://kafka.apache.org/downloads.html:
- # tar -xvf kafka_2.10-0.8.2.0.tgz
- # cd kafka_2.10-0.8.2.0
Configuration
Modify config/server.properties:
- broker.id=1
- port=9092
- host.name=hadoop104
- socket.send.buffer.bytes=1048576
- socket.receive.buffer.bytes=1048576
- socket.request.max.bytes=104857600
- log.dir=./kafka1-logs
- num.partitions=10
- zookeeper.connect=hadoop107:2181,hadoop104:2181,hadoop108:2181
Start the Kafka Service
- # bin/kafka-server-start.sh config/server.properties
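To confirm that the broker started, a quick check (assuming the default log4j settings of this version, which write to logs/server.log) is:
- # jps
- # tail logs/server.log
jps should list a Kafka process, and the end of the log should contain a "started" message.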
Create a Topic
- # bin/kafka-topics.sh --create --zookeeper hadoop107:2181,hadoop104:2181,hadoop108:2181 --replication-factor 1 --partitions 1 --topic test
View topics
- # bin/kafka-topics.sh --list --zookeeper hadoop107:2181,hadoop104:2181,hadoop108:2181
Output:
test
Producer sends messages
- # bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
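Each line typed into the console producer is sent as a separate message, for example:
- This is a message
- This is another message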
Consumer receives messages
- # bin/kafka-console-consumer.sh --zookeeper hadoop107:2181,hadoop104:2181,hadoop108:2181 --topic test --from-beginning
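If the producer above has already sent messages, the consumer prints them back:
- This is a message
- This is another message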
If you only want new messages, simply remove the --from-beginning parameter:
- # bin/kafka-console-consumer.sh --zookeeper hadoop107:2181,hadoop104:2181,hadoop108:2181 --topic test
Single-node multi-Broker
Configuration
Copy the Kafka folder from the previous section to kafka_2 and kafka_3:
- # cp -r kafka_2.10-0.8.2.0 kafka_2
- # cp -r kafka_2.10-0.8.2.0 kafka_3
Modify the broker.id and port properties in kafka_2/config/server.properties and kafka_3/config/server.properties so that each broker is unique. Because log.dir is a relative path (./kafka1-logs), each copy already writes its data under its own folder; if you use absolute paths, log.dir must be made unique as well.
- kafka_2/config/server.properties:
- broker.id=2
- port=9093
- kafka_3/config/server.properties:
- broker.id=3
- port=9094
Start the other two brokers:
- # cd kafka_2
- # bin/kafka-server-start.sh config/server.properties &
- # cd ../kafka_3
- # bin/kafka-server-start.sh config/server.properties &
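Together with the broker from the previous section, three brokers should now be running on hadoop104; a quick sanity check is to run jps, which should show three Kafka processes:
- # jps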
Create a topic with a replication factor of 3:
- # bin/kafka-topics.sh --create --zookeeper hadoop107:2181,hadoop104:2181,hadoop108:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic
View the Topic status
- # bin/kafka-topics.sh --describe --zookeeper hadoop107:2181,hadoop104:2181,hadoop108:2181 --topic my-replicated-topic
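The output looks roughly like the following (an illustrative sample consistent with the description below; the actual leader and replica assignment depends on your cluster):
- Topic:my-replicated-topic  PartitionCount:1  ReplicationFactor:3  Configs:
- Topic: my-replicated-topic  Partition: 0  Leader: 3  Replicas: 3,1,2  Isr: 3,1,2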
From the output, we can see that the topic has one partition, a replication factor of 3, and that node 3 is the leader:
- "Leader" is the node responsible for all reads and writes for the given partition. Each node will be the leader for a randomly selected portion of the partitions.
- "Replicas" is the list of nodes that replicate the log for this partition regardless of whether they are the leader or even if they are currently alive.
- "Isr" is the set of "in-sync" replicas. This is the subset of the replicas list that is currently alive and caught-up to the leader.
Now take a look at the test topic created earlier: it has only a single replica, so there is no redundancy.
Multi-node multi-Broker
Extract the downloaded archive into the kafka_4 and kafka_5 folders on hadoop105, then copy the server.properties configuration from hadoop104 into the corresponding folders:
- # scp -r config/ root@hadoop105:/root/hadoop/kafka_4/
- # scp -r config/ root@hadoop105:/root/hadoop/kafka_5/
Modify the configuration as follows:
- kafka_4:
- broker.id=4
- port=9095
- host.name=hadoop105
- kafka_5:
- broker.id=5
- port=9096
- host.name=hadoop105
Start the services:
- # cd kafka_4
- # bin/kafka-server-start.sh config/server.properties &
- # cd ../kafka_5
- # bin/kafka-server-start.sh config/server.properties &
At this point, five brokers are running across the two physical machines.
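To confirm that all five brokers have registered with ZooKeeper, you can list their ids with the zookeeper-shell.sh script that ships with Kafka (a quick sanity check, assuming the ZooKeeper ensemble above):
- # bin/zookeeper-shell.sh hadoop107:2181 ls /brokers/ids
The result should contain [1, 2, 3, 4, 5].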
Summary
At the core of Kafka's design, data does not need to be cached in application memory: the operating system's file cache is already good enough, and as long as no random writes are required, sequential read/write performance is very high. Kafka only appends data sequentially, and its deletion policy is to remove data after it accumulates past a threshold or after a retention period expires.
Another distinctive feature of Kafka is that consumer state is stored on the client rather than on the MQ server, so the server does not need to track message delivery; each client knows where to resume reading. Message delivery also uses a client-driven pull model, which greatly reduces the load on the server. Kafka further emphasizes reducing serialization and copy overhead: it organizes messages into message sets for batched storage and transmission, and when clients pull data, it tries to transfer it in zero-copy mode using sendfile (corresponding to Java's advanced I/O functions FileChannel.transferTo/transferFrom) to cut copying costs. Overall, Kafka is a well-designed MQ system specialized for certain applications, and I expect more and more MQ systems to specialize in particular domains rather than remain general-purpose.
As long as disk space is not a constraint, Kafka can retain messages for a long period (for example, one week) without losing them.
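The retention period is controlled in server.properties. For example, this version's default keeps log segments for 168 hours (one week) before deletion; a size-based limit can additionally be set with log.retention.bytes:
- log.retention.hours=168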