First of all, Flume and Kafka are both messaging systems, but they differ in many ways: Flume leans toward being a message collection system, while Kafka leans toward being a message caching system.
"One" Differences in design
Flume is a message collection system; the main problem it solves is collecting messages from many kinds of sources. To that end, Flume provides more than 10 types of Flume source, so users can capture data in a way that suits their application scenario. Because Flume provides these ready-made sources, gathering messages is easy: users often only need to process the raw data slightly and then send it to a Flume source. With the Flume Thrift source, for example, Flume already implements the Thrift server side, and the user simply writes a client that sends data to Flume.
Kafka is a message caching system, mainly used to cache data. The cache (retention) time is set in the configuration file. Within that time, cached data is not deleted whether or not it has been consumed; once data is older than the retention time, Kafka deletes it to free up space. Flume is different: its data is deleted as soon as the sink confirms that it has been received.
"Two" Differences in data processing
When Flume receives data, it actively pushes the data to the sink, and once the sink confirms receipt the data is deleted from the channel. Flume is therefore mainly about collecting data quickly: the data is only passing through, and Flume focuses on speed.
Kafka caches data first. Regardless of whether anyone downstream consumes it, the data is held temporarily in the server cluster; the focus is on storage.
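The two deletion policies described above can be compared with a small toy model. This is only a sketch: `RetentionLog` and `AckChannel` are hypothetical names invented for illustration, not Flume or Kafka APIs, and time is passed in explicitly as `now` (in seconds) so the example needs no real clock.

```python
from collections import deque


class RetentionLog:
    """Toy model of time-based retention (an illustration, not the Kafka API)."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._records = deque()  # (timestamp, message) pairs, oldest first

    def append(self, message, now):
        self._records.append((now, message))

    def read_all(self, now):
        """Reading never removes data; only age does."""
        self._expire(now)
        return [message for _, message in self._records]

    def _expire(self, now):
        # Drop records older than the retention window, consumed or not.
        while self._records and now - self._records[0][0] > self.retention_seconds:
            self._records.popleft()


class AckChannel:
    """Toy model of a channel that drops an event once it has been delivered."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take_and_ack(self):
        # Delivery and confirmation are collapsed into one step here.
        return self._events.popleft()

    def __len__(self):
        return len(self._events)


log = RetentionLog(retention_seconds=60)
log.append("a", now=0)
log.append("b", now=30)
first_read = log.read_all(now=40)    # both records are still present
second_read = log.read_all(now=50)   # reading again returns the same data
after_expiry = log.read_all(now=70)  # "a" is now older than 60 s and is gone

channel = AckChannel()
channel.put("a")
delivered = channel.take_and_ack()   # the channel is empty afterwards
```

In the retention model, consuming data has no effect on how long it is kept; in the channel model, delivery itself is what removes the data.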
"Three" Push vs pull
Flume actually consists of three components: source, channel, and sink. The source receives data, the channel caches data, and the sink sends data. The sink actively pushes data downstream, which in practice means one receiver per channel: if several receivers took data from the same channel, each would receive only part of the data, depending on how fast it takes events, because the channel deletes an event once a sink confirms it. If you want to send data to multiple receivers, the source has to write the data to multiple channels, and each channel's data is then sent to a different receiver by its own sink.
Kafka itself actually consists only of the broker cluster, which caches the data; the producer and consumer both have to be implemented by the user. The broker cluster is more like a file system that provides data storage, which users read from and write to themselves. It does not need to care how the user's code is implemented, so Kafka depends very little on producers and consumers and concentrates on caching. When a consumer subscribes to a topic, the consumer actively fetches the data and the broker cluster passively serves it, which is why Kafka can support multiple consumers subscribing to the same data at the same time.
"Four" Kafka topics, partitions, and the replication factor
When sending data to Kafka, the producer specifies a topic. Kafka classifies data by topic, and a consumer subscribes only to the topics it needs.
A partition is a further subdivision of a topic; for example, the data from a web application can be stored across different partitions. Each message stored in a partition is assigned a unique, sequentially increasing number within that partition. This number is called the offset. The offset is kept by the consumer, which uses it to read data sequentially, or changes it to re-read or skip data.
The replication factor is a measure for improving the fault tolerance of the Kafka cluster. The data in a partition is copied to as many different brokers as the replication factor specifies. One of these brokers handles the read and write requests for the partition and is called the master (leader) node; the others are slave (follower) nodes that replicate the data from it. The replication factor for a topic should be chosen according to the number of machines in your broker cluster, and it cannot be larger than the number of brokers.
Kafka guarantees that the data within a partition is consumed in order, but it does not guarantee ordering across different partitions. So if you want all of the data to be ordered, the topic can have only one partition.
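The source/channel/sink flow, and the reason for using one channel per receiver, can be sketched as a toy model. The classes `Channel`, `Source`, and `Sink` below are hypothetical stand-ins written for this example, not Flume's Java classes, and each receiver is just a Python list.

```python
from collections import deque


class Channel:
    """Buffers events between a source and a sink."""

    def __init__(self):
        self.events = deque()


class Source:
    """Copies every incoming event into all attached channels."""

    def __init__(self, channels):
        self.channels = channels

    def receive(self, event):
        for channel in self.channels:
            channel.events.append(event)


class Sink:
    """Pushes events from its channel to one receiver."""

    def __init__(self, channel, receiver):
        self.channel = channel
        self.receiver = receiver  # a plain list standing in for a downstream system

    def push_one(self):
        if not self.channel.events:
            return False
        event = self.channel.events[0]
        self.receiver.append(event)    # push downstream
        self.channel.events.popleft()  # delete only after delivery
        return True

    def drain(self):
        while self.push_one():
            pass


# Two sinks on one shared channel: each event reaches only one receiver.
shared = Channel()
Source([shared]).receive("e1")
for name in ["e2", "e3", "e4"]:
    shared.events.append(name)
left, right = [], []
sink_left, sink_right = Sink(shared, left), Sink(shared, right)
while shared.events:
    sink_left.push_one()
    sink_right.push_one()

# One channel per receiver: every receiver gets every event.
c1, c2 = Channel(), Channel()
fan_out = Source([c1, c2])
for name in ["e1", "e2", "e3", "e4"]:
    fan_out.receive(name)
receiver_a, receiver_b = [], []
Sink(c1, receiver_a).drain()
Sink(c2, receiver_b).drain()
```

With the shared channel the events end up split between `left` and `right`; with two channels both receivers see the full stream, at the cost of buffering every event twice.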
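The pull model can be illustrated with a minimal sketch in which the broker only stores and serves data, and each consumer remembers its own position. `Broker` and `Consumer` here are hypothetical toy classes, not the Kafka client API, and the topic name `clicks` is made up for the example.

```python
class Broker:
    """Append-only log per topic; fetching does not remove anything."""

    def __init__(self):
        self._topics = {}

    def append(self, topic, message):
        self._topics.setdefault(topic, []).append(message)

    def fetch(self, topic, offset, max_messages):
        # Passive side: just return what was asked for.
        return self._topics.get(topic, [])[offset:offset + max_messages]


class Consumer:
    """Active side: decides when to fetch and how much."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0  # position is tracked by the consumer, not the broker

    def poll(self, max_messages=10):
        batch = self.broker.fetch(self.topic, self.offset, max_messages)
        self.offset += len(batch)
        return batch


broker = Broker()
for message in ["m1", "m2", "m3"]:
    broker.append("clicks", message)

fast = Consumer(broker, "clicks")
slow = Consumer(broker, "clicks")

fast_batch = fast.poll()                  # takes everything at once
slow_first = slow.poll(max_messages=1)    # takes one message
slow_rest = slow.poll(max_messages=10)    # catches up later
```

Because the broker never deletes on read, a slow consumer does not cause a fast one to miss data, and the reverse also holds.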
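Per-partition offsets, and the way a consumer can move its offset to re-read or skip data, can be sketched as follows. `PartitionedTopic` is a hypothetical toy class for illustration only; the choice of partition is left to the caller here, which is an assumption of this sketch.

```python
class PartitionedTopic:
    """Toy topic in which each partition is an independent append-only list."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        """Store the message and return the offset assigned to it."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1

    def read(self, partition, offset):
        """Return every message in the partition from the given offset onward."""
        return self.partitions[partition][offset:]


topic = PartitionedTopic(num_partitions=2)
assigned = [
    topic.append(0, "a"),
    topic.append(0, "b"),
    topic.append(1, "x"),  # numbering starts again at 0 in partition 1
    topic.append(0, "c"),
]

# The consumer keeps one position per partition.
position = {0: 0, 1: 0}
sequential = topic.read(0, position[0])

position[0] = 2  # skip ahead
skipped = topic.read(0, position[0])

position[0] = 0  # rewind and read again
replayed = topic.read(0, position[0])
```

Offsets are unique only within a partition, so a position is always a (partition, offset) pair.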
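How a replication factor turns into one leader plus followers on different brokers can be sketched with a simple assignment function. The round-robin placement below is an assumption chosen for readability, not Kafka's actual placement algorithm, and `assign_replicas` and the broker names are hypothetical.

```python
def assign_replicas(brokers, num_partitions, replication_factor):
    """Place each partition's replicas on distinct brokers, round-robin."""
    if replication_factor > len(brokers):
        # Each replica needs its own broker, so this cannot be satisfied.
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for partition in range(num_partitions):
        replicas = [
            brokers[(partition + i) % len(brokers)]
            for i in range(replication_factor)
        ]
        assignment[partition] = {
            "leader": replicas[0],      # serves reads and writes
            "followers": replicas[1:],  # copy data from the leader
        }
    return assignment


plan = assign_replicas(["b0", "b1", "b2"], num_partitions=3, replication_factor=2)
```

With three brokers and a replication factor of 2, every partition survives the loss of any single broker, and leadership is spread so that no one broker handles all reads and writes.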
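The ordering guarantee can be seen in a small simulation. Both the round-robin distribution of messages and the partition-by-partition reading order are assumptions of this sketch; a real consumer may interleave partitions in any order, which is exactly why no global order can be relied on.

```python
def produce_round_robin(messages, num_partitions):
    """Spread messages over partitions in turn."""
    partitions = [[] for _ in range(num_partitions)]
    for i, message in enumerate(messages):
        partitions[i % num_partitions].append(message)
    return partitions


def consume_partition_by_partition(partitions):
    """Read each partition from start to end, one partition after another."""
    return [message for partition in partitions for message in partition]


messages = [1, 2, 3, 4, 5, 6]

two = produce_round_robin(messages, num_partitions=2)
seen_with_two = consume_partition_by_partition(two)

one = produce_round_robin(messages, num_partitions=1)
seen_with_one = consume_partition_by_partition(one)
```

With two partitions each partition is still internally sorted, but the combined stream is not in the order the messages were produced; with a single partition the original order is preserved.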
The difference between the message systems Flume and Kafka