One. Introduction to Kafka
Kafka is a distributed publish-subscribe messaging system. Originally developed at LinkedIn and written in Scala, it later became an Apache project. Kafka provides a distributed, partitioned, replicated, multi-subscriber persistent log service. It is mainly used for processing active streaming data (real-time computing).
In a big data system, a common problem is that the system as a whole is made up of many subsystems, and data needs to flow continuously between them with high performance and low latency. Traditional enterprise messaging systems are not well suited to large-scale data processing. Kafka appeared in order to serve both online applications (messages) and offline applications (data files, logs). Kafka can play two roles:
1. Reduce the complexity of system networking.
2. Reduce the complexity of programming: subsystems no longer negotiate interfaces with one another. Each subsystem plugs into Kafka like a plug into a socket, and Kafka acts as a high-speed data bus.
Two. Main features of Kafka
1. High throughput for both publishing and subscribing. Reportedly, Kafka can produce about 250,000 messages per second and process about 550,000 messages per second.
2. Persistence. Messages are persisted to disk, so Kafka can be used for batch consumption such as ETL as well as for real-time applications. Data loss is prevented by persisting data to disk and by replication.
3. Distributed and easy to scale out. Producers, brokers, and consumers can all run as multiple distributed instances, coordinated through ZooKeeper. Machines can be added without downtime.
4. The state of message consumption is maintained on the consumer side, not by the server. Load is rebalanced automatically when a failure occurs.
5. Support for online and offline scenarios.
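Feature 4 can be illustrated with a toy model. The sketch below is not Kafka's API: `Log` and `Consumer` are hypothetical names, and an in-memory list stands in for the broker's on-disk log. It only shows the idea that the server keeps the messages while each consumer keeps its own read position (offset), so an online reader and an offline reader can progress independently.

```python
class Log:
    """Toy stand-in for a server-side append-only log (hypothetical, in-memory)."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the appended message

    def read(self, offset, max_messages=10):
        # The log does not record who has read what; it just serves a slice.
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Each consumer tracks its own offset; the log holds no per-reader state."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self._log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = Log()
for i in range(5):
    log.append(f"event-{i}")

realtime = Consumer(log)  # e.g. an online application reading a little at a time
batch = Consumer(log)     # e.g. an offline ETL job reading in bulk

first_two = realtime.poll(max_messages=2)
everything = batch.poll()
```

Because the offset lives in the consumer, a reader that falls behind or restarts does not affect any other reader of the same log.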
Three. Why use a messaging system
Systems can communicate with one another through a message queue; that is, the queue carries the coordination and invocation between systems.
Note: What is the difference between using a message queue and using an SOA architecture?
1. In SOA, services are called directly (for example via RPC or HttpClient).
2. With a message queue, coordination and invocation between two systems are accomplished by delivering messages.
Benefits of using a message queue:
1. Decoupling
With a message queue, there is no direct call relationship between the two systems. They interact only by passing messages, so neither system intrudes on the other.
2. Improve the response speed of the system
Example: order processing
onOrderPaymentSuccess() {
    1. Modify the order status
    2. Calculate member points
    3. Notify logistics for delivery
}
Note:
1. If all three steps are processed before the method returns, the response is slow.
2. Instead, handle first what the user cares about most and most urgently needs to see: confirmation that the order status has changed. "Modify the order status" is processed first and the result is returned to the user immediately, while "Calculate member points" and "Notify logistics for delivery" are put on a message queue for back-end systems to process afterwards.
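The order example can be sketched as follows. This is a minimal illustration using Python's standard `queue` module as a stand-in for a real message queue; the function name, task names, and order fields are assumptions made for the example.

```python
import queue
import threading

tasks = queue.Queue()  # stands in for the message queue
completed = []         # records the background steps that have finished


def on_order_payment_success(order):
    order["status"] = "PAID"                      # step 1: done synchronously
    tasks.put(("calculate_points", order["id"]))  # step 2: deferred
    tasks.put(("notify_logistics", order["id"]))  # step 3: deferred
    return order["status"]                        # the user gets a reply right away


def worker():
    while True:
        task = tasks.get()
        if task is None:  # sentinel: stop the worker
            tasks.task_done()
            break
        completed.append(task)  # a real back end would do the actual work here
        tasks.task_done()


threading.Thread(target=worker, daemon=True).start()

order = {"id": 42, "status": "NEW"}
status = on_order_payment_success(order)

tasks.put(None)
tasks.join()  # wait until the back end has drained the queue
```

The caller only waits for step 1. Steps 2 and 3 run whenever the worker gets to them, which is what shortens the response time seen by the user.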
Redundancy
Sometimes the process that handles data fails. Unless the data has been persisted, it is lost. A message queue persists data until it has been fully processed, which avoids this risk of data loss. In the "insert-get-delete" paradigm used by many message queues, your processing system must explicitly indicate that a message has been processed before it is removed from the queue, which ensures that your data is stored safely until you have finished with it.
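A minimal sketch of the insert-get-delete idea, assuming a single-process, in-memory queue. `AckQueue` and its method names are hypothetical and do not correspond to any particular product's API; a real queue would persist both collections to disk.

```python
import itertools


class AckQueue:
    """Toy insert-get-delete queue (hypothetical; in-memory, single process)."""

    def __init__(self):
        self._ids = itertools.count()
        self._pending = {}    # id -> message, waiting to be delivered
        self._in_flight = {}  # id -> message, delivered but not yet acknowledged

    def insert(self, message):
        msg_id = next(self._ids)
        self._pending[msg_id] = message
        return msg_id

    def get(self):
        if not self._pending:
            raise LookupError("queue is empty")
        # Hand out the oldest pending message but keep a copy until it is deleted.
        msg_id = next(iter(self._pending))
        message = self._pending.pop(msg_id)
        self._in_flight[msg_id] = message
        return msg_id, message

    def delete(self, msg_id):
        # Explicit acknowledgement: only now is the message really gone.
        del self._in_flight[msg_id]

    def requeue_unacked(self):
        # After a consumer failure, unacknowledged messages become pending again.
        self._pending = {**self._in_flight, **self._pending}
        self._in_flight = {}

    def __len__(self):
        return len(self._pending) + len(self._in_flight)


q = AckQueue()
q.insert("resize image 1")
q.insert("resize image 2")

msg_id, msg = q.get()  # a consumer takes a message...
q.requeue_unacked()    # ...but fails before acknowledging it

retry_id, retry_msg = q.get()  # the same message is delivered again
q.delete(retry_id)             # processed successfully this time
remaining = len(q)
```

The message survives the failed attempt because `get` only moves it aside; nothing is discarded until `delete` is called.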
Scalability
Because a message queue decouples your processing, it is easy to increase the rate at which messages are enqueued and processed: just add more processing workers. No code changes or parameter tuning are needed. Scaling out is as simple as turning up a power dial.
Flexibility & peak handling capability
Applications need to keep functioning when traffic surges, but such bursts are uncommon, and keeping resources on standby that are sized for peak load would be a huge waste. A message queue lets critical components withstand sudden bursts of traffic instead of collapsing under overload.
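A small simulation makes the peak-handling argument concrete. The arrival numbers and the per-tick capacity below are made-up values chosen for illustration; the point is only that a burst turns into a temporary backlog rather than a failure.

```python
def simulate(arrivals, capacity_per_tick):
    """Return the queue backlog after each tick."""
    backlog = 0
    history = []
    for arriving in arrivals:
        backlog += arriving                         # requests are enqueued, not rejected
        backlog -= min(backlog, capacity_per_tick)  # workers drain at a fixed rate
        history.append(backlog)
    return history


arrivals = [2, 2, 10, 2, 2, 0, 0]  # a burst of 10 requests in the third tick
history = simulate(arrivals, capacity_per_tick=3)
print(history)  # [0, 0, 7, 6, 5, 2, 0]
```

With capacity sized for 3 requests per tick instead of 10, the burst is absorbed by the queue and worked off over the following ticks.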
Recoverability
When one part of the system fails, the whole system is not affected. A message queue reduces the coupling between processes, so even if a process that handles messages dies, the messages already in the queue can still be processed once the system recovers.
Order Guarantee
In most usage scenarios, the order in which data is processed matters. Most message queues are inherently ordered and can guarantee that data is processed in a specific order. Kafka guarantees the ordering of messages within a partition.
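The per-partition guarantee can be sketched with a toy partitioner. This is an illustration, not Kafka code: the partition count is an assumed value, and CRC32 is used only as a convenient deterministic hash (Kafka's own partitioner may use a different function). Messages with the same key always land in the same partition and therefore keep their relative order, while no order is promised across partitions.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumed value for the example


def partition_for(key):
    # Deterministic hash, so the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


partitions = defaultdict(list)  # partition number -> ordered list of messages

events = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-2", "cancelled"),
    ("order-1", "shipped"),
]

for key, value in events:
    partitions[partition_for(key)].append((key, value))


def events_for(key):
    """Read back one key's events in the order its partition stored them."""
    return [v for k, v in partitions[partition_for(key)] if k == key]
```

Choosing the order ID as the key is a design decision: it keeps all events of one order in sequence while still spreading different orders across partitions.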
Buffer
In any significant system, some elements require different processing times. For example, loading an image takes less time than applying a filter. A message queue acts as a buffer layer that helps tasks run at peak efficiency, because writes to the queue can proceed as fast as possible. This buffering helps control and optimize the rate at which data flows through the system.
Asynchronous communication
Often, users do not want or need to process a message immediately. A message queue provides an asynchronous processing mechanism that lets a user put a message on the queue without processing it right away. Put as many messages on the queue as you like, then process them when you need to.
Four. Classification of message queues
Message queues fall into two categories: point-to-point and publish/subscribe.
1. Point-to-point
The message producer sends messages to a queue, and the message consumer takes messages off the queue and consumes them.
Note (disadvantages):
1. Once a message has been consumed, it is no longer stored in the queue, so a consumer cannot consume a message that has already been consumed.
2. A queue supports multiple consumers, but any given message can be consumed by only one of them.
(Once one system consumes the message, other systems can no longer consume it.)
2. Publish/subscribe (most commonly used)
The message producer (publisher) publishes messages to a topic, and multiple message consumers (subscribers) consume them. Unlike point-to-point, a message published to a topic is consumed by all subscribers.
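The difference between the two models can be sketched in a few lines. Both classes below are hypothetical, in-memory illustrations rather than the API of any real messaging product.

```python
from collections import deque


class PointToPointQueue:
    """Each message is delivered to at most one consumer."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Removing the message means no other consumer can receive it.
        return self._messages.popleft() if self._messages else None


class Topic:
    """Each message is delivered to every subscriber."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        for callback in self._subscribers:
            callback(message)


q = PointToPointQueue()
q.send("order paid")
first = q.receive()   # consumer A gets the message
second = q.receive()  # consumer B finds the queue empty

points_inbox, logistics_inbox = [], []
topic = Topic()
topic.subscribe(points_inbox.append)
topic.subscribe(logistics_inbox.append)
topic.publish("order paid")  # both subscribers receive it
```

In the point-to-point case the points system and the logistics system would compete for the single message; with a topic, each receives its own copy.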
Five. Comparison of common message queues
1. RabbitMQ: Supports many protocols. A very heavyweight message queue with good support for routing, load balancing, and data persistence.
2. ZeroMQ: Billed as the fastest message queue system, especially suited to high-throughput scenarios. It excels at advanced/complex queues, but it is technically complex to use and provides only non-persistent queues.
3. ActiveMQ (an implementation of JMS): A subproject of Apache. Like ZeroMQ, it can implement queues using either broker or peer-to-peer technology.
4. Redis: A key-value NoSQL database that also supports MQ functionality. For small messages it performs better than RabbitMQ, but for data larger than 10K it becomes unbearably slow.
Note: A message queue must not be a single point of failure, so it needs to be clustered. This involves load balancing and message persistence.
Six. Kafka test results