"Reprint" Kafka Principle of work

Source: Internet
Author: User

Source article: http://www.ibm.com/developerworks/cn/opensource/os-cn-kafka/index.html

Message Queuing

Message queuing is a technique for exchanging information among distributed applications. Message queues can reside in memory or on disk, and they store messages until an application reads them. With message queuing, applications can run independently: they do not need to know each other's location, and a sender does not have to wait for the receiver to get a message before proceeding. In a distributed computing environment, integrating distributed applications requires an effective means of communication across heterogeneous networks, and managing shared information requires a common information-exchange mechanism for the applications. The message queue is the most commonly used such technology.

Message queue communication modes

    1. Point-to-point communication: point-to-point is the most traditional and common mode of communication. It supports one-to-one, one-to-many, many-to-many, and many-to-one configurations, as well as tree, mesh, and other topologies.

    2. Multicast: MQ suits many kinds of applications. One important, and growing, class is "multicast", the ability to send a message to multiple destinations (a distribution list). A single MQ command can send one message to multiple destinations while ensuring that it is reliably delivered to each one. Beyond plain multicast, MQ also performs intelligent message distribution: when a message is addressed to multiple users on one system, MQ sends a single copy of the message together with the recipient list to the target MQ system, which replicates the message locally and delivers it to the queues on the list, minimizing network traffic.

    3. Publish/subscribe mode: the publish/subscribe feature lets message distribution break through the limits of fixed destination queues, distributing messages by topic or even by content; users or applications receive the messages they need according to topic or content. Publish/subscribe loosens the coupling between sender and receiver: the sender need not care about the receiver's address, the receiver need not care about the sender's address, and both simply send and receive messages by topic.

    4. Cluster mode: to simplify system configuration in point-to-point mode, MQ provides a cluster solution. A cluster is similar to a domain: queue managers within a cluster do not need a message channel defined between every pair of them; instead they communicate with the other members over cluster channels, which greatly simplifies configuration. In addition, the queue managers in a cluster can load-balance automatically, and when one queue manager fails another can take over its work, greatly improving the system's reliability.


How Apache Kafka works

Kafka is a messaging system originally developed at LinkedIn as the basis for LinkedIn's activity stream and operational data processing pipeline. It has since been adopted by many companies for all kinds of data pipelines and messaging needs. Activity stream data is the most common data almost every site uses when reporting on its usage: page views, information about the content viewed, search terms, and so on. Such data is typically handled by writing the activities to log files and periodically analyzing those files statistically. Operational data is server performance data (CPU and I/O usage, request latency, service logs, and so on), and it is analyzed with a wide variety of statistical methods.

    • Kafka terminology

Broker: a Kafka cluster consists of one or more servers, each of which is called a broker.

Topic: every message published to a Kafka cluster belongs to a category called its topic. (Physically, messages of different topics are stored separately; logically, a topic's messages may be stored on one or more brokers, but users only need to specify a message's topic to produce or consume data, without caring where the data is stored.)

Partition: a partition is a physical concept; each topic contains one or more partitions.

Producer: responsible for publishing messages to Kafka brokers.

Consumer: the message consumer; a client that reads messages from Kafka brokers.

Consumer group: each consumer belongs to a specific consumer group (a group name can be specified for each consumer; consumers with no group name belong to the default group).

    • Kafka Interaction Process

Kafka is a distributed message publish/subscribe system designed to be fast, scalable, and durable. Like other publish/subscribe systems, Kafka stores messages in topics. Producers write data to topics and consumers read data from topics. Because Kafka is distributed, a topic can be partitioned and replicated across multiple nodes.

A message is simply an array of bytes, so programmers can store any object in it; supported data formats include String, JSON, and Avro. By binding a key to each message, a producer guarantees that all messages with the same key arrive at the same partition. A consumer belongs to a consumer group and subscribes to topics; through the group, consumers can receive all of a topic's messages across nodes, but each message is delivered to only one consumer in the group, and all messages with the same key are guaranteed to reach that same consumer.
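As a minimal illustration of keyed publishing with Kafka's Java client, here is a producer sketch; the broker address, topic name, and event payloads are assumptions made for this example, not details from the article.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key always hash to the same partition, so a
            // single consumer in a group receives every event for "user-42".
            producer.send(new ProducerRecord<>("events", "user-42", "login"));
            producer.send(new ProducerRecord<>("events", "user-42", "trade"));
        }
    }
}
```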

Kafka's design treats each topic partition as an ordered log. Every message in a partition is assigned a unique offset. Kafka does not try to track which messages have been read by each consumer so as to retain only the unread ones; instead, it retains all messages for a set amount of time, and each consumer is responsible for tracking its own position (offset) in each partition's log. Thanks to this design, Kafka can hold a very large amount of message data at low cost and support a large number of consumer subscriptions.
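To make the offset model concrete, here is a minimal consumer sketch; the broker address, topic, and group name are again assumptions. The consumer tracks its own position, and committing an offset only records how far this group has read.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OffsetTrackingConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "example-group");           // hypothetical group name
        props.put("enable.auto.commit", "false");         // we commit offsets ourselves
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // The offset is the message's position within its partition's log.
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
                consumer.commitSync(); // persist our position; the broker does not track reads
            }
        }
    }
}
```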


Designing a system architecture with Apache Kafka
    • Example: Online games

Suppose we are developing an online web game platform that must support a large number of players in real time, and players cooperate in a virtual world to complete tasks together. Because the game lets players trade gold coins and items, we must ensure the integrity of those trades, and to protect players and account security we track each player's IP address: when an account that has long logged in from a fixed IP address suddenly appears at a different one, we want to raise an alert, and likewise when a player's holdings of gold or items change dramatically. In addition, so that the team's data engineers can test new algorithms, we want this player data to flow into a Hadoop cluster, that is, to load the data into Hadoop.

For a real-time game we must process the data held in server memory quickly, which makes real-time alerts and other actions possible. Our system consists of multiple servers, and the in-memory data includes roughly the 30 most recent events of every online player (items, trade information, and so on), spread across those servers.

Our servers play two roles: the first accepts user-initiated actions such as trade requests; the second processes those trades in real time and triggers the necessary alerts based on the trade information. To process data quickly and in real time, we need to keep each user's trade history in the memory of one machine, which means we must pass data between servers, since the machine that receives a user's request will not necessarily hold that user's trade history. To keep the two roles loosely coupled, we use Kafka to pass data between the servers.

    • Kafka characteristics

Several of Kafka's properties fit our needs well: scalability, data partitioning, low latency, and the ability to handle a large number of diverse consumers. Here we configure a single Kafka topic for both logins and trades. Because Kafka preserves message order within a single topic partition but not across topics, storing login and trade events in the same topic ensures that a user's login (with its actual IP address) is seen before their trades.

When a user logs in or initiates a trade, the receiving server immediately sends an event to Kafka. We use the user ID as the message key and the event as the value. This guarantees that all trade and login events of the same user land in the same Kafka partition. Each event-processing service runs as a Kafka consumer, and all of them are configured in the same consumer group; each server therefore reads from a subset of the Kafka partitions, and all the data of one partition goes to the same event-processing server (which can differ from the receiving server). When an event-processing server reads a user's trade from Kafka, it adds it to the user's history list held in local memory, so the server can consult the user's history in local memory and raise alerts without any extra network or disk overhead.
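A sketch of such an event-processing server follows. Every name here (topic, group, payload format) is hypothetical, and the anomaly check is a deliberately trivial placeholder for the IP- and balance-based rules described above.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EventProcessorSketch {
    // Per-user history kept in local memory. Because the user ID is the Kafka
    // message key, every event for a given user arrives at this one server.
    private static final Map<String, List<String>> history = new HashMap<>();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "event-processors"); // same group on every processing server
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    List<String> events = history.computeIfAbsent(rec.key(), k -> new ArrayList<>());
                    events.add(rec.value());
                    if (looksSuspicious(events)) {
                        // A real system might publish to a separate alerts topic here.
                        System.out.println("ALERT for user " + rec.key());
                    }
                }
            }
        }
    }

    // Trivial placeholder rule: flag activity that is not preceded by a login.
    private static boolean looksSuspicious(List<String> events) {
        return !events.isEmpty() && !events.get(0).startsWith("login");
    }
}
```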

Figure 1. Game Design Diagram

> For multithreading, we create one partition per event-processing server or per core. Kafka has been tested in clusters with up to 10,000 partitions.
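One way to create such a topic programmatically is Kafka's Java AdminClient; the partition count (say, eight, one per processing core) and the replication factor of 3 are purely illustrative.

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Eight partitions, e.g. one per event-processing core;
            // replication factor 3 assumes a cluster of at least three brokers.
            NewTopic events = new NewTopic("events", 8, (short) 3);
            admin.createTopics(List.of(events)).all().get(); // block until created
        }
    }
}
```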

    • Back to Kafka

The example above may sound roundabout: we first send events from a game server to Kafka, and then another server reads them back from the topic and processes them. However, this design decouples the two roles and lets us manage each role's capabilities independently. Nor does the approach overload Kafka: test results show that even a three-node cluster can handle close to a million tasks per second, with an average of about 3 milliseconds per task from being produced to being consumed.

In the example above, when a suspicious event is detected we send an alert flag to a separate Kafka topic, where a consumer service reads it and stores the data in a Hadoop cluster for further analysis.

Because Kafka does not track message acknowledgements or maintain per-consumer queues, it can serve thousands of consumers simultaneously at minimal cost. Kafka can even handle batch consumers, for example waking a batch of sleeping consumers once an hour to process all pending messages.

Kafka also makes it easy to load data into a Hadoop cluster. When there are multiple data sources and multiple data destinations, writing a separate data channel for each source-destination pair quickly becomes a tangle. Kafka helped LinkedIn standardize its data channels, letting each system fetch and write data once, which greatly reduces the channels' complexity and the time spent maintaining them.

"I started by doing key-value data storage in 2008, and my project was to try to run Hadoop and move some of our processes into Hadoop," said LinkedIn architect Jay Kreps. We had almost no experience in this area, and spent weeks trying to import, export, and other events to try out the various predictive algorithms used above, and then we started the long road. "

    • Differences from Flume

Kafka and Flume genuinely overlap in many of their functions. Here are some suggestions for evaluating the two systems:

    1. Kafka is a general-purpose system. You can have many producers and consumers sharing many topics. By contrast, Flume is a special-purpose tool designed to send data to HDFS and HBase. Flume is optimized for HDFS and integrates with Hadoop's security architecture. Accordingly, Cloudera recommends Kafka if the data will be consumed by multiple applications, and Flume if the data is destined only for Hadoop.

    2. Flume ships with many ready-made sources and sinks. Kafka, by contrast, has a much smaller ecosystem of ready-made producers and consumers, and community support for them is limited. If Flume's existing sources and sinks cover your data endpoints and no extra coding is needed, use Flume; conversely, if you are prepared to write your own producers and consumers, use Kafka.

    3. Flume can process data in flight using interceptors, which is useful for filtering data. Kafka needs an external system to do such processing.

    4. Both systems can guarantee no data loss. However, Flume does not replicate events: even with the reliable file channel, if the node hosting a Flume agent goes down you lose access to its events until the node is repaired. A pipeline built on Kafka does not have this problem.

    5. Flume and Kafka can work together. If you need to move streaming data from Kafka into Hadoop, you can use a Flume agent with Kafka as its source, which reads data from Kafka and writes it to Hadoop. There is no need to develop your own consumers: you can use Flume's integration with Hadoop and HBase, monitor the consumers with Cloudera Manager, and process the data by adding interceptors, as the illustrative agent configuration after this list sketches.
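For concreteness, here is roughly what such an agent configuration might look like. This is a hedged sketch, not a tested deployment: the agent and component names, topic, and HDFS path are invented, and it assumes a Flume release (1.7 or later) whose Kafka source accepts the kafka.bootstrap.servers and kafka.topics properties.

```properties
# Hypothetical Flume agent: Kafka source -> file channel -> HDFS sink
tier1.sources  = kafka-source
tier1.channels = file-channel
tier1.sinks    = hdfs-sink

tier1.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.kafka-source.kafka.bootstrap.servers = localhost:9092
tier1.sources.kafka-source.kafka.topics = events
tier1.sources.kafka-source.channels = file-channel

tier1.channels.file-channel.type = file

tier1.sinks.hdfs-sink.type = hdfs
tier1.sinks.hdfs-sink.hdfs.path = /tmp/kafka/events
tier1.sinks.hdfs-sink.channel = file-channel
```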


Conclusion

To sum up, Kafka's design can help us solve many architectural problems. But to benefit from Kafka's high performance, loose coupling, high reliability, and durability, we need a thorough understanding of both Kafka itself and our own application scenario; Kafka is not the best choice for every environment.

"Reprint" Kafka Principle of work

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.