Kafka Message Reliability Testing: Choosing a Scheme for the Live Broadcast Business

Transferred from: http://blog.csdn.net/bailove/article/details/44240303

Business Scenarios

On the Laifeng live interactive platform, millions of people are online each day, and hundreds of thousands take part in interactive live chat at the same time. User logins and logouts, together with interactions between users such as chatting, sending gifts, following, voting, and grabbing the sofa, generate a large volume of messages. These messages arrive in sudden bursts, for example at the premiere of a popular live broadcast or at the climax of a live performance. Messages for gifts, stars, speakers, sofas, and similar events must not be lost: they must be delivered 100% of the time. This calls for a high-performance, highly reliable, stable, and scalable messaging platform, one that guarantees each message reaches the server at least once even under failures such as network congestion and server downtime. We therefore needed to test Kafka, the messaging service that the big data platform already provides.

Test Environment

CPU: 24 × Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz

Memory: 32 GB

Number of disks: 1 (ordinary SATA disk)

Kafka version: 0.8.2

Cluster size: 4 nodes

Topic replication factor: 3

Topic partitions: 4

Disaster simulation

    1. While messages are being sent, one node goes down, or two nodes go down at the same time (at most 2 simultaneously, because the replication factor is 3)

    2. One node is repeatedly taken down and restarted

    3. One or two brokers are taken down and restarted in rotation, with the guarantee that 3 nodes are never down at the same time

(PS: In nearly 1 year of operation we have never had three nodes down at the same time; the most we have seen is two.)
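The limit of two simultaneous failures can be checked with a little arithmetic. The sketch below is only an illustration: it assumes a simplified round-robin replica placement (Kafka's real assignment differs in detail) and assumes that any surviving replica is able to take over as leader. The names `assign_replicas`, `unavailable_partitions`, and `worst_case` are made up for this example.

```python
from itertools import combinations

BROKERS = [0, 1, 2, 3]      # 4-node cluster, as in the test environment
PARTITIONS = 4              # partitions per topic
REPLICATION_FACTOR = 3      # replicas per partition


def assign_replicas(brokers, partitions, replication_factor):
    """Simplified round-robin placement: partition p uses brokers p, p+1, ... (mod cluster size)."""
    n = len(brokers)
    return {
        p: [brokers[(p + i) % n] for i in range(replication_factor)]
        for p in range(partitions)
    }


def unavailable_partitions(assignment, down):
    """Partitions whose replicas all sit on brokers that are down."""
    return [p for p, replicas in assignment.items() if set(replicas) <= set(down)]


def worst_case(assignment, brokers, failures):
    """Largest number of partitions lost over every choice of `failures` down brokers."""
    return max(
        len(unavailable_partitions(assignment, down))
        for down in combinations(brokers, failures)
    )


assignment = assign_replicas(BROKERS, PARTITIONS, REPLICATION_FACTOR)
for failures in (1, 2, 3):
    print(failures, "broker(s) down -> worst-case partitions with no replica left:",
          worst_case(assignment, BROKERS, failures))
```

Under these assumptions every partition keeps at least one replica with one or two brokers down, while any three brokers down leaves one partition with no replica at all.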

Test results

Conclusion

Synchronous_ack mode ensures that each message is delivered to the server at least once. With three replicas, the Kafka cluster keeps serving both producers and consumers as long as no more than two nodes are down at the same time. However, network or service problems can cause the same data to be sent more than once in this mode, so the consumer's processing must be idempotent.
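One common way to make consumer processing idempotent is to give every message a unique ID and skip IDs that have already been applied. The sketch below assumes such an ID exists; the `GiftLedger` class and the message fields are hypothetical and not taken from the tested system.

```python
class GiftLedger:
    """Toy consumer-side state: total gift value per user (hypothetical business logic)."""

    def __init__(self):
        self.totals = {}
        self.seen_ids = set()   # IDs of messages that have already been applied

    def apply(self, message):
        """Apply a message once; return False if it was a duplicate delivery."""
        if message["id"] in self.seen_ids:
            return False
        self.seen_ids.add(message["id"])
        user = message["user"]
        self.totals[user] = self.totals.get(user, 0) + message["value"]
        return True


ledger = GiftLedger()
delivered = [
    {"id": "m1", "user": "alice", "value": 10},
    {"id": "m2", "user": "bob", "value": 5},
    {"id": "m1", "user": "alice", "value": 10},  # same message redelivered after a retry
]
results = [ledger.apply(m) for m in delivered]
print(results)         # [True, True, False]
print(ledger.totals)   # {'alice': 10, 'bob': 5}
```

In a real consumer the set of seen IDs would have to be persisted and bounded in size; the in-memory set here only illustrates the idea.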

Synchronous_ack, non-batch

In synchronous_ack mode without batching, a single process has a low send rate of 500 KB/s; adding more processes raises the overall throughput.

In this mode, data is lost, or appears to be lost, in the following cases:

①: Three Kafka servers go down at the same time (the replication factor is 3). With no leader available to serve requests, the data generated by the producer cannot be written to Kafka and is lost.

②: The consumer program goes down, which can cause errors in business statistics that look like data loss. The data is not actually lost: the consumer fetched those messages but did not finish the business logic for them, so the data appears to be missing. To handle this, the consumer side needs mechanisms for rolling back, re-consuming, and backfilling data.
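Re-consumption after a consumer crash is usually built on committing the offset only after the business logic for a message has finished. The following is a minimal in-memory simulation of that idea, not Kafka client code; `consume` and its `crash_at` parameter are invented for the illustration.

```python
def consume(log, committed_offset, process, crash_at=None):
    """Process log[committed_offset:] in order and return the new committed offset.

    The offset advances only after `process` returns, so a crash part-way
    through leaves the unfinished message to be consumed again on restart.
    `crash_at` simulates a consumer failure while handling that offset.
    """
    try:
        for offset in range(committed_offset, len(log)):
            if offset == crash_at:
                raise RuntimeError("simulated consumer crash")
            process(log[offset])
            committed_offset = offset + 1   # commit only after success
    except RuntimeError:
        pass  # the process died; committed_offset still marks the last finished message
    return committed_offset


log = ["gift:alice:10", "gift:bob:5", "gift:carol:7"]
processed = []

offset = consume(log, 0, processed.append, crash_at=1)          # crashes on the 2nd message
offset_after_restart = consume(log, offset, processed.append)   # restart resumes from the commit
print(offset, offset_after_restart, processed)
```

If a crash happens after the business logic has run but before the commit, the message is processed again on restart, which is why this approach is normally paired with idempotent processing.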

Synchronous_ack_batch

With a batch size of 200, synchronous_ack_batch mode sends at roughly the same rate as asynchronous_noack mode without batching. A single client sends at 7 MB/s (the rate rises further with a larger batch size), and throughput grows linearly with the number of clients until network bandwidth becomes the bottleneck.
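The gain from batching comes from paying one acknowledged round trip per batch rather than per message. The sketch below illustrates the control flow only; `send_batch` is a hypothetical callable standing in for a real producer call, and `FlakyBroker` is a test double that loses one acknowledgement.

```python
def chunked(messages, batch_size):
    """Yield consecutive slices of at most batch_size messages."""
    for start in range(0, len(messages), batch_size):
        yield messages[start:start + batch_size]


def send_sync_batches(messages, send_batch, batch_size=200, max_retries=3):
    """Send messages batch by batch, waiting for each ack before continuing.

    `send_batch` is a hypothetical callable that returns True once the broker
    has acknowledged the whole batch. Returns the number of round trips used.
    """
    round_trips = 0
    for batch in chunked(messages, batch_size):
        for _ in range(max_retries + 1):
            round_trips += 1
            if send_batch(batch):
                break
        else:
            raise RuntimeError("batch not acknowledged after retries")
    return round_trips


class FlakyBroker:
    """Stand-in broker whose second call stores the batch but loses the ack."""

    def __init__(self):
        self.calls = 0
        self.stored = []

    def send_batch(self, batch):
        self.calls += 1
        self.stored.extend(batch)      # the write itself succeeds...
        return self.calls != 2         # ...but the ack for the 2nd call is lost


broker = FlakyBroker()
messages = [f"msg-{i}" for i in range(450)]
trips = send_sync_batches(messages, broker.send_batch)
print(trips, len(broker.stored), len(set(broker.stored)))   # 4 650 450
```

The 450 messages need only 4 round trips, and the retried batch is stored twice: every message arrives at least once, but duplicates are possible.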

In this mode, data is lost, or appears to be lost, in the following cases:

①: Same as ① for synchronous_ack non-batch mode.

②: Same as ② for synchronous_ack non-batch mode.

③: Because messages are sent in batches, the producer buffers part of the data in memory. If the producer goes down, the buffered messages that have not yet been sent are lost. To handle this, the producer side needs to persist messages locally and periodically checkpoint the offset up to which persisted messages have been sent to Kafka. If the producer goes down unexpectedly, it recovers by resending the data from the checkpoint.
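The persistence-plus-checkpoint idea for the producer can be sketched as a local outbox file with a separate checkpoint file. This is one possible design under simplifying assumptions (one message per line, a single writer, a hypothetical `send_batch` callable); it is not the implementation used in the test.

```python
import os
import tempfile


class DurableProducer:
    """Appends messages to a local log file and checkpoints the sent position."""

    def __init__(self, directory, send_batch):
        self.log_path = os.path.join(directory, "outbox.log")
        self.checkpoint_path = os.path.join(directory, "outbox.checkpoint")
        self.send_batch = send_batch       # hypothetical callable; True means acked

    def append(self, message):
        """Persist a message locally before any attempt to send it."""
        with open(self.log_path, "a", encoding="utf-8") as f:
            f.write(message + "\n")
            f.flush()
            os.fsync(f.fileno())

    def _read_checkpoint(self):
        try:
            with open(self.checkpoint_path, encoding="utf-8") as f:
                return int(f.read().strip() or 0)
        except FileNotFoundError:
            return 0

    def _write_checkpoint(self, position):
        tmp = self.checkpoint_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            f.write(str(position))
        os.replace(tmp, self.checkpoint_path)   # atomic swap of the checkpoint file

    def flush(self, batch_size=200):
        """Send everything after the checkpoint; advance it after each acked batch."""
        if not os.path.exists(self.log_path):
            return 0
        with open(self.log_path, encoding="utf-8") as f:
            lines = f.read().splitlines()
        position = self._read_checkpoint()
        while position < len(lines):
            batch = lines[position:position + batch_size]
            if not self.send_batch(batch):
                break                       # leave the checkpoint where it is
            position += len(batch)
            self._write_checkpoint(position)
        return position


received = []
with tempfile.TemporaryDirectory() as d:
    producer = DurableProducer(d, send_batch=lambda batch: False)   # broker unreachable
    for i in range(5):
        producer.append(f"gift-{i}")
    first = producer.flush(batch_size=2)      # nothing acked, checkpoint stays at 0

    # "Restart": a new producer over the same directory resends from the checkpoint.
    restarted = DurableProducer(d, send_batch=lambda batch: received.extend(batch) or True)
    final = restarted.flush(batch_size=2)

print(first, final, received)
```

A crash between an acknowledged send and the checkpoint write causes that batch to be resent after restart, so this design also gives at-least-once delivery and relies on idempotent consumers.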

Reference scheme

For the Laifeng business scenario, messages fall roughly into the following three types:

1. Large data volume, and a small amount of data loss is acceptable. Examples: users entering a channel, chatting, and going online or offline. These use asynchronous_noack mode.

2. Small data volume, and no data loss is allowed. Examples: Golden Horn, grabbing the sofa, and guardian. These use synchronous_ack mode.

3. Large data volume, and no data loss is allowed. Examples: stars, gifts, and voting. The producers for these use synchronous_ack_batch mode.

Customizing the sending strategy for each type of message lets messages reach the Kafka servers both reliably and efficiently.
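The three categories reduce to a small decision on two properties: data volume and whether loss is tolerated. The sketch below shows one way to encode that routing. The event keys are illustrative labels, and sending low-volume, loss-tolerant messages through asynchronous_noack is an assumption, since the three types above do not cover that combination.

```python
from enum import Enum


class SendMode(Enum):
    ASYNC_NOACK = "asynchronous_noack"
    SYNC_ACK = "synchronous_ack"
    SYNC_ACK_BATCH = "synchronous_ack_batch"


def choose_mode(high_volume, loss_tolerated):
    """Pick a send mode from the two properties used in the three message types."""
    if loss_tolerated:
        return SendMode.ASYNC_NOACK     # also used for low-volume lossy data (assumption)
    return SendMode.SYNC_ACK_BATCH if high_volume else SendMode.SYNC_ACK


# (high_volume, loss_tolerated) per event type; the keys are illustrative labels.
EVENT_PROFILES = {
    "enter_channel": (True, True),
    "chat": (True, True),
    "grab_sofa": (False, False),
    "guardian": (False, False),
    "gift": (True, False),
    "vote": (True, False),
}

for event, (high_volume, loss_tolerated) in EVENT_PROFILES.items():
    print(event, "->", choose_mode(high_volume, loss_tolerated).value)
```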

Of course, making the business reliable takes more than the reliability and performance guarantees of the Kafka server side. The clients (producers and consumers) must also implement data persistence, data verification and recovery, idempotent operations, and transactions.

Operations is also an essential link: monitor the state of the clients and servers, raise alerts quickly when something goes wrong, and handle problems promptly to keep the Kafka cluster stable. We will follow up with more articles on operating Kafka and Storm, so stay tuned!

------Author's resume-----------

Chunling, a graduate of the University of Posts and Telecommunications, currently works on Youku Tudou's big data infrastructure platform and is responsible for optimizing and operating the group's real-time computing platform. Personal focus: research into the underlying technology of Storm and Kafka.
