Distributed Bullet Screen Service Architecture


This is a brief record of design ideas for a bullet screen (live comment) server; I hope it helps.

Business Features

Bullet comments are submitted far less often than they are delivered. If a room has 100,000 viewers, perhaps only 1,000 comments are submitted per second, but each comment must be broadcast to all 100,000 viewers.

Single-Server Model

To push messages, long-lived connections are almost certainly the only option.

Each room has some number of viewers, all connected to a single service process.

When a comment is submitted, the process looks up all online users in that room and loops over them, pushing the comment to each one.

Assume a single service process can make 500,000 network sends per second. Then in a room with 100,000 viewers, just 5 submitted comments per second (5 × 100,000 = 500,000 pushes) will hit the server's performance limit.
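The fan-out described above can be sketched as follows. This is a minimal, illustrative model (all names are hypothetical, and a list stands in for real sockets); the point is that one submitted comment costs one network send per online viewer.

```python
# Minimal sketch of the single-process model: one room, and every
# submitted comment fans out to every online connection.

class FakeConn:
    """Stand-in for a real long-lived connection (e.g. a TCP socket)."""
    def __init__(self):
        self.sent = []

    def send(self, msg):
        self.sent.append(msg)

class Room:
    def __init__(self):
        self.conns = []              # one entry per online viewer

    def join(self, conn):
        self.conns.append(conn)

    def broadcast(self, comment):
        # One network send per viewer: N viewers => N sends per comment.
        for conn in self.conns:
            conn.send(comment)

room = Room()
viewers = [FakeConn() for _ in range(1000)]
for v in viewers:
    room.join(v)
room.broadcast("hello")
# With 100,000 viewers and a 500,000 sends/sec budget, only
# 500,000 / 100,000 = 5 comments per second saturate the server.
```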

Multi-Server Model

Suppose a live room still has 100,000 people online and we want to get past the 5-comments-per-second bottleneck. The obvious idea is to scale out.

Suppose there are now 2 servers, A and B, and the 100,000 viewers are split evenly between them, 50,000 online per server.

Now when any user submits a comment to server A, A pushes it 50,000 times and forwards it to server B, which in turn pushes it another 50,000 times.

To reach the limits of both A and B now takes only 10 comments per second (10 × 50,000 = 500,000 sends per server). Is the benefit of scaling out really that small?
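The scaling arithmetic can be checked directly. This small helper (hypothetical name) uses the numbers from the text: with N servers, each comment still fans out to every viewer, just split across servers, so the saturating comment rate only grows linearly in N.

```python
# Back-of-envelope check of the scale-out math from the text.

def max_comments_per_sec(per_server_sends, viewers, servers):
    # Each server holds viewers/servers connections and must push
    # every comment to each of them.
    pushes_per_comment_per_server = viewers // servers
    return per_server_sends // pushes_per_comment_per_server

one = max_comments_per_sec(500_000, 100_000, 1)   # 5 comments/sec
two = max_comments_per_sec(500_000, 100_000, 2)   # 10 comments/sec
```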

Clearly something less obvious is going on.

Batch Model

In fact, if each comment is sent as a separate TCP packet, the NIC quickly becomes the bottleneck: nearly every packet costs a trip through the kernel and an interrupt before it reaches the NIC, which is very unfriendly to it.

From practical experience, a Gigabit NIC can send on the order of a million packets per second, so the 500,000 sends/second figure I used earlier actually understates what the hardware can do.

If we can reduce the number of network packets, we can relieve the NIC bottleneck and break through the 500,000-broadcasts-per-second limit.

The idea is batching: instead of sending each comment to viewers immediately, cache it, and once every second send all cached comments as a single payload down each viewer's TCP connection.

With this implementation, whether 50 or 10,000 comments are submitted per second, the number of push calls per second stays at 100,000: comments are aggregated per second, so the number of network calls depends only on the number of online viewers.

With the same server performance (a limit of 500,000 network calls per second), we can now support 500,000 concurrent viewers: each second, the cached comments are sent as a single batch to each of the 500,000 viewers, i.e. 500,000 network calls in that second, within the NIC's capacity.

On the network side, increasing the size of a single packet reduces the packet count and improves bandwidth utilization, while the number of comments delivered per second is unaffected (fewer packets go out per second, but each packet carries many comments, so overall throughput is unchanged). In other words, we trade a little latency for throughput.
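The batching scheme above can be sketched like this. It is a minimal model (names hypothetical, a list stands in for a socket, and a manual `flush` stands in for a once-per-second timer): no matter how many comments arrive in an interval, the number of network sends per flush equals the number of connections.

```python
# Sketch of per-second batching: comments are buffered and flushed to
# every connection once per tick, so sends per tick equal the number
# of viewers, independent of the comment rate.

class FakeConn:
    def __init__(self):
        self.sent = []

    def send(self, msg):
        self.sent.append(msg)

class Batcher:
    def __init__(self, conns):
        self.conns = conns
        self.buffer = []

    def submit(self, comment):
        self.buffer.append(comment)       # no network I/O here

    def flush(self):
        # In a real server this would be driven by a 1-second timer.
        batch, self.buffer = self.buffer, []
        if not batch:
            return 0
        payload = "\n".join(batch)        # one packet carrying many comments
        for conn in self.conns:
            conn.send(payload)
        return len(self.conns)            # network sends this tick

conns = [FakeConn() for _ in range(100)]
b = Batcher(conns)
for i in range(10_000):                   # 10,000 comments this second...
    b.submit(f"c{i}")
sends = b.flush()                         # ...but only len(conns) sends
```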

What's next?

According to load-test data from Bilibili's barrage service, a single server pushed 35 million+ comments per second to about 1 million online viewers, so a single batched packet carried about 35,000,000 / 1,000,000 = 35 comments. That is, with roughly 35 comments per second submitted to the server and each batch broadcast to all viewers, bandwidth becomes the bottleneck.

With the viewer count fixed, even at 10,000 comments per second the bottlenecks are only bandwidth (the batched packets get larger) and CPU (aggregating the messages); if the viewer count drops to 800,000, bandwidth gains headroom and CPU load falls correspondingly.
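The load-test arithmetic above can be verified in a couple of lines (the figures are the ones quoted from the text, not independently measured):

```python
# Checking the quoted Bilibili load-test numbers: 35M+ comments pushed
# per second to ~1M viewers implies ~35 comments per batched packet.

pushed_per_sec = 35_000_000
viewers = 1_000_000
comments_per_packet = pushed_per_sec // viewers   # comments carried per batch
```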

The bottleneck of the entire barrage system is thus determined not by the number of comments per second but by the number of viewers:

Network sends per second = number of viewers (independent of the number of comments)

This is a very important conclusion! By adding servers to partition the audience, the number of pushes each server makes per second goes down; as long as the cluster has enough egress bandwidth, everything else follows.

Architecture

System design should value simplicity, keeping complexity to a minimum.

In essence, a bullet screen service is an IM (instant messaging) system, and IM systems generally support 1-to-1 and 1-to-N chat.

IM systems follow some common conventions for scalability; below is a brief walkthrough based on Bilibili's bullet-screen architecture.

Comet is the gateway: stateless, responsible for holding client long connections and delivering messages.

Logic is the logic server: stateless, implements the business logic, and maintains high-performance connections to each comet.

Router is stateful (in-memory state); logic stores each user's session information on a router chosen by consistent hashing of the user's UID.
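The consistent-hash routing step can be sketched as follows. This is an illustrative implementation with hypothetical names; goim's actual routing code differs in its details. The property that matters is that the same UID always maps to the same router, and adding or removing a router only remaps a small fraction of UIDs.

```python
# Sketch of picking a router node by consistent hashing on the UID.

import bisect
import hashlib

class ConsistentHash:
    def __init__(self, nodes, replicas=100):
        # Place `replicas` virtual points per node on a sorted ring.
        self.ring = []                          # sorted (hash, node) points
        for node in nodes:
            for i in range(replicas):
                h = self._hash(f"{node}#{i}")
                bisect.insort(self.ring, (h, node))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h, ""))   # first ring point >= h
        if i == len(self.ring):
            i = 0                               # wrap around the ring
        return self.ring[i][1]

ring = ConsistentHash(["router-1", "router-2", "router-3"])
node = ring.node_for("uid:12345")   # deterministic for a given UID
```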

A UID can join multiple rooms and hold multiple connections to different comet nodes, and each live room may have users online on different servers.

To send a comment to a room, the user calls the logic service directly over HTTP; logic publishes the message to Kafka, and the job service consumes it and broadcasts it to all comet nodes.

A message can also be addressed in different ways: directly to one person, to one room, or to all rooms.

To send to an individual: logic queries the router to learn which servers the UID is connected to, then pushes a record to Kafka; the job service delivers the message to those servers, so the user sees the push in whichever of their N live rooms they have open.

Sending to all rooms, or to a single room, is similar: job broadcasts to all comet nodes, and each comet then sends the message to the relevant users.
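The three delivery modes above can be sketched with in-memory stand-ins (all names hypothetical: a dict for router state, a list for the Kafka topic). The point is only the routing decision: unicast consults the router for the target comets, while room and global sends go to every comet.

```python
# Sketch of the three delivery modes: unicast, one room, all rooms.

router = {"uid42": {"comet-1", "comet-3"}}   # uid -> comet servers it is on
all_comets = {"comet-1", "comet-2", "comet-3"}
kafka = []                                   # stand-in for the job queue

def send_to_user(uid, msg):
    # Unicast: only the comets the UID is actually connected to.
    targets = router.get(uid, set())
    kafka.append(("user", uid, targets, msg))

def send_to_room(room, msg):
    # Room send: job broadcasts to all comets; each comet filters by room.
    kafka.append(("room", room, all_comets, msg))

def send_to_all(msg):
    # Global send: every comet, every user.
    kafka.append(("all", None, all_comets, msg))

send_to_user("uid42", "hi")
send_to_room("room7", "hello room")
send_to_all("hello everyone")
```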

Performance

In practice, push has a tricky concurrency problem: the user population is very large and users frequently come online and go offline, so the online-user set must be protected by a lock.

Pushing requires traversing that same set, so the two operations contend heavily.

This can only be solved by splitting the set: each shard maintains only a subset of the users, partitioned by hashing the UID.

When pushing a broadcast, iterate over the shards, locking each small set in turn. Pushing to a room is similar: iterate over the shards and pick out the users in each shard that belong to the room.
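The sharded-lock idea can be sketched as follows (a minimal model with hypothetical names; a list stands in for a socket). Each bucket has its own lock, so a join or leave on one bucket does not block a push traversing another, and a broadcast only ever holds one small lock at a time.

```python
# Sketch of sharding the online-user set to reduce lock contention.

import threading

NUM_SHARDS = 16

class OnlineUsers:
    def __init__(self):
        self.shards = [dict() for _ in range(NUM_SHARDS)]   # uid -> conn
        self.locks = [threading.Lock() for _ in range(NUM_SHARDS)]

    def _idx(self, uid):
        return hash(uid) % NUM_SHARDS

    def join(self, uid, conn):
        i = self._idx(uid)
        with self.locks[i]:
            self.shards[i][uid] = conn

    def leave(self, uid):
        i = self._idx(uid)
        with self.locks[i]:
            self.shards[i].pop(uid, None)

    def push_all(self, msg):
        sent = 0
        for i in range(NUM_SHARDS):       # lock one small shard at a time
            with self.locks[i]:
                for conn in self.shards[i].values():
                    conn.append(msg)      # list stands in for a socket send
                    sent += 1
        return sent

users = OnlineUsers()
for uid in range(1000):
    users.join(uid, [])
users.leave(0)
count = users.push_all("batch-payload")   # pushes to the 999 remaining users
```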

Extended Reading

Bilibili's barrage system (goim) is open source; if you are interested, you can study the code in detail. It uses the Go standard library's RPC and additionally depends on Kafka; the overall design is not complicated.

GitHub: https://github.com/terry-mao/goim/

Blog: http://geek.csdn.net/news/detail/96232
