How I got a 20x performance improvement on an adult website [Python]

Source: Internet
Author: User
Tags: epoll, website performance

The sex industry is big business. Few websites on the internet can match the traffic of the largest pornographic sites.

Handling that much traffic is hard. Harder still, much of the content served by pornographic sites is low-latency live streaming rather than simple static video. Yet for all the challenges involved, I have rarely seen write-ups of what developers actually did about them, so I decided to write down my own experience in this area.

What's the problem?

A few years ago, I was working on a website that ranked 26th in the world by traffic; not 26th among pornographic sites, but 26th among all websites.

At the time, the site served live pornographic streams over RTMP (Real-Time Messaging Protocol). More specifically, it used Adobe's FMS (Flash Media Server) to deliver real-time streams to users. The basic flow was this:

    1. A user requests access to a live stream
    2. The server responds with an RTMP session that plays the requested video

For several reasons, FMS was not a good choice for us. First and foremost, its cost, which included paying for both of the following:

    1. A Windows license for every server running FMS
    2. Roughly $4,000 per FMS-specific license; given our scale, we had to buy hundreds of them, with more added every day

All of these costs add up. Cost aside, FMS is a fairly lame product, especially in terms of functionality (more on that later). So I decided to drop FMS and write my own RTMP parser from scratch.

In the end, I improved our serving efficiency by roughly 20x.

Start

There were two core problems. First, RTMP and Adobe's other protocols and formats are not open, which makes them hard to work with: if you know nothing about a format, how do you reverse engineer or parse it? Fortunately, some reverse-engineering work already existed in the public domain (not from Adobe, but from osflash.org, which had cracked parts of the protocol), and our work built on those results.

Note: Adobe later released a so-called "spec", which revealed nothing beyond what had already been disclosed in the reverse-engineering wikis and documents produced outside Adobe. The quality of their spec was absurd, and it was nearly impossible to use their libraries from the manual alone. Moreover, the protocol itself often seems deliberately misleading. For example:

    1. They use 29-bit integers (a sketch of decoding one follows this list).
    2. Everywhere in the protocol headers they use big-endian byte order (most significant byte at the lowest address), except in one field, which is little-endian and not marked as such.
    3. They waste computing power compressing data to shave off a few bytes, which is negligible when they are shipping video payloads of 9 KB at a time.
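
For illustration, here is a minimal sketch of decoding such a 29-bit value, assuming it refers to the variable-length U29 encoding used by AMF3 (the serialization format carried inside RTMP messages); this is not the production parser:

    def read_u29(data, pos=0):
        """Decode an AMF3-style variable-length 29-bit unsigned integer.

        The first three bytes each contribute 7 bits (the high bit means
        "another byte follows"); if a fourth byte is present it contributes
        a full 8 bits, for 29 bits total.
        """
        value = 0
        for i in range(3):
            byte = data[pos + i]
            if byte & 0x80:                          # continuation bit set
                value = (value << 7) | (byte & 0x7F)
            else:
                return (value << 7) | byte, pos + i + 1
        # Fourth byte: all 8 bits are payload
        return (value << 8) | data[pos + 3], pos + 4

    # Example: the two bytes 0x81 0x00 decode to 128
    assert read_u29(bytes([0x81, 0x00])) == (128, 2)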

Second, RTMP is highly session-oriented, which makes it practically impossible to multicast a stream. Ideally, when multiple users want to watch the same live stream, we could simply hand them pointers to a single session and push the video through that session (this is the idea behind multicast). But with RTMP we had to create a completely separate instance for every user who requested a given stream. That is a total waste.

My way out

With that in mind, I decided to re-packetize the typical response stream into FLV "tags" (a "tag" being a chunk of video, audio, or metadata). These FLV tags could then be transmitted smoothly over RTMP.

The benefits of such a method are:

    • We only need to repackage a stream once (repackaging was a nightmare, given the lack of specs and the disgusting protocols mentioned above).
    • By prepending an FLV header, we can freely reuse any stream across clients, while internal FLV tag pointers (offsets that declare each tag's exact position within the stream) give access to the actual content (a minimal sketch of building such a tag follows this list).
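
To make the idea concrete, here is a sketch of wrapping a payload as an FLV tag, following the public FLV layout (1-byte type, 24-bit data size, 24-bit timestamp plus an 8-bit extension, a 3-byte stream ID of zero, the payload, and a 4-byte PreviousTagSize trailer). It is illustrative, not the production code:

    import struct

    TAG_AUDIO, TAG_VIDEO, TAG_SCRIPT = 8, 9, 18

    def make_flv_tag(tag_type, payload, timestamp_ms=0):
        header = bytes([tag_type])
        header += struct.pack(">I", len(payload))[1:]              # 24-bit data size
        header += struct.pack(">I", timestamp_ms & 0xFFFFFF)[1:]   # 24-bit timestamp
        header += bytes([(timestamp_ms >> 24) & 0xFF])             # timestamp extension
        header += b"\x00\x00\x00"                                  # stream ID, always 0
        tag = header + payload
        return tag + struct.pack(">I", len(tag))                   # PreviousTagSize

    # Example: wrap a hypothetical 2-byte video payload at t = 40 ms
    tag = make_flv_tag(TAG_VIDEO, b"\x17\x01", timestamp_ms=40)
    assert len(tag) == 11 + 2 + 4                                  # header + payload + trailer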

I started out in C, the language I was most familiar with at the time. Over time that choice became a burden, so I started learning Python and ported my C code. Development sped up, but after building a few demo versions I quickly ran into resource-exhaustion problems. Python's socket handling was not suited to this kind of workload; specifically, we found that every action in our Python code triggered multiple system calls and context switches, which added enormous overhead.

Improving performance: mixing Python with C

After going through the code, I chose to move the most performance-critical functions into a Python module written entirely in C. This is fairly low-level stuff; specifically, it leverages the kernel's epoll mechanism to provide O(log n) algorithmic complexity.

In asynchronous socket programming, there are mechanisms that tell you whether a given socket is readable, writable, or in an error state. In the past, developers used the select() system call to get this information, but it is hard to use at scale. poll() is a better option, but it is still not good enough, because you have to pass the whole list of socket descriptors on every call.

The magic of epoll is that you register a socket only once; the kernel remembers it and handles all the internal bookkeeping, so there is no cost of passing the parameters again on every call. It also scales beautifully: it returns only the sockets you care about, which beats scanning a bitmask over a list of 100,000 socket descriptors to find out which ones have events, as other techniques require.
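
For illustration, here is a minimal sketch of that epoll pattern using Python's built-in select.epoll wrapper (Linux only), rather than the hand-written C module described above; the port number and buffer size are illustrative:

    import select
    import socket

    server = socket.socket()
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("0.0.0.0", 1935))                 # 1935 is the conventional RTMP port
    server.listen(128)
    server.setblocking(False)

    ep = select.epoll()
    ep.register(server.fileno(), select.EPOLLIN)   # register once; the kernel remembers it
    conns = {}

    while True:
        # Only the descriptors that actually have pending events are returned.
        for fd, events in ep.poll(timeout=1):
            if fd == server.fileno():
                client, _ = server.accept()
                client.setblocking(False)
                ep.register(client.fileno(), select.EPOLLIN)
                conns[client.fileno()] = client
            elif events & (select.EPOLLHUP | select.EPOLLERR):
                ep.unregister(fd)
                conns.pop(fd).close()
            elif events & select.EPOLLIN:
                data = conns[fd].recv(4096)
                if not data:                       # peer closed the connection
                    ep.unregister(fd)
                    conns.pop(fd).close()
                # otherwise: hand `data` to the RTMP layer (see the main loop below)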

However, we paid a price for this performance gain: the new approach uses a completely different design pattern than before. The site's previous approach was (if I remember correctly) a single monolithic process that blocked on receives and sends. I was building an event-driven design, so I had to refactor the rest of the code to fit the new model.

Specifically, in the new method, we have a main loop that handles receiving and sending as follows:

    1. Incoming data is passed to the RTMP layer as messages
    2. RTMP packets are parsed and FLV tags are extracted from them
    3. The FLV data is handed to the caching and multicast layer, where streams are organized and filled into low-level transmit buffers
    4. The sender keeps a structure per client, containing the last-sent index, and pushes as much data to the client as it can

This acts as a rolling data window, with some heuristics that drop frames when a client is too slow to keep up. Overall it works very well; a minimal sketch follows.
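
Here is a sketch of that loop and the per-client sender state. parse_rtmp() and extract_flv_tags() are hypothetical stand-ins for the C-backed parsing code, and MAX_LAG is an illustrative threshold; none of this is the production code:

    def parse_rtmp(data):
        """Placeholder: would split raw bytes into RTMP messages."""
        return []

    def extract_flv_tags(message):
        """Placeholder: would pull packed FLV tags out of an RTMP message."""
        return []

    class Stream:
        """Shared per-stream buffer of FLV tags; each tag is stored exactly once."""
        def __init__(self):
            self.tags = []                 # the real code also trims old tags

    class Client:
        """Per-client sender state: just the index of the next tag to send."""
        def __init__(self, sock, stream):
            self.sock = sock
            self.stream = stream
            self.next_index = 0

    MAX_LAG = 512                          # heuristic threshold before dropping frames

    def on_rtmp_data(stream, data):
        """Steps 1-3: parse incoming data and append FLV tags to the shared buffer."""
        for message in parse_rtmp(data):
            stream.tags.extend(extract_flv_tags(message))

    def pump_client(client):
        """Step 4: push as much buffered data to the client as it will take."""
        stream = client.stream
        if len(stream.tags) - client.next_index > MAX_LAG:
            client.next_index = len(stream.tags)   # too slow: skip to the live edge
        while client.next_index < len(stream.tags):
            try:
                client.sock.send(stream.tags[client.next_index])
            except BlockingIOError:
                break                              # socket buffer full; retry next tick
            client.next_index += 1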

System-level, architectural, and hardware issues

But there was another problem: kernel context switches were becoming a burden. As a result, we chose to send data every 100 milliseconds rather than in real time. This aggregates small packets and avoids an explosion of context switches (sketched below).
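
A sketch of that batching, reusing the pump_client() helper from the previous sketch; the 100 ms interval is from the text, while serve_forever() and the poll_events callback are illustrative names:

    import time

    SEND_INTERVAL = 0.1       # flush to clients every ~100 ms instead of per packet

    def serve_forever(clients, poll_events):
        """poll_events() is a placeholder for the epoll loop that fills the stream buffers."""
        last_flush = time.monotonic()
        while True:
            poll_events()
            now = time.monotonic()
            if now - last_flush >= SEND_INTERVAL:
                for client in clients:
                    pump_client(client)    # one coalesced write per client per tick
                last_flush = now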

Perhaps the bigger problem was server architecture: we needed a cluster with load balancing and failover; after all, losing users to a misbehaving server is no fun. At first we went with a dedicated "manager" server responsible for spawning and killing playback streams by predicting demand. It failed beautifully. In fact, just about every method we tried failed quite visibly. In the end we used a fairly brute-force approach: randomly distributing playback streams across the nodes of the cluster, which kept traffic roughly balanced (see the sketch below).
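
A sketch of that brute-force placement; the node names are made up, and the count of 40 simply mirrors the cluster size mentioned in the statistics below:

    import random

    # Hypothetical node list; the real clusters had roughly 40 servers each.
    NODES = ["stream-%02d.example.net" % i for i in range(40)]

    def assign_node(stream_id):
        """Place a new playback stream on a random node; at scale this evens out."""
        return random.choice(NODES)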

This approach works, but it has shortcomings: although it handles the general case well, we saw terrible performance when all of the site's users (or a significant percentage of them) watched a single broadcast stream. The good news is that, apart from one marketing campaign, this never happened again. We deployed a separate cluster to handle that scenario, but after analyzing it we honestly concluded that sacrificing the experience of paying users for a campaign was not justified, and in fact the case never became a real event (although it would have been nice to handle every imagined scenario).

Conclusion

Some statistics on the final result: peak daily traffic in the cluster was about 100,000 users (60% load), with around 50,000 on average. I managed 2 clusters (Hungary and the United States), each with about 40 servers sharing the load. The combined bandwidth of these clusters was several Gbps, reaching roughly 10 Gbps at peak load. In the end I tuned each server to comfortably push up to 10 Gbps, which is equivalent to a single server handling 300,000 users watching video streams at the same time.

The existing FMS cluster had more than 200 servers. I needed only 15 to replace them, and only 10 of those actually served traffic. That works out to 200 / 10, a 20x performance improvement. Probably the biggest thing this project taught me is that I should not let the difficulty of learning new skills stop me. Specifically, Python, transcoding, and object-oriented programming were all concepts I had no professional experience with before this project.

That belief, together with the confidence to carry out your own plan, will reward you greatly.

[1] Later, when we put the new code into production, we ran into a hardware problem: our old SR2500 Intel servers could not drive 10 Gbit Ethernet cards because their PCI bus bandwidth was too low. We had to use them with bonds of 1-4 x 1 Gbit Ethernet ports (aggregating several NICs into a single virtual NIC). Eventually we got some newer SR2600 i7 Intel servers, which deliver 10 Gbps over fiber with no performance loss. All of the results summarized above were measured on that hardware.

English original: Gergely Kalman; Chinese translation via blog.jobbole.com
Link: http://blog.jobbole.com/39323/
