Experience with "second-open" optimization in mobile live-streaming technology


The challenges of today's mobile live-streaming technology are far greater than those of live streaming on traditional devices or PCs. A complete pipeline includes, but is not limited to: audio and video capture, beauty/filter/effect processing, encoding, packaging, stream pushing, transcoding, distribution, decoding/rendering/playback, and so on.

Common problems in live streaming include:

    • How does the broadcaster push a stable stream in an unstable network environment?
    • How can viewers in remote areas watch a live stream in HD?
    • How can the line be switched intelligently, and instantly, when the live stream stalls?
    • How can live quality indicators be measured accurately and adjusted in real time?
    • How can video be encoded and rendered with high performance on the different chip platforms found in mobile devices?
    • How should filter effects such as beautification be handled?
    • How can "second-open" playback (the picture appearing within a second) be achieved?
    • How can continuous, smooth playback of a live stream be ensured?

Basic knowledge of video, live broadcast, etc.
What is a video?

First we need to understand one of the most basic concepts: video. From a perceptual point of view, a video is an engaging piece of film: a movie, a short clip, a coherent stream of rich images and audio with real visual impact. But from a rational point of view, a video is structured data. In the language of engineering, we can dissect a video into the following structure:

Content elements (Content)

    • Images (image)
    • Audio
    • Meta information (Metadata)

Encoding Format (CODEC)

    • Video: H.264, H.265, ...
    • Audio: AAC, HE-AAC, ...

Container format (Container)

    • MP4, MOV, FLV, RM, RMVB, AVI, ...

Structurally speaking, any video file is composed as follows:

    • The most basic content elements are the images and the audio;
    • The images are compressed with a video encoding format (typically H.264);
    • The audio is compressed with an audio encoding format (e.g. AAC);
    • The corresponding meta-information (Metadata) is attached;

Finally, everything is packaged by a container format (Container) such as MP4 to form a complete video file.
If this is hard to picture, imagine a bottle of ketchup. The outermost bottle is like the container (Container); the label listing the ingredients and the factory is like the meta-information (Metadata); opening the lid is unpackaging; the ketchup itself is like the compressed, encoded content; processing tomatoes and spices into ketchup is the encoding (CODEC); and the raw tomatoes and spices are the original content elements.

Real-time transmission of video
In short, a rational view of video's structure helps us understand live video. If a video is "structured data", then live video is simply a way of transmitting this "structured data" (video) in real time.
The obvious question is: how do you transmit this "structured data" (video) in real time?
Here lies a paradox: a video that has been packaged into a container (Container) is necessarily an immutable (immutable) video file, and an immutable video file is already a finished product. Such a finished product obviously cannot be real-time; it is a memory of a past point in space and time.
Therefore, live video must be a process of "producing, transmitting, and consuming at the same time". This means we need a closer look at the intermediate step (encoding) that takes video from its original content elements (images and audio) to the finished product (a video file).

Video encoding compression
Let us look at video encoding and compression in plain terms.
To make video content easier to store and transmit, its volume usually has to be reduced: the original content elements (images and audio) are compressed, and the compression algorithm is what we call the encoding format. For example, the raw image data in a video is compressed with a video encoding format such as H.264, and the raw audio sample data is compressed with the AAC encoding format.
Encoding and compression make video content easier to store and transmit, but to watch it back, the content has to be decoded accordingly. So the encoder and the decoder need a clear convention that both sides understand. For video images, the convention is simple:
The encoder encodes multiple images into a GOP (Group of Pictures); when decoding, the decoder reads one GOP at a time, decodes the pictures, and then renders them for display.

A GOP (Group of Pictures) is a contiguous group of images consisting of one I-frame and several B/P-frames. It is the basic unit that video encoders and decoders access, and this ordered sequence repeats until the end of the video.

I-frames are intra-coded frames (also called keyframes), P-frames are forward-predicted frames (forward reference frames), and B-frames are bidirectionally predicted frames (bidirectional reference frames). In a nutshell, an I-frame is a complete picture, while P-frames and B-frames record changes relative to the I-frame.
Without the I-frame, P-frames and B-frames cannot be decoded.

To summarize: the image data of a video is a series of GOPs, and a single GOP is a set of I/P/B-frame images.
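To make this structure concrete, here is a minimal Go sketch, with hypothetical type and field names chosen only for illustration, of how a video's image data can be modeled as GOPs built from I/P/B-frames:

```go
package main

import "fmt"

// FrameType distinguishes the three picture types described above.
type FrameType int

const (
	IFrame FrameType = iota // intra-coded frame: a complete picture (keyframe)
	PFrame                  // forward-predicted frame: changes relative to earlier frames
	BFrame                  // bidirectionally predicted frame: references earlier and later frames
)

// Frame is one encoded picture.
type Frame struct {
	Type FrameType
	Data []byte
}

// GOP is a contiguous group of pictures that starts with an I-frame.
type GOP struct {
	Frames []Frame
}

// Decodable reports whether the GOP can be decoded on its own:
// P/B-frames are useless without the leading I-frame.
func (g GOP) Decodable() bool {
	return len(g.Frames) > 0 && g.Frames[0].Type == IFrame
}

func main() {
	gop := GOP{Frames: []Frame{{Type: IFrame}, {Type: BFrame}, {Type: PFrame}}}
	fmt.Println("decodable:", gop.Decodable()) // true: the GOP begins with a keyframe
}
```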
In this relationship, the video is like an "object", a GOP is a "molecule", and an I/P/B-frame image is an "atom".
Now imagine transmitting an "object" by transmitting its "atoms" one by one, each of these smallest particles moving at the speed of light. What would that look like to the naked human eye?

What is a live video?
It does not take much imagination: live streaming is exactly that experience. Live video technology takes the smallest particles of the video content (I/P/B-frames, ...) and transmits them at the speed of light, ordered by their time sequence.
In short, live streaming is the process of transmitting each frame of data (video/audio/data frames), each tagged with a time-series label (Timestamp). The sending side continuously captures audio and video data, encodes it, packages it, and pushes the stream, which then spreads through a relay and distribution network; the playing side continuously downloads the data and decodes and plays it in time order. This is the live process of "producing, transmitting, and consuming at the same time".
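A minimal Go sketch of this "produce while transmitting while consuming" model, using goroutines and a channel as a stand-in for the relay network (all names are illustrative, not part of any real streaming SDK):

```go
package main

import (
	"fmt"
	"time"
)

// MediaFrame is one audio or video frame tagged with a timestamp,
// the basic unit that a live stream transmits.
type MediaFrame struct {
	Timestamp time.Duration // presentation time relative to the start of the stream
	Kind      string        // "video" or "audio"
	Payload   []byte
}

func main() {
	stream := make(chan MediaFrame, 16) // stands in for the relay/distribution network

	// Producer: capture + encode + push, frame by frame.
	go func() {
		start := time.Now()
		for i := 0; i < 5; i++ {
			stream <- MediaFrame{Timestamp: time.Since(start), Kind: "video"}
			time.Sleep(40 * time.Millisecond) // roughly 25 fps
		}
		close(stream)
	}()

	// Consumer: pull + decode + play, in timestamp order, while production continues.
	for f := range stream {
		fmt.Printf("play %s frame at t=%v\n", f.Kind, f.Timestamp)
	}
}
```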
With these two basic concepts, video and live streaming, in hand, we can now take a look at the business logic of live streaming.

Business logic for live streaming
The following is a streamlined one-to-many live-streaming business model, along with the protocols used at each tier.

The differences between the protocols are as follows:

These are some of the basic concepts behind live-streaming technology. Let's look further at the live performance indicators that shape the viewing experience.
Live performance metrics that affect the visual experience
The first live-streaming performance indicator is latency: the time it takes for data to travel from the source to the destination.

According to Einstein's special theory of relativity, the speed of light is the highest velocity at which energy, matter, and information can travel, and this sets an upper bound on transmission speed. So even when something looks real-time to the eye, there is always some delay.

RTMP and HLS are application-layer protocols built on TCP. TCP's three-way handshake, four-way teardown, and slow start all cost round-trip times (RTT), and each round trip adds latency.

Second, because TCP retransmits lost packets, network jitter can lead to packet loss and retransmission, which in turn can increase latency.

A complete live-streaming pipeline includes, but is not limited to: capture, processing, encoding, packaging, stream pushing, transmission, transcoding, distribution, stream pulling, decoding, and playback. The lower the delay from the pushing end to the playing end, across every intermediate forwarding link, the better the user experience.
The second live performance indicator is lag: frames freeze during playback and the viewer clearly feels the picture is "stuck". The number of lag events per unit of playback time is called the lag rate.
Lag may be caused by an interruption in the data sent from the pushing end, by congestion or jitter anomalies on the public network, or by poor decoding performance on the terminal device. The fewer and less frequent the lags, the better the user experience.
The third live performance indicator is first-screen time: the wait between the first tap on play and the moment a picture becomes visible. Technically, it is the time the player needs to decode the first frame and render it to the screen. The commonly quoted "second-open" means the picture appears within one second of tapping play. The faster the first screen appears, the better the user experience.
These three live performance indicators correspond, respectively, to the user-experience demands of low latency, smooth high definition, and fast "second-open". Understanding them is critical to optimizing the user experience of a mobile live-streaming app.
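As a rough illustration of how these indicators might be tracked, here is a hedged Go sketch that computes first-screen time, lag rate, and average latency from hypothetical playback measurements (the struct and its fields are assumptions for this example, not a real player API):

```go
package main

import (
	"fmt"
	"time"
)

// PlaybackStats gathers the raw measurements behind the three indicators.
type PlaybackStats struct {
	ClickTime      time.Time       // when the user tapped "play"
	FirstFrameTime time.Time       // when the first frame was rendered
	StallCount     int             // number of visible lag events
	WatchDuration  time.Duration   // total observed playback time
	SourceToPlayer []time.Duration // per-sample end-to-end delays, source to display
}

// FirstScreenTime is the wait from tap to first picture ("second-open" targets < 1s).
func (s PlaybackStats) FirstScreenTime() time.Duration {
	return s.FirstFrameTime.Sub(s.ClickTime)
}

// LagRate is the number of lag events per minute of playback.
func (s PlaybackStats) LagRate() float64 {
	return float64(s.StallCount) / s.WatchDuration.Minutes()
}

// AvgLatency is the mean end-to-end delay from source to display.
func (s PlaybackStats) AvgLatency() time.Duration {
	var sum time.Duration
	for _, d := range s.SourceToPlayer {
		sum += d
	}
	return sum / time.Duration(len(s.SourceToPlayer))
}

func main() {
	now := time.Now()
	s := PlaybackStats{
		ClickTime:      now,
		FirstFrameTime: now.Add(800 * time.Millisecond),
		StallCount:     3,
		WatchDuration:  10 * time.Minute,
		SourceToPlayer: []time.Duration{2 * time.Second, 3 * time.Second},
	}
	fmt.Println(s.FirstScreenTime(), s.LagRate(), s.AvgLatency())
}
```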
So what are the common pitfalls in mobile live-streaming scenarios?
Based on practical experience, the pitfalls of live video on mobile platforms come down to two aspects: device differences, and the technical challenges posed by the network environments in which those devices are used.

Pitfalls in mobile live-streaming scenarios and how to avoid them
Encoding differences across chip platforms

On the iOS platform, whether hardware or software encoding is used, there is little variation across chip platforms, because every device comes from a single manufacturer, Apple.
On the Android platform, however, the MediaCodec encoder provided by the Android Framework SDK differs greatly across chip platforms. Different manufacturers use different chips, and MediaCodec performs differently on each of them, so the cost of achieving full platform compatibility is often not low.
Another issue is that Android MediaCodec hardware encoding fixes the encoding quality parameters at the baseline profile, so the resulting quality is usually mediocre. For these reasons, on Android the recommendation is software encoding: picture quality is adjustable and compatibility is better.

How can low-end devices capture and encode with high performance?

For example, camera capture may output a stream of images, and each image is not small. If the capture frequency is high and the encoding frame rate is high, every image goes through the encoder, and the encoder may be overloaded.
In that case, consider dropping frames selectively before encoding, without affecting quality (recall the earlier "microscopic" view of frames), so as to reduce the power and processing cost of the encoding step; a sketch follows.
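A minimal Go sketch of such pre-encode frame dropping, capping the rate of frames handed to the encoder by their capture timestamps (the names are hypothetical; a real pipeline would place this between the camera callback and the encoder):

```go
package main

import (
	"fmt"
	"time"
)

// FrameThrottler drops frames so that at most roughly targetFPS frames per
// second reach the encoder, relieving low-end devices of encoding work.
type FrameThrottler struct {
	minInterval time.Duration // minimum spacing between frames that are kept
	lastKept    time.Time
}

func NewFrameThrottler(targetFPS int) *FrameThrottler {
	return &FrameThrottler{minInterval: time.Second / time.Duration(targetFPS)}
}

// ShouldEncode reports whether the frame captured at t should be encoded;
// frames arriving too soon after the last kept frame are dropped.
func (ft *FrameThrottler) ShouldEncode(t time.Time) bool {
	if t.Sub(ft.lastKept) < ft.minInterval {
		return false // drop before encoding: saves CPU and power
	}
	ft.lastKept = t
	return true
}

func main() {
	ft := NewFrameThrottler(15) // aim to encode at most ~15 fps
	start := time.Now()
	kept := 0
	for i := 0; i < 30; i++ { // simulate 30 frames captured at ~30 fps
		t := start.Add(time.Duration(i*33) * time.Millisecond)
		if ft.ShouldEncode(t) {
			kept++
		}
	}
	fmt.Println("frames passed to the encoder:", kept) // far fewer than the 30 captured
}
```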
How to guarantee smooth, high-definition stream pushing on a weak network


On mobile networks, instability is common: connections get reset, drop, and reconnect repeatedly. On the one hand, frequent reconnects mean repeatedly paying the cost of establishing a connection; on the other hand, especially when switching among GPRS/2G/3G/4G, bandwidth can become a bottleneck. When bandwidth is insufficient, high-frame-rate/high-bitrate content is hard to send out, and variable bitrate support is needed.
That is, the pushing end should detect the network status, run a simple speed test, and switch the bitrate dynamically so that the stream stays smooth across network changes, roughly as in the sketch below.
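A hedged Go sketch of the idea: measure the uplink throughput actually achieved over a short window and step the target bitrate down or up accordingly (the bitrate ladder and thresholds here are invented for illustration):

```go
package main

import "fmt"

// BitrateController picks an encoder bitrate from a ladder based on the
// send rate measured over the last window.
type BitrateController struct {
	ladder  []int // candidate bitrates in kbps, from low to high
	current int   // index into ladder
}

// Adjust compares the measured uplink throughput (kbps) with the current
// target bitrate and moves one rung at a time.
func (c *BitrateController) Adjust(measuredKbps int) int {
	target := c.ladder[c.current]
	switch {
	case measuredKbps < target && c.current > 0:
		c.current-- // the uplink cannot keep up: step the bitrate down
	case measuredKbps > target*2 && c.current < len(c.ladder)-1:
		c.current++ // plenty of headroom: step cautiously back up
	}
	return c.ladder[c.current]
}

func main() {
	c := &BitrateController{ladder: []int{300, 600, 1200, 2500}, current: 2}
	for _, measured := range []int{1500, 900, 500, 2600} {
		fmt.Println("measured:", measured, "kbps -> encode at", c.Adjust(measured), "kbps")
	}
}
```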
Second, the encoding, packaging, and pushing logic can also be fine-tuned: frames can be dropped selectively, preferentially discarding video frames other than keyframes (never dropping I-frames or audio frames). This reduces the amount of data to transmit while leaving picture quality and audio-visual fluency unaffected; a sketch follows.
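A minimal Go sketch of prioritized dropping in the push-side send queue: when the queue backs up, non-key video frames are discarded first, and I-frames and audio are never discarded (the frame classification is simplified for illustration):

```go
package main

import "fmt"

// QueuedFrame is a frame waiting in the push-side send queue.
type QueuedFrame struct {
	Kind       string // "audio" or "video"
	IsKeyframe bool   // true for video I-frames
}

// droppable: only non-key video frames (P/B) may be discarded under congestion.
func droppable(f QueuedFrame) bool {
	return f.Kind == "video" && !f.IsKeyframe
}

// shrinkQueue drops as many droppable frames as needed (oldest first) to get the
// queue down to maxLen where possible, preserving audio frames and video keyframes.
func shrinkQueue(q []QueuedFrame, maxLen int) []QueuedFrame {
	if len(q) <= maxLen {
		return q
	}
	kept := q[:0]
	excess := len(q) - maxLen
	for _, f := range q {
		if excess > 0 && droppable(f) {
			excess-- // discard this P/B frame to relieve congestion
			continue
		}
		kept = append(kept, f)
	}
	return kept
}

func main() {
	q := []QueuedFrame{
		{Kind: "video", IsKeyframe: true},
		{Kind: "video"}, {Kind: "audio"}, {Kind: "video"}, {Kind: "video"},
	}
	q = shrinkQueue(q, 3)
	fmt.Println(q) // the keyframe and the audio frame survive; some P/B frames are gone
}
```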
The status of the live stream and the business status must be distinguished
The live stream is a media flow, while the app's interactions form an API signaling flow; the two states must not be confused. In particular, the status of the live stream cannot be judged from the state of the app's API interactions.

These are a few common pitfalls in mobile live streaming and the measures that avoid them.
Other optimization measures for mobile live-streaming scenarios
First, how do we optimize start-up speed to achieve the legendary "second-open"?
You may have noticed that some mobile live apps on the market start playing very quickly, while others take several seconds after you tap play. What causes the difference?
Most players wait for a complete GOP before they start decoding and playing, and players ported from FFmpeg may even wait for the audio and video timestamps to synchronize before playing (if a live stream contains only video and no audio, this means waiting for an audio timeout before the picture can be shown).
"Second-open" can be approached from the following angles:
1. Rewrite the player logic so that the player displays a picture as soon as it has obtained the first keyframe.
The first frame of a GOP is usually the keyframe; because less data has to be loaded, "first-frame second-open" can be achieved.
If the live server supports GOP caching, the player can receive data immediately after establishing a connection with the server, eliminating the time needed to fetch the stream from the origin across regions and across carriers.
The GOP represents the keyframe period, that is, the distance between two keyframes, which is the maximum number of frames in a frame group. Suppose a video has a constant frame rate of 24 fps (24 frames per second) and a keyframe period of 2 s; then one GOP contains 48 images. In general, each second of video needs at least one keyframe.
Increasing the number of keyframes improves picture quality (the GOP length is usually a multiple of the FPS) but increases bandwidth and network load. It also means the client player must download a whole GOP, and a GOP carries a fair amount of data: if the viewer's network is poor, the GOP may not download within a second, which hurts the perceived experience.
If the player's first-frame behavior cannot be changed, the live server can also do some clever processing, such as caching a double keyframe instead of a full GOP (fewer images), which greatly reduces the volume of content the player has to load. A sketch of server-side GOP caching follows.
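Here is a hedged Go sketch of server-side GOP caching as described above: the server keeps the most recent GOP, starting from a keyframe, so a newly connected player can immediately be fed decodable data instead of waiting for the next keyframe (the names and structure are illustrative, not any particular media server's API):

```go
package main

import (
	"fmt"
	"sync"
)

// CachedFrame is an encoded frame held by the live server.
type CachedFrame struct {
	IsKeyframe bool
	Data       []byte
}

// GOPCache keeps the latest GOP so new viewers can start from a keyframe.
type GOPCache struct {
	mu      sync.Mutex
	current []CachedFrame // frames since (and including) the last keyframe
}

// Push adds a frame arriving from the broadcaster; a keyframe starts a fresh GOP.
func (c *GOPCache) Push(f CachedFrame) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if f.IsKeyframe {
		c.current = c.current[:0] // a new GOP begins here
	}
	c.current = append(c.current, f)
}

// Snapshot returns a copy of the cached GOP to send to a player that just connected.
func (c *GOPCache) Snapshot() []CachedFrame {
	c.mu.Lock()
	defer c.mu.Unlock()
	return append([]CachedFrame(nil), c.current...)
}

func main() {
	cache := &GOPCache{}
	cache.Push(CachedFrame{IsKeyframe: true})
	cache.Push(CachedFrame{})
	cache.Push(CachedFrame{})
	fmt.Println("frames sent to a new viewer:", len(cache.Snapshot())) // 3, beginning with a keyframe
}
```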
2. Optimization at the app business-logic level.
For example, resolve DNS well in advance (saving tens of milliseconds) and run line speed tests ahead of time to select the optimal playback line. With this preprocessing done, download performance is much better the moment the play button is tapped.
Performance can be optimized around the transport layer on the one hand, and business logic can be optimized around the user's playback behavior on the other. The two complement each other and together form the optimization space for "second-open". A rough sketch of the preprocessing is below.
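A hedged Go sketch of that preprocessing: resolve the playback domain ahead of time and probe a few candidate lines with a quick TCP connect, then hand the fastest one to the player when the user taps play (the host names below are placeholders, not real endpoints):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// probe measures how long a TCP connection to addr takes; a rough stand-in
// for a speed test of one candidate line.
func probe(addr string) (time.Duration, error) {
	start := time.Now()
	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		return 0, err
	}
	conn.Close()
	return time.Since(start), nil
}

// bestLine returns the reachable candidate with the lowest connect time,
// or an empty string if none of them respond.
func bestLine(candidates []string) string {
	best, bestRTT := "", time.Duration(1<<62)
	for _, addr := range candidates {
		if rtt, err := probe(addr); err == nil && rtt < bestRTT {
			best, bestRTT = addr, rtt
		}
	}
	return best
}

func main() {
	// Resolve DNS before the user taps play; the result is cached by the resolver.
	if ips, err := net.LookupIP("live.example.com"); err == nil {
		fmt.Println("pre-resolved:", ips)
	}

	// Pick the optimal line among placeholder edge nodes (these will simply
	// fail the probe outside a real deployment).
	line := bestLine([]string{"edge1.example.com:1935", "edge2.example.com:1935"})
	fmt.Println("selected line:", line)
}
```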
Second, how should beauty and other filters be handled?
In mobile live streaming this is a must-have: without a beauty function, broadcasters are largely unwilling to use a live app. After the picture is captured and before the data reaches the encoder, the data-source callback hands the raw data to the filter handler; once the filter processing is done, the data is sent on to the encoder, as sketched below.
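A minimal Go sketch of where that filter hook sits: the raw captured frame passes through the filter callback before being handed to the encoder (the interfaces are illustrative; on a real device this step usually runs on the GPU through the platform's graphics APIs):

```go
package main

import "fmt"

// RawFrame is an uncompressed captured image buffer.
type RawFrame struct{ Pixels []byte }

// Filter transforms a raw frame (beauty, color filter, sticker, ...).
type Filter func(RawFrame) RawFrame

// Encoder consumes the (possibly filtered) frame.
type Encoder interface{ Encode(RawFrame) }

type logEncoder struct{}

func (logEncoder) Encode(f RawFrame) { fmt.Println("encoding", len(f.Pixels), "bytes") }

// onCapturedFrame is the data-source callback: apply the filters, then encode.
func onCapturedFrame(f RawFrame, filters []Filter, enc Encoder) {
	for _, apply := range filters {
		f = apply(f)
	}
	enc.Encode(f)
}

func main() {
	beauty := func(f RawFrame) RawFrame { return f } // placeholder beauty filter
	onCapturedFrame(RawFrame{Pixels: make([]byte, 1280*720*4)}, []Filter{beauty}, logEncoder{})
}
```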
Beyond optimizations on the mobile side, the architecture of the live streaming server can also reduce latency. For example, the streaming server can proactively push GOPs to edge nodes; the edge nodes cache the GOPs, and the player can load them quickly, reducing the back-to-origin delay.

Next, processing and distribution can be done close to the terminal.

Third, how do we ensure smooth, continuous playback of a live stream?
"Second-open" addresses the playback experience at the moment of first contact; how do we then keep the picture and sound fluent throughout? After all, a live broadcast is not a one-off HTTP-style request: a long socket-level connection is maintained until the broadcaster actively stops pushing the stream.
We have already defined lag: frames freeze during playback and the viewer notices. Setting aside performance differences among terminal devices, let us look at how, at the network-transmission level, a live stream can be kept from stalling.
This is really a fault-tolerance problem: during a live broadcast the transmission network is unreliable. For example, the player may briefly lose its network connection and quickly recover; if the player does no fault-tolerant handling for this, a black screen or a reload of playback is hard to avoid.
To tolerate this kind of network error while keeping the end user unaware of it, the client player can maintain a FIFO (first in, first out) buffer queue: the decoder reads data from the playback buffer queue, while the buffer queue continuously downloads data from the live server. The buffer's capacity is generally expressed in time (for example 3 s), so that when the network becomes unreliable, the client buffer bridges the "no-network" gap, roughly as sketched below.
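A minimal Go sketch of such a playback buffer, sized in seconds of content rather than bytes: the downloader fills it and the decoder drains it, so short network hiccups go unnoticed (the names and the 3 s sizing follow the example above; everything else is illustrative):

```go
package main

import (
	"fmt"
	"time"
)

// BufferedFrame is one downloaded frame awaiting decode.
type BufferedFrame struct {
	Duration time.Duration // how much playback time this frame represents
	Data     []byte
}

// PlaybackBuffer is a FIFO queue between the network downloader and the decoder.
type PlaybackBuffer struct {
	frames   []BufferedFrame
	buffered time.Duration // total playback time currently held
	target   time.Duration // e.g. 3s: how much reserve we try to keep
}

// Push is called by the downloader as data arrives from the live server.
func (b *PlaybackBuffer) Push(f BufferedFrame) {
	b.frames = append(b.frames, f)
	b.buffered += f.Duration
}

// Pop hands the oldest frame to the decoder, or reports an underrun (a visible lag).
func (b *PlaybackBuffer) Pop() (BufferedFrame, bool) {
	if len(b.frames) == 0 {
		return BufferedFrame{}, false // buffer drained: the viewer would see a stall
	}
	f := b.frames[0]
	b.frames = b.frames[1:]
	b.buffered -= f.Duration
	return f, true
}

// NeedsMore tells the downloader whether to keep fetching aggressively.
func (b *PlaybackBuffer) NeedsMore() bool { return b.buffered < b.target }

func main() {
	buf := &PlaybackBuffer{target: 3 * time.Second}
	for i := 0; i < 75; i++ { // about 3 s of 25 fps video
		buf.Push(BufferedFrame{Duration: 40 * time.Millisecond})
	}
	fmt.Println("needs more data:", buf.NeedsMore()) // false: the reserve is full
	if _, ok := buf.Pop(); ok {
		fmt.Println("decoder got a frame; buffered:", buf.buffered)
	}
}
```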
Obviously this is only a delaying tactic. If an edge node of the live server fails while the client player still holds a long connection to it and never receives an end-of-stream signal, the client's buffer capacity alone is no longer of any use; at that point the client's business logic has to take part in scheduling.
It is important that the client and the server cooperate on precise scheduling. Before a live stream is initialized, for example, the edge access node with the best line quality can be assigned through precise scheduling based on IP location and carrier. During the live stream, quality data such as frame-rate feedback can be monitored in real time, and the line can be adjusted dynamically according to the measured stream quality.

Q & A
1. How often should keyframes be set? Is the interval adjusted dynamically based on access conditions? If it is too long, won't "first-screen second-open" be hard to achieve?

Xu Li: The longer the keyframe interval, the longer the GOP and, in theory, the better the picture quality. However, when generating HLS for live streaming, the minimum slicing granularity is also one GOP, so for interactive live streaming the GOP is usually not recommended to be too long; a 2-second keyframe interval is normally fine for live streaming. For example, at a frame rate of 24 fps, a 2-second keyframe interval means 48 frames, so the GOP is 2 s.
2. Does Qiniu's live service use a third-party CDN for acceleration? Have you run into any pitfalls?
Xu Li: For live streaming, Qiniu mainly relies on self-built nodes, but it also supports integrating many third-party CDN providers, combining multiple lines to give customers better-quality service. As for the issues encountered while working with third-party CDNs, we can share them in finer detail when there is an opportunity.
3. Besides optimizing the line, is there anything else that can accelerate an RTMP live stream?
Xu Li: Besides physical optimization of the line, there are logical optimization strategies, such as selective frame dropping, which reduces the transmitted volume without affecting encoding quality.
4. When pushing with OBS and playing back over HLS, the video and audio are out of sync. Which link is the problem, and how can it be optimized?
Xu Li: It may be a problem on the capture side. If the desynchronization arises in the capture/encoding link, timestamps can be synchronized on the receiving server, which is a global correction. If it is a decoding performance problem on the playback side, the playback logic needs to be adjusted, for example by selectively dropping frames while keeping the audio and the timestamps strictly consistent.
5. A concept in the first few pages of the PPT seems wrong: the I-frame is not the keyframe, the IDR frame is. An IDR frame is an I-frame, but an I-frame is not necessarily an IDR frame; only IDR frames are re-entry points.
Xu Li: In Chinese, "I-frame" is commonly translated as "keyframe", but since IDR frames were mentioned, let me expand on that. All IDR frames are I-frames, but not all I-frames are IDR frames: IDR frames are a subset of I-frames. Strictly speaking, an I-frame is an intra-coded frame; because it is a fully, independently compressed frame, the term "keyframe" is usually used to describe it. IDR is an "extension" of the I-frame that adds control logic: an IDR picture is an I-frame picture, and when the decoder reaches an IDR picture it immediately empties the reference-frame queue, outputs or discards all decoded data, looks up the parameter sets again, and starts a new sequence. This provides an opportunity to resynchronize if a serious error occurred in the previous sequence. Pictures after an IDR picture are never decoded using data from pictures before it.
6. Have you looked into the Nginx RTMP module? Why not use it, and how do you evaluate it?
Xu Li: We did research it. nginx_rtmp_module is single-process and multi-threaded, and it lacks the lightweight threads/goroutines that Go provides for writing streaming business logic with naturally concurrent semantics. Nginx's original codebase is large (around 160,000 lines, though not much of it relates to the live-streaming business), and multi-tenant configuration is mainly done by editing nginx.conf, which is usually fine for a single tenant but not very flexible for business expansion: it meets basic requirements but not advanced ones.
7. What open-source software do you use? Is x264 used for encoding? Is the live streaming server self-developed or open source?
Xu Li: The live server is developed in Go; on mobile, hardware encoding is preferred, and software encoding uses x264.
8. When using OBS to push to nginx_rtmp_module, has the video already been compressed, or is additional development on top of OBS required?
Xu Li: OBS already performs encoding and compression; no additional development is needed.
9. To seamlessly insert an advertisement TS file into an HLS live stream: 1) must the resolution of this TS match the preceding video stream? 2) must its PTS timestamps continue incrementing from the previous TS?
Xu Li: 1) It does not need to match. In that case the two videos are completely independent and need not be related in any way; you only need to insert a DISCONTINUITY tag, and once the player recognizes this tag it resets the decoder parameters and plays on seamlessly, so the picture switches very smoothly. 2) It does not need to keep incrementing. For example, video A is live and has played to a PTS of 5 s; to insert video B you insert a DISCONTINUITY, then B; when B finishes you insert another DISCONTINUITY and then return to A. At that point A's PTS can continue incrementing from where it left off, or it can be offset by the duration of B. Usually, for VOD and time-shifted playback the PTS keeps incrementing, while for live it accounts for the length of B.

