Hyper-Threading Technology and Parallel Computing Analysis in the H.264 Encoder

[Authors: Wang Yu, Lin Tao, Tongji University]
 
H.264 is a new-generation video compression standard jointly developed by ITU-T and ISO. Compared with previous standards, its computational precision and several specific algorithms are greatly improved, which enables H.264 to provide higher compression ratios and lower bit rates. This performance improvement, however, comes at the cost of a much larger amount of computation. Although MMX, SSE, and SSE2 can be used to optimize software encoders running on a PC and improve their performance by a factor of 2 to 3, real-time video processing and similar situations demand an even faster encoder. With the continuous development of computer hardware and software, we can use multiprocessor systems, or the Hyper-Threading Technology of Pentium 4-class CPUs, for thread-level parallel processing that further improves encoder speed.

To take advantage of a Hyper-Threading processor, we also need application support; the multi-tasking and multi-threading facilities provided through the OpenMP application programming interface make it convenient to write the required multi-threaded programs.

1. Hyper-Threading Technology

Generally, processor performance is improved by raising the clock speed and increasing the cache capacity. At any given time, however, both approaches are limited by the manufacturing process, so processor vendors seek to improve performance in other ways, such as well-designed extended instruction sets, pipelined execution, and more accurate branch prediction algorithms.

Hyper-Threading Technology is another way to improve processor efficiency. Simply put, Hyper-Threading divides one processor internally into two "virtual" processors, so that the operating system believes it is running on a multiprocessor system. The technology is similar to parallel operation on multiple processors, but it only adds a second architectural state (AS) to a single processor; the AS is essentially a set of general-purpose registers and pointers. The two AS instances share one set of execution units, caches, and other structures, so they can work in parallel and improve efficiency while increasing the core size by only about 5%.

Figure 1 illustrates the differences between a single-processor system with Hyper-Threading Technology (a) and a dual-processor system (b).

As Figure 1(b) shows, each processor in the dual-processor system has its own registers, caches, arithmetic logic units, and other resources. In Figure 1(a), the Hyper-Threading mechanism instead duplicates the architectural state of a single processor, which makes the operating system think it is communicating with two processors, while the two architectural states share the processor's execution resources, such as the arithmetic logic units. Each architectural state tracks the execution status of one program or thread. The operating system therefore assigns worker threads to the two logical processors, and each execution unit of the CPU serves two "command processing centers" at the same time, which reduces idle time and improves efficiency.

In Figure 1(a), with a Hyper-Threading processor, one physical processor is seen as two logical processors, and to the operating system these two logical processors are no different from the two physical processors in Figure 1(b). Hyper-Threading Technology thus allows a single CPU to process instructions in parallel as if it were two CPUs. According to Intel, Hyper-Threading Technology can improve performance by about 30%.

2. OpenMP

To make full use of the advantages of Hyper-Threading Technology, we also need programs that support multithreading.

OpenMP is a portable, scalable standard that provides programmers with a simple and flexible interface for adding parallelism to shared-memory multiprocessor platforms. Computer hardware, software, and tool manufacturers such as Intel, DEC, Silicon Graphics, Kuck & Associates, and IBM jointly defined the OpenMP standard in 1997. OpenMP supports shared-memory parallel programming in C/C++ and Fortran on all architectures, including those based on the Microsoft Windows NT and UNIX operating systems. OpenMP uses compiler directives and library functions to help parallel application programmers create multi-threaded applications in C/C++ and Fortran.

OpenMP consists of a set of compiler directives, library functions, and environment variables; it explicitly instructs the compiler where and how to insert threads into the program.

A program compiled with OpenMP adopts the fork-join model of parallel execution at runtime. The program starts as a single process, called the main thread of execution. When the main thread reaches the first parallel block, a team of threads is generated (the "fork"), and the original main thread becomes the master of the team. All statements enclosed by the parallel block (including subroutines called from within it) are executed in parallel by the thread team. When the parallel block completes, the team's threads stop (the "join") and the main thread continues alone. A program can define any number of parallel blocks, so it can split and merge several times during execution.
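A minimal sketch of this fork-join behavior (the printed messages are illustrative; omp_get_thread_num() and omp_get_num_threads() are standard OpenMP library functions, described in section 4 below):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("before the parallel block: only the main thread\n");

    /* fork: a thread team is generated here */
    #pragma omp parallel
    {
        /* every thread of the team executes this block */
        printf("thread %d of %d running\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    /* join: the team ends and only the main thread continues */

    printf("after the parallel block: only the main thread again\n");
    return 0;
}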

1. Directive format

An OpenMP directive is written as a C/C++ pragma with a special identifier. A compiler that supports OpenMP C/C++ activates and compiles all OpenMP directives when the corresponding command-line option is given.

#pragma omp directive-name [clause[ [,] clause] ...] new-line

In C/C++, every directive must start with "#pragma omp".

2. Parallel for Structure

When OpenMP encounters a parallel for, it generates a thread team and distributes the iterations of the for loop across the team for concurrent execution. OpenMP decides how many threads to generate, how to synchronize them, and when to terminate them; we only need to tell OpenMP which loop to multithread.

The following is a simple example:

#pragma omp parallel for
for (i = 0; i < numPixels; i++)
{
    pGrayScaleBitmap[i] = (unsigned BYTE)
        (pRGBBitmap[i].red * 0.299 +
         pRGBBitmap[i].green * 0.587 +
         pRGBBitmap[i].blue * 0.114);
}
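Each iteration of this loop writes one output pixel and reads only the corresponding input pixel, so the iterations are mutually independent. OpenMP simply divides the numPixels iterations among the threads of the team (with four threads, each converts roughly a quarter of the pixels) and joins the threads at the end of the loop.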

3. Parallel sections Structure

sections is a work-sharing directive for loop-independent tasks. A sections block contains several section blocks, and the program segment enclosed in each section can be executed in parallel by a different thread.

#pragma omp parallel sections
{
    #pragma omp section
    {
        TaskA();
    }
    #pragma omp section
    {
        TaskB();
    }
    #pragma omp section
    {
        TaskC();
    }
}

In the preceding example, the sections are distributed across the thread team. Each section is executed exactly once, by one thread of the team, and the threads synchronize at the end of the sections region.

4. Runtime library functions

These runtime library functions can be used to control or query the parallel execution environment. Several commonly used functions related to the execution environment are listed below.

omp_set_num_threads(integer expression): sets the total number of threads the program uses

omp_get_num_threads(): returns the total number of threads currently in use

omp_get_thread_num(): returns the thread number of the calling thread

omp_get_num_procs(): returns the number of processors available in the system

omp_set_nested(scalar logical expression): if the expression is true, nested multi-level parallelism is allowed
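A minimal sketch that exercises these functions (the requested thread count of 4 is arbitrary):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* query the number of processors available in the system */
    printf("processors available: %d\n", omp_get_num_procs());

    /* request a total of 4 threads for subsequent parallel regions */
    omp_set_num_threads(4);

    #pragma omp parallel
    {
        /* each thread reports its own number and the team size */
        printf("thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}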

5. Environment Variables

The most important environment variable is OMP_NUM_THREADS, which sets the total number of threads to be used for OpenMP execution.

For example, set OMP_NUM_THREADS=4. Note, however, that calling the library function omp_set_num_threads() overrides the value supplied by the environment variable.
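A small illustration of this precedence (the values are arbitrary):

/* launched with OMP_NUM_THREADS=4 in the environment */
omp_set_num_threads(2);   /* overrides the environment variable */
#pragma omp parallel
{
    /* omp_get_num_threads() returns 2 here, not 4 */
}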

3. Parallel Computing Analysis in H.264

Like earlier coding standards, H.264 is still a hybrid coding scheme based on motion compensation plus transform coding, and it retains the proven techniques of its predecessors. At the same time, H.264 introduces many advanced and practical video coding techniques, concentrated in the following areas: multi-mode motion estimation, intra-frame prediction, multi-frame prediction, unified VLC, the 4 × 4 two-dimensional integer transform, motion estimation with precision up to 1/8 pixel, multi-frame reference, and other techniques. The performance improvement of H.264 is achieved at the cost of increased complexity, and the computing power of a single processor cannot meet the varied requirements of video encoding, which makes parallel H.264 video encoding a clear trend.

As shown in Figure 2, an H.264 video sequence is composed of many groups of pictures (GOPs). Each GOP contains many frames; each frame can be divided into several independent slices; a slice can be further divided into many macroblocks, the basic units of motion estimation and entropy coding; and a macroblock can be further divided into smaller blocks. Parallel operations can be carried out at each of these levels.

1. Frame-level parallelism

Frame-level parallelism requires that the frames processed in parallel be independent of one another. Frames in H.264 are of three types: I, P, and B. In general, an I frame needs no reference frame, a P frame uses the preceding I or P frame as its reference, and a B frame uses the I/P frames before and after it as references.

Generally, a frame sequence is encoded with the GOP structure IBBPBBPBBPBBP.... A B frame references the P (or I) frames before and after it, and B frames are not themselves used as references. As Figure 3 shows, once the encoder has finished frame 0 (the I frame) and frame 3 (a P frame), the two B frames in between (frames 1 and 2) and the next P frame (frame 6) can be encoded simultaneously. P frames are therefore the critical point: to speed up the encoding of P frames, more frames can be prepared in advance to reduce idle threads.
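A heavily simplified sketch of this schedule, assuming a hypothetical encode_frame() routine and the GOP above, with frame 0 (I) and frame 3 (P) already encoded:

/* frames 1 and 2 (B) reference only frames 0 and 3, which are done;
   frame 6 (P) references only frame 3, so all three can run in parallel */
#pragma omp parallel sections
{
    #pragma omp section
    { encode_frame(1); }   /* B frame, references frames 0 and 3 */
    #pragma omp section
    { encode_frame(2); }   /* B frame, references frames 0 and 3 */
    #pragma omp section
    { encode_frame(6); }   /* P frame, references frame 3 */
}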

2. Slice-level parallelism

In H.264 encoding, the encoder divides an image into a limited number of slices. Each slice has a relatively independent syntax structure in the standard, and there is no reference relationship between different slices of the same image. Slices can therefore serve as the basic scheduling unit for parallel encoding across multiple threads. However, how an image is divided into slices significantly affects the bit rate.

Figure 4 shows the relationship between bit rate and SNR when an image is divided into different numbers of slices. As the figure shows, to achieve the same encoding quality, the bit rate increases significantly with the number of slices. This is because slice division breaks the correlation between macroblocks within a frame: when a macroblock cannot be compressed using its correlation with a macroblock in another slice, compression efficiency drops.

Dividing an image into too few slices limits the efficiency of parallel processing, while too many slices reduce compression efficiency, so a compromise must be chosen for each situation.
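Because slices of the same image do not reference one another, slice-level parallelism maps naturally onto a parallel for loop. A minimal sketch, assuming hypothetical encode_slice(), frame, and num_slices names:

/* the slices of the current frame are syntactically independent,
   so the loop iterations carry no data dependence between them */
#pragma omp parallel for
for (int s = 0; s < num_slices; s++)
{
    encode_slice(&frame, s);   /* hypothetical per-slice encoding call */
}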

3. Macroblock-level parallelism

In H.264, the encoding of a macroblock depends on the adjacent macroblocks to its left, top-left, top, and top-right:

* Intra-frame prediction: the predicted values of the pixels in the current block are calculated from the neighboring macroblocks to the left, top-left, top, and top-right.

* Motion vector prediction: likewise, the motion vector of the current block is predicted from the motion vectors of the adjacent blocks to the left, top-left, top, and top-right.

* Loop filtering: to reduce distortion at the edges of the current block, the adjacent left and top blocks are used in the filtering.

Because of these dependencies between macroblocks, the current block can be encoded only after its adjacent blocks have been encoded, which constrains the processing order: the top-left macroblock must be processed before the others. However, as Figure 6 shows, after macroblocks 1 and 2 have been processed, the block to the right of block 2 and the block below and to the left of block 2 can be processed simultaneously.

Assume a frame contains W macroblocks per row and H macroblocks per column, and let MB(i) denote the i-th macroblock of the frame in raster-scan order. Considering the dependencies between blocks within a frame, MB(i) and MB(i + W - 2) can be processed simultaneously.

After MB(1) and MB(2) have been processed, MB(3) and MB(W + 1) can be processed at the same time; next, MB(4) and MB(W + 2) can be processed together. In general, the timing of macroblock processing is {MB(1)}, {MB(2)}, {MB(3), MB(W + 1)}, {MB(4), MB(W + 2)}, {MB(5), MB(W + 3), MB(2W + 1)}, ..., {MB((H - 1)W), MB(HW - 2)}, {MB(HW - 1)}, {MB(HW)}.
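The following sketch prints this wavefront schedule for a small frame, assuming every macroblock takes one time step; it models only the dependency timing, not actual encoding (the values of W and H are arbitrary):

#include <stdio.h>

int main(void)
{
    const int W = 5, H = 3;        /* macroblocks per row and per column */
    /* With 0-indexed row r and column c, MB(r * W + c + 1) becomes ready
       at step 2 * r + c: one step after its left and top-right neighbors,
       two steps after its top neighbor. This reproduces the schedule
       {MB(1)}, {MB(2)}, {MB(3), MB(W+1)}, {MB(4), MB(W+2)}, ... */
    int steps = 2 * (H - 1) + W;   /* total number of time steps */
    for (int t = 0; t < steps; t++) {
        printf("step %2d:", t + 1);
        for (int r = 0; r < H; r++) {
            int c = t - 2 * r;     /* column of row r that is ready now */
            if (c >= 0 && c < W)
                printf(" MB(%d)", r * W + c + 1);
        }
        printf("\n");
    }
    return 0;
}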

Summary

Experiments on Intel's widely used 2.4 GHz Pentium 4 processor with Hyper-Threading Technology show that once a program uses OpenMP for parallel processing, CPU usage at runtime rises greatly, from about 50% to about 95%, and the running time is greatly shortened. H.264, as a new video compression standard, significantly increases the computing workload while improving performance; as the analysis above shows, however, many stages of the encoding process can be executed in parallel. Optimizing the H.264 encoder with OpenMP parallel processing, combined with the support of Hyper-Threading processors, will greatly improve the encoder's performance.

From Modern TV Technology

 
