Large-scale data processing (4): the pipeline

Source: Internet
Author: User
Tags: prefetch

You will often see source code like this:

bool dosomething(int &count, int &sum)
{
    if (likely(count < sum)) {
        if (unlikely(count < 0)) {
            print_error(lessthanzero);
            return false;
        }
        count++;
    }
    return true;
}
What are likely and unlikely? They are branch prediction hints, which help the CPU prefetch the right instructions.

One of the most common optimization techniques in the Linux kernel is __builtin_expect. When developers write conditional code, they often know which branch is most likely to be taken and which branch is rarely taken. If the compiler has this prediction information, it can generate the best code around the branch that is most likely to execute.

As shown below, __builtin_expect is wrapped by two macros, likely and unlikely (see ./linux/include/linux/compiler.h):

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
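Putting the macros and the earlier function together gives a minimal compilable sketch. Here report_error and its message are stand-ins for whatever error handling the original code actually uses; the hints themselves are real GCC/Clang built-ins.

```cpp
#include <cstdio>

// Branch prediction hints, as defined in the Linux kernel sources.
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

// Hypothetical stand-in for the original print_error call.
static void report_error(const char *msg)
{
    std::fprintf(stderr, "error: %s\n", msg);
}

// Hot-path function: the error branch is marked unlikely, so the
// compiler can lay out the common path as straight-line code.
bool dosomething(int &count, int &sum)
{
    if (likely(count < sum)) {
        if (unlikely(count < 0)) {
            report_error("count is less than zero");
            return false;
        }
        count++;
    }
    return true;
}
```

The hints change only code layout and prediction, never the result: the function behaves identically with the macros defined as plain `(x)`.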

 

 

Or you may see source code like this:

 

for (size_t i = 0; i < cnt; i += 8)
{
    buffer[i] = value;
    buffer[i + 1] = value;
    ......
    buffer[i + 7] = value;
}
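A fuller sketch of the unrolled fill above: the main loop takes one branch per 8 stores instead of one per store, and a small tail loop (an addition here, assuming cnt need not be a multiple of 8) handles the leftover elements.

```cpp
#include <cstddef>

// Loop-unrolled buffer fill: 8 stores per loop iteration, so the
// backward jump executes one eighth as often as in the naive loop.
void fill_unrolled(int *buffer, size_t cnt, int value)
{
    size_t i = 0;
    for (; i + 8 <= cnt; i += 8) {
        buffer[i]     = value;
        buffer[i + 1] = value;
        buffer[i + 2] = value;
        buffer[i + 3] = value;
        buffer[i + 4] = value;
        buffer[i + 5] = value;
        buffer[i + 6] = value;
        buffer[i + 7] = value;
    }
    for (; i < cnt; ++i)   // tail: remaining 0..7 elements
        buffer[i] = value;
}
```

In practice a modern compiler often does this unrolling itself at -O2/-O3, so measure before hand-unrolling.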

Why do we need to unroll the loop?

To understand these two pieces of code, you must understand the CPU instruction pipeline, which we discuss below. At the end, we will revisit both examples and offer some guidelines for coding.

 

The goal is to reduce the time each instruction takes, so as to maximize CPU utilization. Although some of this is beyond a software engineer's control, understanding these hardware features lets you write programs oriented toward this kind of hardware design, and so reap the optimization rewards.

A reduced instruction set computer (RISC) processor is designed to execute, on average, one instruction per clock cycle. Yet even a simplified instruction still takes multiple steps (multiple clock cycles) to execute. How, then, can one instruction complete every cycle?

The answer is parallelism.

Consider a simple instruction, mov(ebx, eax), which copies the value in EBX into EAX. Its execution involves these steps:

(1) Fetch the instruction's opcode from memory (here, the mov opcode).

(2) Update the EIP register to point to the byte following the opcode (for example, if the next instruction in the stream is jnz, EIP now holds the address of that jnz instruction).

(3) Decode the opcode to determine the operation (mov must be translated into operations the machine can execute).

(4) Fetch the value from the source register (EBX).

(5) Store the value into the destination register (EAX).

 

Of course, the steps here are relatively simple, because everything happens between registers. If an operand comes from memory, updating EIP is more involved: the instruction is opcode plus operand, so EIP must account for the operand's length in order to find the location of the next opcode.

That is not our focus here. Let's look at the implementation of a basic pipeline.

Assume a 6-stage pipeline:

(1) fetch opcode
(2) decode opcode (and prefetch operands)
(3) compute the effective address
(4) fetch the operand value
(5) compute
(6) store the result

Now look at how instructions execute across clock cycles, assuming the stages can all run in parallel (we will discuss pipeline stalls later):

 

            T1  T2  T3  T4  T5  T6  T7  T8  T9  T10 T11 T12
Instr 1     F   D   A   V   C   S
Instr 2         F   D   A   V   C   S
Instr 3             F   D   A   V   C   S
Instr 4                 F   D   A   V   C   S
Instr 5                     F   D   A   V   C   S
Instr 6                         F   D   A   V   C   S
Instr 7                             F   D   A   V   C   S

(F = fetch opcode, D = decode, A = compute address, V = fetch value, C = compute, S = store result)

 

Ideally, during the six cycles T1 through T6 the pipeline fills: the first six instructions are loaded in sequence. Looking at the results, the pipeline completes instruction 1 at T6, instruction 2 at T7, ..., instruction 6 at T11, and instruction 7 at T12. So starting from T6, one instruction retires every cycle; this is the payoff of executing instructions in parallel.
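The completion times above follow a simple formula. A minimal sketch of this ideal (stall-free) model, not any specific CPU:

```cpp
// Ideal pipeline model with no stalls: instruction n (1-based) enters
// the pipeline at cycle n and retires after passing through all stages,
// i.e. at cycle n + stages - 1.
int completion_cycle(int n, int stages)
{
    return n + stages - 1;
}
```

With 6 stages, instruction 1 retires at cycle 6 and instruction 7 at cycle 12, matching the table: after the initial fill, throughput is one instruction per cycle even though each instruction's latency is six cycles.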

However, it is not hard to see that while one instruction executes, we must correctly guess the location of the next instruction in order to prefetch it. A wrong guess wrecks the whole pipeline, which must be flushed and refilled. This is why, in the first example, the likely/unlikely hints in the code tell the compiler which branch is more probable, minimizing, as far as possible, the cost of mispredicting the next instruction. We also want to avoid jumps: every jump instruction forces the processor to guess the next address, and it will sometimes guess wrong. Pipeline-friendly code therefore avoids jumps as much as possible; loop unrolling is one example.
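Besides unrolling, another way to remove a jump from a hot loop is to replace a data-dependent branch with arithmetic, which the compiler can turn into branch-free code that needs no prediction. A sketch with hypothetical helper names:

```cpp
#include <cstddef>

// Branchy version: the if inside the loop is a conditional jump that is
// hard to predict when the data is random.
int count_positive_branchy(const int *a, size_t n)
{
    int count = 0;
    for (size_t i = 0; i < n; ++i)
        if (a[i] > 0)
            count++;
    return count;
}

// Branchless version: the comparison yields 0 or 1, which is simply
// added to the counter; no data-dependent jump remains in the body.
int count_positive_branchless(const int *a, size_t n)
{
    int count = 0;
    for (size_t i = 0; i < n; ++i)
        count += (a[i] > 0);
    return count;
}
```

Both functions return the same result; the difference is only in how predictable the generated control flow is.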

 

Another thing to note is that the number of pipeline stages is not always the simple 6 used here; different hardware divides the pipeline differently. The deeper the pipeline, the higher the cost of a misprediction, but also the higher the clock frequency it allows.

 

To sum up: when coding, you can on the one hand help the compiler predict the location of the next instruction through hints such as likely/unlikely; on the other hand, you can choose algorithms with fewer jumps to get pipeline-friendly code. For example, inverted indexes can be compressed with the PForDelta algorithm, which decodes without branching, and loop unrolling explicitly reduces the number of jumps.

 

Of course, everything discussed here is the ideal case. In practice the pipeline can stall, due to (1) bus contention, (2) data dependencies, and (3) mispredicted branches. We have already covered (3); next time we will discuss (1) and (2), along with some ideas about out-of-order execution.

 

 

 
