I. What is the PC?
"Then Pc = PC + 1", which is often said by the teacher.
This is not completely correct. In the case of a self-increment of one in the PC, it is pointed out that in the case of non-pipeline, the pointer is obtained, decoded, and the operator is executed in sequence. However, when there is a flow of water, it is more complicated. Here we use the third-level assembly line of ARM7 as an example.
The pipeline has three stages, so each instruction is executed in three steps: 1. fetch (load an instruction from memory); 2. decode (identify the instruction to be executed); 3. execute (process the instruction and write the result back to a register).
R15 (the PC) always points to the instruction being fetched, not to the instruction being decoded or executed. By convention, the currently executing instruction is taken as the reference point, so the PC always points two instructions ahead of it. In ARM state, each instruction is 4 bytes long, so the PC always holds the address of the current instruction plus 8 bytes; that is, PC = current execution position + 8.
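A minimal sketch of this behavior (the label and register choice are illustrative, not from the original): in ARM state, reading R15 while an instruction executes returns that instruction's address plus 8.

        ; Illustrative only: register and label are assumptions.
here    MOV     r0, pc          ; r0 = address of 'here' + 8
        SUB     r0, r0, #8      ; r0 now holds the address of 'here' itself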
The same reasoning applies, by analogy, to pipelines of other depths.
II. ARM Pipeline Overview
Introduction
Pipeline technology shortens program execution time and improves the efficiency and throughput of the processor core by running multiple functional units in parallel, which makes it one of the most important techniques in microprocessor design. The ARM7 core uses a typical three-stage von Neumann pipeline, while the ARM9 series uses a five-stage Harvard pipeline. Increasing the number of pipeline stages simplifies the logic at each stage and further improves processor performance.
In the ARM7 three-stage pipeline, the execute stage does a great deal of work, including register and memory reads and writes, ALU operations, and data transfers between related units. Because the execute stage usually occupies multiple clock cycles, it becomes the bottleneck of system performance. The ARM9 adopts a more efficient five-stage design: it adds two functional units, one to access memory and one to write back results, and it moves register reads into the decode stage. This balances the work of all the pipeline stages, and its Harvard architecture avoids bus conflicts between data accesses and instruction fetches.
However, whether the pipeline has three stages or five, it stalls when multi-cycle instructions, branch instructions, or interrupts occur, and register conflicts between adjacent instructions can also stall it and reduce its efficiency. Based on a detailed analysis of how the pipeline works and behaves at run time, this article studies how to improve pipeline performance by adjusting the instruction execution order.
1. ARM7/ARM9 pipeline technology
1.1 ARM7 pipeline technology
Each instruction in an ARM7-series processor passes through three stages: fetch, decode, and execute, carried out independently by different functional units. The fetch unit loads an instruction from memory; the decode unit generates the control signals the datapath needs in the next cycle and completes register decoding; the instruction is then sent to the execute unit, which reads the registers, performs the ALU operation, and writes back the result. Instructions that access memory also complete the memory access in the execute stage. Although a single instruction still takes three clock cycles to pass through the pipeline, the parallel operation of the units raises throughput to roughly one instruction per cycle, improving the processing speed of the instruction stream to about 0.9 MIPS/MHz.
When you access the PC (program counter) through R15 in the three-stage pipeline, the fetch position and the execution position are different, and this must be taken into account. After an instruction is fetched, PC + 4 is written back into the PC and the fetched instruction is passed to the decode unit; the next fetch then proceeds from the new PC. Because each instruction occupies 4 bytes, the PC value equals the address of the currently executing instruction + 8.
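One common place this shows up is PC-relative addressing. The sketch below is illustrative (the label and literal value are assumptions): the assembler encodes the offset relative to the address of the LDR plus 8, because that is the value R15 holds when the LDR executes.

        ; Illustrative only: label and value are assumptions.
        LDR     r0, literal     ; encoded as LDR r0, [pc, #offset], where offset = literal - (address of this LDR + 8)
literal .word   0x12345678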
1.2 ARM9 pipeline technology
The pipeline of the ARM9-series processors is divided into five stages: fetch, decode, execute, memory access, and write-back. The fetch stage reads the instruction from the instruction memory; the decode stage decodes it and reads the register operands, which differs considerably from the datapath of the three-stage pipeline (where registers were read in the execute stage); the execute stage produces the ALU result or, for memory access instructions, the memory address; the memory stage accesses data memory; and the write-back stage writes the result back into the register file. By refining the execute stage of the three-stage pipeline in this way, the work that must be completed in each clock cycle is reduced, which allows a higher clock frequency; separate instruction and data memories reduce conflicts; and the average number of cycles per instruction drops significantly.
2. Analysis of three-stage pipeline operation
When the three-stage pipeline processes simple register instructions, throughput averages one instruction per clock cycle. When memory access instructions and branch instructions appear, however, the pipeline stalls and performance drops. Figure 1 shows the pipeline running under ideal conditions. The MOV, ADD, and SUB instructions in the figure are all single-cycle instructions; starting from T1, three instructions are executed in three clock cycles, and the average number of cycles per instruction (CPI) equals one clock cycle.
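A minimal sequence of the kind Figure 1 describes (the operands are assumptions): three single-cycle register instructions that keep every pipeline stage busy, so one instruction completes per cycle once the pipeline is full.

        ; Illustrative only: operands are assumptions.
        MOV     r0, #1          ; fetched at T1, decoded at T2, executed at T3
        ADD     r1, r0, #2      ; fetched at T2, decoded at T3, executed at T4
        SUB     r2, r1, #1      ; fetched at T3, decoded at T4, executed at T5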
Stalls are also common in the pipeline. The behavior of the pipeline under various stall conditions is analyzed in detail below.
2.1 Pipeline with memory access instructions
LDR is a memory access instruction and is not single-cycle, as shown in Figure 2. In its execute stage it must first calculate the memory address, which occupies the control signal lines; because decoding also needs those control signal lines, decoding of the next instruction (the first SUB) is stalled, and its execution is stalled as well, since LDR keeps occupying the execute unit while it accesses memory and writes the loaded value back to the register. Because the von Neumann architecture cannot access data memory and instruction memory at the same time, the fetch of the MOV instruction is blocked while LDR is in its memory access cycle. As a result, the processor executes 6 instructions in 8 clock cycles, for an average of 1.3 cycles per instruction (CPI).
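The sketch below only illustrates the kind of sequence Figure 2 describes (the registers and the surrounding instructions are assumptions): the LDR ties up the execute unit for its address calculation, memory access, and write-back, stalling the instructions behind it on a shared instruction/data bus.

        ; Illustrative only: registers and neighbouring instructions are assumptions.
        LDR     r0, [r1]        ; multi-cycle: address calculation, memory access, register write-back
        SUB     r2, r3, r4      ; decode and execute stalled while LDR occupies the execute unit
        MOV     r5, r6          ; fetch stalled while LDR uses the single (von Neumann) bus for data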
2.2 Pipeline with branch instructions
When the instruction sequence contains an instruction with a branching function (such as BL), the pipeline also stalls, as shown in Figure 3. At the moment the branch executes, the instruction immediately after it is being decoded and the one after that is being fetched, but neither of these two instructions will be executed. Once the branch has executed, the program must continue from the branch target address, so those two instructions have to be discarded from the pipeline while the program counter is redirected to the new location for fetch, decode, and execute. In addition, some branch instructions must also write the link register and the program counter as the jump completes. BL, for example, involves two extra operations, writing the link register and adjusting the program counter; these still occupy the execute unit, so the decode and fetch stages behind it are stalled.
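A minimal sketch of this situation (the label and the discarded instructions are assumptions): when the BL executes, the two instructions already in the pipeline behind it are flushed, and the pipeline refills from the target.

        ; Illustrative only: label and instructions are assumptions.
        BL      subroutine      ; executes: writes the link register, loads the PC with the target
        ADD     r0, r0, #1      ; already decoded, discarded when the branch is taken
        SUB     r1, r1, #1      ; already fetched, discarded when the branch is taken
subroutine
        MOV     r2, #0          ; the pipeline refills from here: fetch, decode, execute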
2.3 Pipeline with interrupts
A processor interrupt can occur at any time and has nothing to do with the instruction currently being executed. When an interrupt arrives, the processor always finishes the instruction it is executing and then responds. As shown in Figure 4, an IRQ arrives while the ADD instruction at 0x90000 is executing. The IRQ only gets the execute unit after the ADD has completed; the processor then starts handling the IRQ, saving the return address and setting the program counter to the memory address 0x18. At 0x18 sits the IRQ interrupt vector (that is, a jump to the IRQ interrupt service routine). The jump instruction there is executed to enter the interrupt service routine, and the pipeline stalls; executing the instruction at 0x18 proceeds exactly as in the pipeline with branch instructions above.
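For reference, a hedged sketch of what typically sits at the ARM IRQ vector (the handler name is an assumption): address 0x18 holds a single branch, which flushes the pipeline behind it just like any other branch.

        ; Illustrative only: handler name is an assumption.
        .org    0x18
        B       irq_handler     ; taken like any branch, so the instructions behind it are flushed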
3. Five-stage pipeline technology
The five-stage pipeline is widely used in a variety of RISC processors and is regarded as a classic processor design. The memory access stage and the register write-back stage of the five-stage pipeline remove the delay that memory access instructions cause in the execute stage of the three-stage pipeline. Figure 5 shows the five-stage pipeline running (the five-stage pipeline can also stall).
3.1 Five-stage pipeline interlock analysis
There is only one kind of interlock in the five-stage pipeline: the register conflict. Registers are read in the decode stage and written in the write-back stage. If the destination register of the current instruction (A) is the same as a source register of the next instruction (B), then B can only be decoded after A has written back; this is the register conflict of the five-stage pipeline. As shown in Figure 6, the LDR instruction writes R9 in its write-back stage, while the R9 needed by the MOV is exactly the value that LDR will write; decoding of the MOV therefore has to wait until the register write-back of the LDR has completed. (Note: in current processor designs, register forwarding/bypassing is used to optimize the pipeline and resolve such register conflicts.)
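A hedged sketch of the Figure 6 situation, plus one possible way to ease it (the inserted ADD is an assumption, not from the article): placing an instruction that uses unrelated registers between the LDR and its consumer gives the write-back time to complete.

        ; Conflict: MOV needs R9, which LDR only writes in its write-back stage.
        LDR     r9, [r1]
        MOV     r0, r9          ; decode stalls until LDR has written R9 back

        ; One possible rearrangement (the ADD is an assumption):
        LDR     r9, [r1]
        ADD     r2, r3, r4      ; unrelated instruction fills the waiting slot
        MOV     r0, r9          ; the stall is reduced because R9 is closer to being written back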
Although pipeline interlocks increase code execution time, they are a great convenience for designers in the early stages, since one does not have to worry about whether the registers used will conflict; and compilers or assembly programmers can reorder the code, or use other methods, to reduce the number of interlocks. Branch instructions and interrupts still stall the five-stage pipeline as well.
3.2 Five-stage pipeline optimization
Reordering the code can, in many cases, effectively reduce pipeline stalls and keep the pipeline flowing smoothly. The following shows in detail how code optimization streamlines the pipeline and improves efficiency.
The task: the data at memory addresses 0x1000 and 0x2000 must be copied to 0x8000 and 0x9000 respectively.
0x1000 contents: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
0x2000 contents: H, E, L, L, O, W, O, R, L, D
Figure 7 shows the program code and the instruction execution timing of the first copy loop.
The whole copy is carried out by two loops with the same structure, one for each block of data; since the two copy loops are almost identical, it is enough to analyze one of them.
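The article's actual code is in Figure 7 and is not reproduced here; the following is only a hedged sketch of what a word-by-word copy loop of this kind might look like (registers, the element count, and the labels are assumptions).

        ; Hypothetical sketch of one copy loop: all names and registers are assumptions.
        MOV     r0, #0x1000     ; source address
        MOV     r1, #0x8000     ; destination address
        MOV     r2, #10         ; number of elements to copy
loop    LDR     r3, [r0], #4    ; load an element and post-increment the source pointer
        STR     r3, [r1], #4    ; store it and post-increment the destination pointer
        SUBS    r2, r2, #1      ; decrement the counter and set the flags
        BNE     loop            ; branch back until the counter reaches zero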
T1 to T3 are three separate clock cycles; T4 to T11 form one loop iteration, and the timing diagram shows the first pass through the loop. While the LR is being written at T12, the first instruction of the loop is already being fetched again. The pipeline therefore takes 3 + 10 × 10 + 2 × 9 = 121 cycles for one copy loop, and the entire copy requires 121 × 2 + 2 = 244 clock cycles.
Since pipeline efficiency can be raised by reducing pipeline conflicts, and the conflicts here come mainly from register conflicts and branch instructions, the code is adjusted in the following two ways (a hedged sketch of the adjusted loop follows the list):
① Merge the two loops into one, which cuts the number of loop jumps and with it the pipeline stalls caused by branches;
② Reorder the code so that instructions using unrelated registers are inserted between instructions whose registers depend on each other, avoiding as far as possible the pipeline stalls caused by register conflicts.
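Again, the article's adjusted code is in Figure 8; what follows is only a hedged sketch of the idea (registers, counts, and labels are assumptions): the two copies are merged into a single loop, and the two loads are grouped ahead of the two stores so that no store immediately consumes a register that the load before it is still writing.

        ; Hypothetical sketch of the merged, reordered loop: all names are assumptions.
        MOV     r0, #0x1000     ; source 1
        MOV     r1, #0x8000     ; destination 1
        MOV     r2, #0x2000     ; source 2
        MOV     r3, #0x9000     ; destination 2
        MOV     r4, #10         ; elements per block
loop    LDR     r5, [r0], #4    ; load from block 1
        LDR     r6, [r2], #4    ; load from block 2 (unrelated to r5, so it hides r5's write-back)
        STR     r5, [r1], #4    ; store to destination 1
        STR     r6, [r3], #4    ; store to destination 2
        SUBS    r4, r4, #1      ; one branch per iteration instead of two separate loops
        BNE     loop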
Figure 8 shows the adjusted code and its pipeline timing diagram.
After the adjustment, T1 to T5 are five separate clock cycles and T6 to T13 form one loop iteration. While the BNE is writing the LR, the first instruction of the loop is already being fetched, so the total number of instruction cycles is 5 + 10 × 10 + 2 × 9 + 2 = 125.
Comparing the two versions of the code: the entire copy took 244 clock cycles before adjustment, while after reordering the instructions inside the loop the same work is completed in 125 clock cycles. That saves 119 cycles, a reduction of 119/244 = 48.8%, a substantial gain in efficiency.
Table 1 compares the execution cycles before and after code optimization.
Pipeline optimization should therefore be considered from two angles:
① Reduce the number of branch instructions, for example by merging loops, thereby reducing the cycles the pipeline wastes on flushes;
② Reorder the instructions so that the pipeline does not stall on register conflicts.
4. Conclusion
Pipelining increases processor parallelism and greatly improves performance compared with a purely serial CPU. By adjusting the instruction order, pipeline conflicts can be effectively avoided, which improves pipeline execution efficiency. How to use intelligent algorithms to reorder instructions automatically, so as to raise pipeline efficiency and further increase processor parallelism, will be a major direction of future research.
III. Von Neumann Architecture and Harvard Architecture
1. Von Neumann architecture
The von Neumann architecture is also called the Princeton architecture.
In 1945, von Neumann first proposed the stored-program concept and the binary principle; computer systems designed around this concept and principle have since been called von Neumann architecture computers. A von Neumann processor uses a single memory, with instructions and data transferred over the same bus.
A von Neumann processor has the following features:
There must be a memory;
There must be a controller;
There must be an arithmetic/logic unit for carrying out arithmetic and logical operations;
There must be input and output devices for human-computer communication.
Von Neumann's main contribution was to propose and implement the stored-program concept. Since instructions and data are both binary codes, and instructions and operand addresses are closely related, it was natural to choose this structure. However, having instructions and data share a single bus makes the transfer of information a bottleneck that limits computer performance and slows data processing.
Typically, completing an instruction takes three steps: fetch the instruction, decode it, and execute it. The way these instruction streams can be scheduled also shows the difference between the von Neumann and Harvard approaches. Take the simplest case of instructions that read or write memory, say instructions 1 to 3, each of which fetches data from memory. On a von Neumann processor, the instructions and the data they access must come from the same memory over the same bus, so the accesses cannot overlap: the next access can begin only after the previous one has finished.
There are many CPUs in the ARM7TDMI family, some of which have no internal cache. The ARM7TDMI itself, for example, is a pure von Neumann design, while the family members with internal caches that separate data from instructions use the Harvard structure.
2. Harvard architecture
The Harvard architecture is a memory organization that separates program (instruction) storage from data storage, as shown in Figure 1. The CPU first reads the instruction from program memory, decodes it to obtain the data address, reads the data from the corresponding data memory, and then carries out the next step (usually execution). Because program storage and data storage are separate, instructions and data can have different widths; for example, Microchip's PIC16 devices use 14-bit-wide program instructions and 8-bit-wide data.
Figure 1: Harvard architecture diagram
A Harvard-architecture microprocessor usually has high execution efficiency. Program instructions and data are organized and stored separately, so the next instruction can be fetched in advance while the current one executes.
At present, many central processors and controllers use the Harvard architecture. Besides Microchip's PIC devices, they include Motorola's MC68 series, Zilog's Z8 series, Atmel's AVR series, and ARM's ARM9, ARM10, and ARM11.
The Harvard architecture keeps the program space and the data space independent; its purpose is to relieve the memory access bottleneck while a program runs.
Take the very common convolution operation as an example: a single instruction fetches two operands at once, and with pipelining there is an instruction fetch going on at the same time. If program and data were accessed over one bus, the instruction fetch and the operand fetches would inevitably conflict, which is very damaging to the efficiency of loops executed a large number of times.
The Harvard architecture essentially resolves the conflict between instruction fetch and data access.
Fetching the second operand as well, however, requires an enhanced Harvard architecture: for example, splitting the data space and adding another set of buses, as TI does, or using an instruction cache so that the instruction region can also hold part of the data, as Analog Devices does.
If the same three memory-access instructions described above are processed on a Harvard-architecture processor, the instructions and the data they access travel through different memories and over different buses, so the instructions can overlap in execution. This overcomes the bottleneck in data transfer and increases the computing speed.
3. Differences between the von Neumann and Harvard bus systems
The difference between the two is whether the program space and the data space are unified. In the von Neumann architecture, the data space is not separated from the program space; in the Harvard architecture, the two spaces are separate.
Early microprocessors were mostly von Neumann designs, typified by Intel's x86. Instruction fetches and operand fetches share the same bus on a time-division basis; the drawback is that, at high speed, instructions and operands cannot be fetched simultaneously, which creates a transfer bottleneck.
The Harvard bus technique is typified by DSPs and ARM. In chips with a Harvard bus architecture, the on-chip program space and data space are separate, which allows instruction fetch and operand fetch to proceed at the same time and greatly improves computing throughput.
DSP chips may be built on either the von Neumann or the Harvard architecture; the difference is whether the program space and the data space are separated. DSPs generally adopt an improved Harvard structure in which there is not just one separate data space but several, varying with the manufacturer. For external addressing the logic is the same: because of pin limitations, it is usually implemented through the corresponding space-select signals, and it is essentially the same thing.
4. The Harvard architecture and the improved Harvard architecture
Compared with a von Neumann processor, a Harvard processor has two notable features:
(1) Two independent memory modules store instructions and data respectively, and neither module allows instructions and data to coexist;
(2) Two independent buses serve as dedicated communication paths between the CPU and each memory, with no connection between the two buses.
Later, an improved Harvard architecture was proposed, with the following characteristics:
(1) Two independent memory modules store instructions and data respectively, and neither module allows instructions and data to coexist;
(2) There is one shared address bus and one shared data bus: the common address bus accesses both memory modules (the program memory module and the data memory module), and the common data bus transfers data between either module and the CPU;
(3) The two buses are time-shared by the program memory and the data memory.
5. Summary
The architecture is independent of which buses are used; what matters is whether the instruction space and the data space are separate. The 8051 microcontroller separates its data and instruction storage, but its buses are time-multiplexed, so it belongs to the improved Harvard architecture. The ARM9 is a Harvard design, whereas earlier cores such as the ARM7 are still von Neumann. One important reason early x86 processors captured the market so quickly was precisely the simple, low-cost von Neumann bus structure. Today's processors still look like von Neumann machines on their external buses, but internally, with their caches, they are in effect an improved Harvard architecture. As for the trade-offs: the Harvard architecture is more complex, places higher demands on connecting and handling external devices, and is not well suited to expanding external memory, so early general-purpose CPUs found it hard to adopt. Microcontrollers, which integrate the memory they need on-chip, use the Harvard architecture without any difficulty. Modern processors, relying on caches, have combined the two quite well.