A device is usually programmed by reading and writing its registers. On x86 systems this can be done with dedicated I/O instructions; on other systems such as MIPS and SPARC, the device's registers are mapped directly into the memory address space and the device is programmed with ordinary memory reads and writes.
Radeon graphics cards provide two ways to program the hardware. In "push mode" the driver writes the registers directly; the other way is called "pull mode". This post discusses pull mode, which is also the mode the driver uses.
In pull mode, the driver programs the graphics card through a command stream: the driver writes the series of commands needed to configure the card into a command buffer and then yields the processor; the graphics card executes the commands in order and raises an interrupt to notify the driver when it has finished. The CPU places these commands in a ring buffer called the command ring, which is a piece of memory in GTT memory. The driver fills the command ring and notifies the GPU that commands have been written; the GPU's command processor (CP) then fetches and handles them. The previous post described how the memory used by the ring is allocated in the system and how its mappings are established.
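As a rough illustration of what "writing the registers directly" means, here is a minimal sketch in C. It is not driver code: a plain array stands in for the memory-mapped register window, and the names fake_regs, wreg32 and rreg32 are hypothetical.

```c
#include <stdint.h>

/* In a real driver this pointer would come from mapping the device's
 * register window; here an ordinary array stands in for the hardware. */
static uint32_t fake_regs[64];
static volatile uint32_t *mmio = fake_regs;

/* Write a 32-bit value to the register at the given byte offset. */
static void wreg32(uint32_t byte_off, uint32_t v)
{
    mmio[byte_off / 4u] = v;
}

/* Read the 32-bit register at the given byte offset. */
static uint32_t rreg32(uint32_t byte_off)
{
    return mmio[byte_off / 4u];
}
```

The volatile qualifier keeps the compiler from caching or dropping the accesses, which matters once the pointer refers to real device registers.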
The command stream written by the driver is parsed by the command processor (CP). Specifically, the CP performs the following tasks:
- It receives the command stream from the driver. The driver first writes the command stream to system memory, and the CP then fetches it using bus-master access. Three kinds of command stream are currently supported: besides the ring buffer command stream mentioned above, there are the indirect buffer 1 command stream and the indirect buffer 2 command stream;
- It parses the command stream and passes the parsed data to the other modules of the graphics controller, including the 3D graphics processor, the 2D graphics processor, and the video processor.
Command ring buffer
In pull mode, the driver allocates a buffer for the command stream in system memory, and the GPU performs operations such as drawing to the screen based on this command stream. The command buffer is managed as a ring and lives in system main memory shared by the CPU and the GPU: the CPU writes command packets into it, and the GPU reads and parses them. Because the CPU and the GPU must see the same ring buffer state, both sides maintain that state: base address, length, write pointer, and read pointer. The ring buffer only works correctly if the two sides keep this state consistent.
The ring buffer's base address and size are initialized when the system first boots and generally do not change afterwards. The read pointer and the write pointer, however, are modified very frequently while the ring buffer is in use. To keep the state consistent, when the writer (the CPU) updates the write pointer it must tell the GPU the new write pointer; likewise, when the reader (the GPU) updates the read pointer it must tell the CPU the new read pointer. Both the CPU and the GPU start filling or draining from the low address and, on reaching the end of the ring buffer, continue from its beginning.
Figure 1
Figure 1 shows the whole process: the host (CPU) on the left and the GPU on the right each record the start address of the command ring and each keep a copy of the read and write pointers. Before writing, the CPU first checks the read pointer to confirm that there is free space, then writes the content and updates the write pointer; the GPU reads the commands and updates the read pointer.
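The pointer bookkeeping described above can be sketched in a few lines of C. This is a simulation under stated assumptions, not the driver's code: struct cmd_ring, the helpers ring_free_dw, ring_write and ring_read, and the 8-DWORD size are hypothetical names chosen for illustration, and one slot is deliberately left unused so that equal pointers always mean "empty".

```c
#include <stdint.h>

#define RING_DW 8u /* hypothetical ring size in DWORDs */

struct cmd_ring {
    uint32_t buf[RING_DW];
    uint32_t rptr; /* next DWORD the consumer (GPU) will read  */
    uint32_t wptr; /* next DWORD the producer (CPU) will write */
};

/* Free DWORDs available to the producer; one slot always stays unused. */
static uint32_t ring_free_dw(const struct cmd_ring *r)
{
    return (r->rptr + RING_DW - r->wptr - 1u) % RING_DW;
}

/* Producer side: returns 0 on success, -1 if the ring is full. */
static int ring_write(struct cmd_ring *r, uint32_t v)
{
    if (ring_free_dw(r) == 0)
        return -1;
    r->buf[r->wptr] = v;
    r->wptr = (r->wptr + 1u) % RING_DW; /* wrap to the start at the end */
    return 0;
}

/* Consumer side: returns 0 and stores one DWORD, or -1 if the ring is empty. */
static int ring_read(struct cmd_ring *r, uint32_t *out)
{
    if (r->rptr == r->wptr)
        return -1;
    *out = r->buf[r->rptr];
    r->rptr = (r->rptr + 1u) % RING_DW;
    return 0;
}
```

In the real hardware the two sides do not share one struct; each keeps its own copy of the pointers and must be told when the other side's pointer moves, which is the consistency requirement described above.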
Indirect buffering
Besides the ring buffer, the CP can also fetch command packets from indirect buffer 1 and indirect buffer 2 in system main memory. This works as follows. A packet in the main command stream (the ring buffer) sets the CP registers that hold the address and size of indirect buffer 1. Writing the indirect buffer 1 register triggers the CP to fetch the command stream of indirect buffer 1 from the given address. In other words, the last command packet processed in the main command stream sets the address and size of indirect buffer 1, and the CP then starts fetching data from indirect buffer 1. The command stream in indirect buffer 1 may in turn use indirect buffer 2. As before, writing the indirect buffer 2 register triggers the CP to fetch a new command stream from indirect buffer 2; the last packet processed in the indirect buffer 1 stream sets the address and size of indirect buffer 2. The CP fetches commands from indirect buffer 2 until they are all done and then returns to the indirect buffer 1 command stream. It fetches the remaining commands from indirect buffer 1 until that buffer ends, and then returns to the main command stream.
This process is a bit like a function call. When a running program encounters a function call, it jumps to the entry of the called function and, after executing the function, jumps back to where it left off. The maximum call "depth" here is 2.
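The call-and-return behaviour can be modelled with a small C sketch. Everything here is a toy model and an assumption: the two-DWORD commands, the OP_EMIT and OP_CALL opcodes, struct stream and cp_run are hypothetical and bear no relation to the real packet format. The sketch only shows the control flow, including the depth limit of 2.

```c
#include <stdint.h>
#include <stddef.h>

enum { OP_EMIT = 1, OP_CALL = 2 }; /* toy opcodes, not real packets */

struct stream {
    const uint32_t *dw; /* command DWORDs */
    size_t len;         /* length in DWORDs */
};

#define MAX_IB_DEPTH 2 /* ring -> indirect buffer 1 -> indirect buffer 2 */

/*
 * Execute streams[cur]. Each command is two DWORDs: opcode, operand.
 * OP_EMIT appends the operand to out[]; OP_CALL runs streams[operand]
 * and then resumes after the call, like a function return.
 * Returns the new output count, or -1 on an invalid stream.
 */
static int cp_run(const struct stream *streams, size_t nstreams, size_t cur,
                  int depth, uint32_t *out, int nout, int cap)
{
    const struct stream *s = &streams[cur];

    for (size_t i = 0; i + 1 < s->len; i += 2) {
        uint32_t op = s->dw[i], arg = s->dw[i + 1];

        if (op == OP_EMIT) {
            if (nout >= cap)
                return -1;
            out[nout++] = arg;
        } else if (op == OP_CALL) {
            if (depth >= MAX_IB_DEPTH || arg >= nstreams)
                return -1; /* deeper than buffer 2, or unknown buffer */
            nout = cp_run(streams, nstreams, arg, depth + 1, out, nout, cap);
            if (nout < 0)
                return -1;
        } else {
            return -1;
        }
    }
    return nout;
}
```

Running a ring that calls buffer 1, which in turn calls buffer 2, emits the operands in program order; a call issued from inside buffer 2 is rejected because it would exceed the depth limit.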
The Radeon driver in the Linux kernel contains a ring test that verifies that the ring buffer works. If the ring test passes, the part of the configuration that handles GPU and CPU interaction is correct and working.
The ring buffer mechanism is the same on almost all chip types, except that on R600 chips the addresses of the GPU-side read and write pointer registers of the ring buffer changed. The Linux kernel driver code that implements the ring buffer mechanism and the ring test is almost identical across the different GPU cores.
The code of the ring test, taken from the kernel:
2287 int r600_ring_test(struct radeon_device *rdev)
2288 {
2289     uint32_t scratch;
2290     uint32_t tmp = 0;
2291     unsigned i;
2292     int r;
2293
2294     r = radeon_scratch_get(rdev, &scratch);
2295     if (r) {
2296         DRM_ERROR("radeon: cp failed to get scratch reg (%d).\n", r);
2297         return r;
2298     }
2299     WREG32(scratch, 0xCAFEDEAD);
2300     r = radeon_ring_lock(rdev, 3);
2301     if (r) {
2302         DRM_ERROR("radeon: cp failed to lock ring (%d).\n", r);
2303         radeon_scratch_free(rdev, scratch);
2304         return r;
2305     }
2306     radeon_ring_write(rdev, PACKET3(PACKET3_SET_CONFIG_REG, 1));
2307     radeon_ring_write(rdev, ((scratch - PACKET3_SET_CONFIG_REG_OFFSET) >> 2));
2308     radeon_ring_write(rdev, 0xDEADBEEF);
2309     radeon_ring_unlock_commit(rdev);
2310     for (i = 0; i < rdev->usec_timeout; i++) {
2311         tmp = RREG32(scratch);
2312         if (tmp == 0xDEADBEEF)
2313             break;
2314         DRM_UDELAY(1);
2315     }
2316     if (i < rdev->usec_timeout) {
2317         DRM_INFO("ring test succeeded in %d usecs\n", i);
2318     } else {
2319         DRM_ERROR("radeon: ring test failed (scratch(0x%04X)=0x%08X)\n",
2320                   scratch, tmp);
2321         r = -EINVAL;
2322     }
2323     radeon_scratch_free(rdev, scratch);
2324     return r;
2325 }
Line 2294 obtains a free scratch register. A scratch register is a register whose function is not defined by the hardware; its function is defined by the (driver) software.
Line 2299 writes the value 0xCAFEDEAD directly to the register using MMIO, so at this point the scratch register contains 0xCAFEDEAD.
Line 2300 requests 3 DWORDs from the ring buffer mechanism in the kernel driver (GPU commands are measured in units of 4 bytes) and also locks the ring buffer, because several contexts may access the ring buffer concurrently.
Lines 2306-2308 write a 3-DWORD command into the ring buffer space just requested. GPU commands are described in detail in the next chapter; here the command means "write the value 0xDEADBEEF to the scratch register".
Line 2309 commits the command. The command written to the ring buffer by the three lines above is not executed until radeon_ring_unlock_commit is called.
Lines 2310-2315 check the scratch register by polling. If the command executed correctly, the value of the scratch register becomes 0xDEADBEEF; otherwise the command did not run properly and the ring test fails.
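To make the three DWORDs written in lines 2306-2308 concrete, the sketch below builds them in a standalone C program. The macro values are reproduced from memory of the radeon driver's r600d.h and should be treated as assumptions to verify against the kernel source; build_scratch_write is a hypothetical helper, not a driver function.

```c
#include <stdint.h>

/* Assumed values, modelled on the radeon driver's r600d.h; verify before use. */
#define PACKET_TYPE2 2u
#define PACKET_TYPE3 3u
#define PACKET2(v) ((PACKET_TYPE2 << 30) | ((v) & 0x3FFFFFFFu))
#define PACKET3(op, n) ((PACKET_TYPE3 << 30) | \
                        (((op) & 0xFFu) << 8) | \
                        (((n) & 0x3FFFu) << 16))
#define PACKET3_SET_CONFIG_REG        0x68u
#define PACKET3_SET_CONFIG_REG_OFFSET 0x00008000u

/* Fill pkt[0..2] with a "write value to a config register" command. */
static void build_scratch_write(uint32_t pkt[3], uint32_t reg, uint32_t value)
{
    /* Header: packet type, opcode, and count field. */
    pkt[0] = PACKET3(PACKET3_SET_CONFIG_REG, 1);
    /* Register index in DWORDs, relative to the config register base. */
    pkt[1] = (reg - PACKET3_SET_CONFIG_REG_OFFSET) >> 2;
    /* Data to store in the register. */
    pkt[2] = value;
}
```

With a register offset of 0x8500 (an example value assumed here for a scratch register) and the value 0xDEADBEEF, the three DWORDs come out as 0xC0016800, 0x140 and 0xDEADBEEF under these macro definitions.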
As the example code above shows, the Radeon kernel driver uses the following three functions to operate on the ring buffer:
API function | Purpose | Parameters
radeon_ring_lock | Requests ring buffer space and locks the ring buffer; if the ring buffer has no free space, refreshes the CPU-side copy of the read pointer | n is the number of DWORDs requested
radeon_ring_write | Writes a command or command argument to the ring buffer; only the CPU-side write pointer is updated |
radeon_ring_unlock_commit | Updates the GPU-side write pointer and releases the ring buffer lock |
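The division of labour between the three functions can be illustrated with a toy model in C. This is not the driver's implementation: struct toy_ring and the toy_ prefixed functions are hypothetical, plain integers stand in for the GPU's pointer registers and for the driver's lock, and a real driver would wait and retry instead of failing when space is short.

```c
#include <stdint.h>

#define RING_DW 16u /* hypothetical ring size in DWORDs */

struct toy_ring {
    uint32_t buf[RING_DW];
    uint32_t wptr;    /* CPU-side write pointer */
    uint32_t rptr;    /* CPU-side copy of the read pointer */
    uint32_t hw_wptr; /* stands in for the GPU's write-pointer register */
    uint32_t hw_rptr; /* stands in for the GPU's read-pointer register */
    int locked;       /* stands in for the driver's lock */
};

static uint32_t toy_free_dw(const struct toy_ring *r)
{
    return (r->rptr + RING_DW - r->wptr - 1u) % RING_DW;
}

/* Reserve ndw DWORDs and take the lock; refresh rptr only when short of space. */
static int toy_ring_lock(struct toy_ring *r, uint32_t ndw)
{
    if (r->locked)
        return -1;
    if (toy_free_dw(r) < ndw) {
        r->rptr = r->hw_rptr; /* re-read the GPU's read pointer */
        if (toy_free_dw(r) < ndw)
            return -1;
    }
    r->locked = 1;
    return 0;
}

/* Store one DWORD; only the CPU-side write pointer moves. */
static void toy_ring_write(struct toy_ring *r, uint32_t v)
{
    r->buf[r->wptr] = v;
    r->wptr = (r->wptr + 1u) % RING_DW;
}

/* Publish the write pointer to the "GPU" and drop the lock. */
static void toy_ring_unlock_commit(struct toy_ring *r)
{
    r->hw_wptr = r->wptr;
    r->locked = 0;
}
```

Until toy_ring_unlock_commit runs, hw_wptr still holds the old value, so the consumer has no reason to look at the newly written DWORDs. That mirrors why the commands in the ring test are not executed before the commit call.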
The scratch registers deserve a mention. Scratch registers are registers that the GPU reserves for software to use: cards before R300 have only 5 scratch registers, and later cards have 7. The GPU itself does not depend on these registers for its configuration, so software is free to define their function. The code above only verifies that a command is executed correctly, but the polling that follows is instructive. After the software has sent a command, how does it know when the command has been executed? It can follow the same procedure: append a command that writes the scratch register after the commands of interest (making sure, of course, that the value written differs from the scratch register's original value), and then poll the scratch register. Once the register holds the value we asked to be written, we can be sure the commands have been executed. This in effect defines a synchronization mechanism between hardware and software. The later section on the interrupt mechanism discusses how the driver implements fences; the fence mechanism is implemented with interrupts, but it uses the idea described here.
With the description above, the implementation code of the ring buffer should not be difficult to read.
After the ring test completes, the Linux kernel runs an indirect buffer test. Like the ring test, it writes the scratch register.
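The synchronization idea can be sketched as a toy model in C. It is an illustration under assumptions, not the driver's fence code: struct toy_gpu, fence_emit, gpu_step, fence_signaled and fence_wait are hypothetical names, the "GPU" is a small queue stepped by hand, and the sketch polls where the real fence mechanism uses interrupts. Increasing sequence numbers guarantee that each value written differs from the one already in the register.

```c
#include <stdint.h>
#include <stdbool.h>

struct toy_gpu {
    uint32_t scratch;    /* stands in for a scratch register */
    uint32_t pending[8]; /* queued "write scratch" commands (no overflow check) */
    unsigned head, tail;
    uint32_t last_seq;   /* last sequence number handed out by the driver */
};

/* Append a marker command after the work; returns the value to wait for. */
static uint32_t fence_emit(struct toy_gpu *g)
{
    uint32_t seq = ++g->last_seq;

    g->pending[g->tail++ % 8u] = seq;
    return seq;
}

/* Pretend the GPU executes one queued command. */
static void gpu_step(struct toy_gpu *g)
{
    if (g->head != g->tail)
        g->scratch = g->pending[g->head++ % 8u];
}

/* Signaled once the scratch register has reached seq (wrap-safe compare). */
static bool fence_signaled(const struct toy_gpu *g, uint32_t seq)
{
    return (int32_t)(g->scratch - seq) >= 0;
}

/* Poll up to max_polls times, stepping the toy GPU between polls. */
static bool fence_wait(struct toy_gpu *g, uint32_t seq, unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; i++) {
        if (fence_signaled(g, seq))
            return true;
        gpu_step(g);
    }
    return fence_signaled(g, seq);
}
```

Because commands execute in order, seeing a given sequence number in the scratch register implies that every command queued before that marker has also completed.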
2660 int r600_ib_test(struct radeon_device *rdev)
2661 {
2662     struct radeon_ib *ib;
2663     uint32_t scratch;
2664     uint32_t tmp = 0;
2665     unsigned i;
2666     int r;
2667
2668     r = radeon_scratch_get(rdev, &scratch);
......
2673     WREG32(scratch, 0xCAFEDEAD);
2674     r = radeon_ib_get(rdev, &ib);
......
2679     ib->ptr[0] = PACKET3(PACKET3_SET_CONFIG_REG, 1);
2680     ib->ptr[1] = ((scratch - PACKET3_SET_CONFIG_REG_OFFSET) >> 2);
2681     ib->ptr[2] = 0xDEADBEEF;
2682     ib->ptr[3] = PACKET2(0);
2683     ib->ptr[4] = PACKET2(0);
2684     ib->ptr[5] = PACKET2(0);
2685     ib->ptr[6] = PACKET2(0);
2686     ib->ptr[7] = PACKET2(0);
2687     ib->ptr[8] = PACKET2(0);
2688     ib->ptr[9] = PACKET2(0);
2689     ib->ptr[10] = PACKET2(0);
2690     ib->ptr[11] = PACKET2(0);
2691     ib->ptr[12] = PACKET2(0);
2692     ib->ptr[13] = PACKET2(0);
2693     ib->ptr[14] = PACKET2(0);
2694     ib->ptr[15] = PACKET2(0);
2695     ib->length_dw = 16;
2696     r = radeon_ib_schedule(rdev, ib);
......
2703     r = radeon_fence_wait(ib->fence, false);
......
2708     for (i = 0; i < rdev->usec_timeout; i++) {
2709         tmp = RREG32(scratch);
2710         if (tmp == 0xDEADBEEF)
2711             break;
2712         DRM_UDELAY(1);
2713     }
.....
2721     radeon_scratch_free(rdev, scratch);
2722     radeon_ib_free(rdev, &ib);
2723     return r;
2724 }
Lines 2668-2673 are the same as in the ring test.
Line 2674 obtains an indirect buffer from the system; the location of the indirect buffer in memory is recorded in ib->ptr.
Lines 2679-2694 fill the indirect buffer with commands and parameters. The command and parameters are the same as those written in the ring test, but there is also an alignment requirement, which is why the remaining entries are filled with PACKET2(0) up to a length of 16 DWORDs.
Line 2696 adds the filled indirect buffer to the scheduling queue.
Line 2703 involves the fence mechanism, which is covered in detail in the section on the interrupt mechanism.
Likewise, the code that implements the indirect buffer mechanism should not be too difficult to read.
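The padding in lines 2682-2694 can be written as a loop. The following C sketch is an assumption-laden illustration rather than driver code: ib_fill_padded is a hypothetical helper, the filler value 0x80000000 is the assumed encoding of PACKET2(0), and the 16-DWORD granularity is simply taken from the length used in the test above.

```c
#include <stdint.h>
#include <stddef.h>

#define PACKET2_NOP 0x80000000u /* assumed encoding of PACKET2(0), a filler packet */
#define IB_ALIGN_DW 16u         /* assumed padding granularity, from the test above */

/*
 * Copy ncmd command DWORDs into ib and pad with filler packets up to the
 * next multiple of IB_ALIGN_DW. Returns the padded length in DWORDs,
 * or 0 if ib (cap DWORDs) is too small.
 */
static size_t ib_fill_padded(uint32_t *ib, size_t cap,
                             const uint32_t *cmd, size_t ncmd)
{
    size_t len = (ncmd + IB_ALIGN_DW - 1u) / IB_ALIGN_DW * IB_ALIGN_DW;

    if (len > cap)
        return 0;
    for (size_t i = 0; i < ncmd; i++)
        ib[i] = cmd[i];
    for (size_t i = ncmd; i < len; i++)
        ib[i] = PACKET2_NOP;
    return len;
}
```

For the 3-DWORD command used in the test, the helper returns 16, matching the value assigned to length_dw in line 2695.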
For an indirect buffer to take effect, a reference to it must be inserted into the ring buffer's command stream, much like inserting a "call xx" instruction into assembly code to call a function. The command added by the radeon_ring_ib_execute function plays the role of the call instruction in a function call.
The next article will describe the format of these commands and give some examples.
Resources:
The description in this post is mostly based on the "Radeon R5xx Acceleration" document.
"Graphic Engine Resource Management" describes some improvements to command scheduling and can serve as a reference for further study.
[Original] Graphics System in the Linux Environment and AMD R600 Graphics Card Programming (5): AMD Graphics Card Command Processing Mechanism