[Basic knowledge] What is DMA?
DMA = Direct Memory Access. It is a data transfer mechanism implemented in hardware. Simply put, data is moved without the CPU having to carry it.
[/Basic knowledge]
Not quite clear? Here is a simple example:
Suppose there is an array A whose contents I want to copy into another array B. Assume the two arrays are the same size, for example int A[10000]; int B[10000];.
Then I can do this:
[Code = C] for (int x = 0; x < sizeof(A) / sizeof(int); x++) {
    B[x] = A[x];
}
[/Code]
Each element of the array is copied in a loop. This is the simplest method and the easiest to understand.
However, simple as it is, it is inefficient. B[x] = A[x] does not do quite what it appears to do, namely assign the value of A[x] straight to B[x].
So what really happens?
This: the value of A[x] is first loaded into a CPU register, and the value of that register is then stored into B[x].
Why?
Because A and B are both in memory, and the CPU cannot move data directly from one memory location to another. The CPU therefore has to act as an intermediary.
As you can imagine, when every single assignment has to go through an intermediary, efficiency drops.
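To make the two-step path concrete, here is a small sketch in plain C. The function name copy_via_register and the temporary reg are illustrative assumptions; a real compiler picks the actual register and instructions.

```c
#include <stddef.h>

/* Copy n ints from src to dst the way the CPU has to do it:
 * each element is first loaded into a temporary (standing in
 * for a CPU register) and only then stored to the destination. */
void copy_via_register(const int *src, int *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int reg = src[i];   /* load:  memory -> register */
        dst[i] = reg;       /* store: register -> memory */
    }
}
```

Calling copy_via_register(A, B, 10000) does the same job as the loop above, with the hidden intermediate step written out.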
Since the problem is that the CPU acts as an intermediary, how can we avoid this bottleneck? The answer is DMA.
DMA is a hardware device. It works as follows:
-- First, the CPU tells the DMA device that there is a pile of data to transfer and asks it to handle the job efficiently. (DMA request)
-- The DMA receives the message and starts preparing. At this point the CPU gives it the source address, the destination address, the amount of data, the transfer mode, and other parameters. (DMA initialization)
-- Once initialization is complete, the DMA tells the CPU: "I'm about to start the transfer, let me use your bus!" (Bus handover, DMA start)
-- On receiving the message, the CPU temporarily disconnects itself from the bus, and the DMA starts transferring data. (DMA transfer)
-- When the transfer is finished, the DMA tells the CPU: "Done! Here is your bus back." (Bus return)
-- CPU: "Good job! I'll take it from here!" The DMA device stops, and the CPU goes back to whatever it was doing.
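The steps above can be sketched as a small host-side model in C. This is not the DS hardware interface: the dma_channel struct, its fields, and both functions are hypothetical names, and memmove merely stands in for the hardware transfer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a DMA channel. The fields mirror the
 * parameters the CPU hands over during "DMA initialization". */
typedef struct {
    const void *source;  /* data source address */
    void       *dest;    /* data target address */
    size_t      size;    /* number of bytes to transfer */
    int         busy;    /* 1 while the DMA owns the bus */
} dma_channel;

/* DMA request + initialization: the CPU programs the channel. */
void dma_init(dma_channel *ch, const void *src, void *dst, size_t size)
{
    ch->source = src;
    ch->dest   = dst;
    ch->size   = size;
    ch->busy   = 0;
}

/* Bus handover, transfer, bus return. */
void dma_run(dma_channel *ch)
{
    ch->busy = 1;                             /* CPU is off the bus */
    memmove(ch->dest, ch->source, ch->size);  /* the transfer itself */
    ch->busy = 0;                             /* bus goes back to the CPU */
}
```

In this model the CPU only touches the channel twice, once to set it up and once to start it; the per-element work happens entirely inside dma_run.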
Because it is implemented in hardware, DMA is very fast. How fast? On the DS, especially when the amount of data is large, it is far more efficient than using the CPU as an intermediary.
Because DMA is so fast, large data transfers should generally be done with DMA.
The example above can then be written as:
[Code = C] dmaCopy((void *)A, (void *)B, sizeof(A));
[/Code]
However, the DS is a rather special platform, and DMA is not always usable. Some memory regions cannot be accessed by DMA, namely the BIOS, the TCM, and the cache.
[Basic knowledge] The BIOS is a hardware-protected memory region. It is normally read/write protected, which means it cannot be accessed by ordinary means, and naturally DMA cannot access it either. If you read the BIOS directly, all you get is garbage.
So what should I do if I want to dump the BIOS?
That takes a trick, which I will not go into now. I will come back to it and keep this promise at the end of the tutorial.
[/Basic knowledge]
[Basic knowledge] TCM = Tightly Coupled Memory. This is a block of high-speed memory, said to be integrated directly into the CPU chip. The DS has two kinds of TCM: ITCM (Instruction TCM) and DTCM (Data TCM). What each of them is for should need no explanation.
[/Basic knowledge]
Because they are so fast, these two memory regions are reserved for special purposes. For example, code with very strict timing requirements can be placed in ITCM for execution, which effectively improves its running speed. Data that is accessed very frequently can likewise be stored in DTCM to save access time.
How do you put code in ITCM? There are two methods. One is to use GCC's own attribute syntax to mark the code with the "itcm" attribute; the code will then be loaded into and executed from ITCM. The other is simply to rename the .c source file to .itcm.c; the source file will then be compiled into an object file that runs in ITCM.
DTCM is much more convenient. Although both TCMs can be remapped, that is, their addresses are not fixed, the linker script of ndslib maps the two TCMs to 0x00000000 and 0x0b000000. With a fixed address they are easy to access. However, as mentioned above, these two memory blocks have special purposes, so accessing them directly is not recommended. Of the two, DTCM is the more important, because it holds a very important object: the stack.
I will not explain the stack in detail. Briefly, local variables and function call arguments live on the stack.
Since DMA cannot access TCM, it cannot access the stack. And because local variables are allocated on the stack, DMA cannot transfer local variables. For example:
I want to fill the standard palette of the main engine with random colors.
The following code is incorrect:
[Code = C] void fillrandomcolortomainpalette() {
    u16 tmppalette[256];  // local, uninitialized array: lives on the stack
    dmaCopy((void *)tmppalette, (void *)BG_PALETTE, sizeof(tmppalette));  // wrong: DMA cannot read the stack
}
[/Code]
The reason is simple. The data in tmppalette is indeed random, but the array is a local variable allocated on the stack, which DMA cannot access.
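The restriction can be pictured as a simple address check. The sketch below is only an illustration: the base address 0x0b000000 is the DTCM mapping quoted above, while the 16 KB size and the helper name in_dtcm are assumptions, not something ndslib provides.

```c
#include <stdint.h>

/* Base as quoted in the text; the 16 KB size is an assumption. */
#define DTCM_BASE 0x0b000000u
#define DTCM_SIZE (16u * 1024u)

/* Hypothetical helper: returns 1 if addr falls inside the DTCM
 * window (where the stack lives and DMA cannot reach), else 0. */
int in_dtcm(uintptr_t addr)
{
    return addr >= DTCM_BASE && addr < DTCM_BASE + DTCM_SIZE;
}
```

Under these assumptions, a buffer whose address passes this check is one that DMA should not be pointed at.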
The following is the correct method:
[Code = C] void fillrandomcolortomainpalette() {
    u16 tmppalette[256];  // local, uninitialized array: lives on the stack
    memcpy((void *)BG_PALETTE, (void *)tmppalette, sizeof(tmppalette));  // the CPU copies, so the stack is reachable
}
[/Code]
But doesn't memcpy need the CPU? Isn't it slow?
Yes, it is much slower than DMA. But for now it is all we can use. At the end of the tutorial I will show you a method that is both fast and safe.
[Basic knowledge] What is a cache?
As we all know, the CPU is very fast. When the CPU accesses peripherals, some of them are slow and take a long time to respond. The CPU then either waits for the peripheral to respond, or carries on with its work and waits for an interrupt signal from the peripheral.
However, some peripherals do not generate interrupts, and then the CPU has to wait. The most typical example is memory.
When the CPU accesses memory, it does not reach the location it wants immediately, as you might think; there is a "wait state" first. Think about it: every memory access has to wait several machine cycles. That is not a good thing ~~~ Especially since this "several" is not always a single digit; sometimes it can reach three digits.
How can this be solved? With a cache.
A cache is an extremely fast buffer memory integrated into the CPU. Note the key words "extremely fast": in general, its access speed is almost on a par with the CPU itself, so the CPU wastes very little time accessing it. But the speed comes at the cost of capacity, and caches are small. On the DS, the data cache (DC) is only 4 KB and the instruction cache (IC) only 8 KB.
[/Basic knowledge]
So we put frequently used data into the cache, and the CPU can get it straight from the cache without spending time on a memory access.
This is in fact what the CPU does. When reading memory, the CPU first checks the cache for a "copy" of the data it wants. If there is one, great, it uses the copy directly. If not, it has to spend the time to read memory. When writing memory, the CPU writes to the cache rather than directly to memory.
So when does the data actually get written to memory?
When the cache is full, the data in the cache is written back to memory and the cache is cleared. It is like sending mail: the letters are first collected at the post office and only sent on once a certain number has accumulated.
But this raises another problem. If the cache holds a "copy" of some memory data, the CPU uses that copy when reading instead of reading memory. If the data in memory has been changed in the meantime and the CPU reads that location again, won't it get the old copy rather than the latest memory contents? Likewise, if I want to DMA some data, who can guarantee that what is in memory is the latest data?
Someone will say: fine, then I'll just read and write the cache directly. Memory is such a damn nuisance!
Unfortunately, the cache is a complete black box. You do not know its address, and you cannot access it directly.
So what do we do? Fortunately, ndslib provides some functions for us:
[Code = C] // Write the entire data cache back to memory
void DC_FlushAll();
// Write the part of the data cache covering the given address range back to memory
void DC_FlushRange(const void *base, u32 size);
// Clear (invalidate) the entire data cache
void DC_InvalidateAll();
// Clear (invalidate) the part of the data cache covering the given address range
void DC_InvalidateRange(const void *base, u32 size);
// Clear (invalidate) the entire instruction cache
void IC_InvalidateAll();
// Clear (invalidate) the part of the instruction cache covering the given address range
void IC_InvalidateRange(const void *base, u32 size);
[/Code]
So when should these functions be used?
Before a DMA transfer, I need to make sure that the source data in memory is up to date. So a flush is needed, which writes the copy in the DC back to memory.
After a DMA transfer, I need to make sure that the copy in the DC matches the data in memory. ndslib has no function for refreshing the DC from memory, so the only option is to kill the copy in the DC. When the CPU next accesses that memory, the DC has no copy, so the CPU has to read from memory, and the value it reads becomes the new copy in the DC. So an invalidate is needed.
As for All versus Range: All is obviously the safest choice, but it takes more time. If you are sure which range is affected, use Range to save time; if you are not sure, use All to be safe.
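The flush-before and invalidate-after rule can be tried out on a toy model that runs on any PC. Everything here is hypothetical (the toy_system struct and all function names); it models one memory cell behind a one-entry write-back cache and is in no way the real DS cache or the ndslib API.

```c
/* Toy model: one int of "main memory" with a one-entry
 * write-back data cache in front of it. */
typedef struct {
    int memory;  /* what DMA sees */
    int cached;  /* copy held in the data cache */
    int valid;   /* 1 if the cache holds a copy */
    int dirty;   /* 1 if the copy is newer than memory */
} toy_system;

void toy_reset(toy_system *s, int value)
{
    s->memory = value; s->cached = 0; s->valid = 0; s->dirty = 0;
}

/* CPU write: lands in the cache only. */
void cpu_write(toy_system *s, int value)
{
    s->cached = value; s->valid = 1; s->dirty = 1;
}

/* CPU read: use the cached copy if present, else fetch from memory. */
int cpu_read(toy_system *s)
{
    if (!s->valid) { s->cached = s->memory; s->valid = 1; s->dirty = 0; }
    return s->cached;
}

/* Flush: write a dirty copy back to memory. */
void toy_flush(toy_system *s)
{
    if (s->valid && s->dirty) { s->memory = s->cached; s->dirty = 0; }
}

/* Invalidate: throw the cached copy away. */
void toy_invalidate(toy_system *s) { s->valid = 0; s->dirty = 0; }

/* DMA bypasses the cache and touches memory directly. */
int  dma_read(const toy_system *s)       { return s->memory; }
void dma_write(toy_system *s, int value) { s->memory = value; }
```

In this model, a dma_read right after a cpu_write still returns the old memory value until toy_flush is called, and a cpu_read right after a dma_write still returns the old cached value until toy_invalidate is called, which is exactly the pair of problems described above.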
Note that emulators do not emulate the cache perfectly. From what I have observed, none of the three well-known emulators, NO$GBA, DeSmuME, and iDeaS, emulates the cache correctly. So if you run into corrupted graphics, garbage, or wrong colors on an emulator, try adding DC_Flush...(...); before and DC_Invalidate...(...); after your DMA call.
[Fill in traps] As I said, DMA is very fast, and data transfers generally rely on it. But what if DMA cannot reach the memory in question and I still want speed? Use swiCopy and swiFastCopy!
[Code = C] #define COPY_MODE_HWORD (0)
#define COPY_MODE_WORD  (1 << 26)
#define COPY_MODE_COPY  (0)
#define COPY_MODE_FILL  (1 << 24)
void swiCopy(const void *source, void *dest, int flags);
void swiFastCopy(const void *source, void *dest, int flags);
[/Code]
These two functions are "BIOS software interrupts", also known as "system calls". You can think of them as functions specific to the GBA/DS.
They are remarkable: on the GBA they are even faster than DMA! Unfortunately, on the DS they are not fast at all, and are even slower than memcpy.
They have no region restrictions. BIOS or ordinary memory, ITCM or DTCM, they can access all of it directly.
Now let's rewrite the example of the random color palette:
[Code = C] void fillrandomcolortomainpalette() {
    u16 tmppalette[256];  // local, uninitialized array: lives on the stack
    swiCopy((void *)tmppalette, (void *)BG_PALETTE, COPY_MODE_WORD | COPY_MODE_COPY | (sizeof(tmppalette) >> 2));
}
[/Code]
Note the last parameter. Its unit is the word, which is 4 bytes, so the data size in bytes must be divided by 4, that is, shifted right by 2 bits, to convert it to words.
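As a worked example of that arithmetic, the sketch below builds the flags value on its own, using the flag values listed earlier. The helper name word_copy_flags is a hypothetical convenience, not part of ndslib.

```c
/* Flag values as listed in the tutorial. */
#define COPY_MODE_HWORD (0)
#define COPY_MODE_WORD  (1 << 26)
#define COPY_MODE_COPY  (0)
#define COPY_MODE_FILL  (1 << 24)

/* Hypothetical helper: flags for a word-wise copy of
 * size_in_bytes bytes. The count goes in the low bits,
 * expressed in 4-byte words. */
int word_copy_flags(unsigned size_in_bytes)
{
    return COPY_MODE_WORD | COPY_MODE_COPY | (int)(size_in_bytes >> 2);
}
```

For a palette of 256 16-bit colors, the size is 512 bytes, so the count is 128 words and word_copy_flags(512) evaluates to 0x04000080.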
swiFastCopy is faster than swiCopy. If you want more speed, you can switch to swiFastCopy. Note, however, that this function always transfers whole 32-bit words, so the width flag does not apply and COPY_MODE_WORD should not be passed. The size of the data is still given in words.
[/Fill in traps]
Finally, let's keep the earlier promise. This is a function that dumps the ARM9 BIOS:
[Code = C]
#include <nds.h>
#include <stdio.h>
#include <stdlib.h>
#define BIOS_ADDRESS 0xffff0000
#define BIOS_SIZE 32768
void dumparm9bios() {
    void *tmpbuffer = calloc(BIOS_SIZE, 1);
    if (tmpbuffer == NULL) return;
    // Copy the BIOS out with the software interrupt; the count is in words
    swiCopy((void *)BIOS_ADDRESS, tmpbuffer, COPY_MODE_WORD | COPY_MODE_COPY | (BIOS_SIZE >> 2));
    DC_FlushAll();
    FILE *f = fopen("arm9.bios", "wb+");
    if (f != NULL) {
        fwrite(tmpbuffer, BIOS_SIZE, 1, f);
        fclose(f);  // only close a file that was actually opened
    }
    free(tmpbuffer);
}
[/Code]
Yuuhimesama posted:
Pulling up a chair to watch..
So in other words, DMA mode isn't enabled on the R4...
Gbayp posted:
Awesome
Wjzj5886 posted:
Supporting this technical post. I don't understand the functions, but I get the basic principle.
5826659 posted:
For console development you really have to know assembly ····
Imcome posted:
Nice, heh heh
Tomorrow frog posted:
I understand a little ~~~
Tomorrow frog posted:
Haha ~~~ I can use dmaCopy when I write code from now on ~~~ So memcpy isn't that fast after all ~~~
17625 posted:
Great!
Thanks a lot,
Phantom God posted:
Let me correct a mistake... some folks may end up here through Google or the like...
[B] Average efficiency on the NDS: swiCopy < swiFastCopy < memcpy < dmaCopy...
[/B]