Before optimizing memcpy, you need some background on cache prefetching.
Cache prefetch instructions:

Opcode      Instruction       Description
0F 18 /1    prefetcht0 m8     Prefetch data into all cache levels, including L0.
0F 18 /2    prefetcht1 m8     Prefetch data into all cache levels except L0.
0F 18 /3    prefetcht2 m8     Prefetch data into all cache levels except L0 and L1.
0F 18 /0    prefetchnta m8    Prefetch data into a non-temporal cache structure, minimizing cache pollution.
Intel C++ Compiler intrinsic equivalent: void _mm_prefetch(char *p, int i)

This prefetches one cache line of data starting at address p. The parameter i selects the prefetch hint: _MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2 or _MM_HINT_NTA, corresponding to the four instructions above.

If the data has already been loaded into the cache by hand before the CPU operates on it, fewer accesses have to go out to memory because of cache misses, so the operation runs faster. In theory, explicit software prefetching should therefore speed up a memory copy, which makes it worth a try.
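As a rough illustration of the idea, here is a minimal sketch of a copy loop that issues a prefetch hint for the source a few cache lines ahead of the bytes currently being copied. The names copy_with_prefetch, CACHE_LINE and PREFETCH_AHEAD are hypothetical, and the 64-byte line size, the four-line prefetch distance and the choice of _MM_HINT_NTA are assumptions for illustration only; suitable values depend on the CPU and should be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <xmmintrin.h>  /* _mm_prefetch and the _MM_HINT_* constants */

/* Assumed values, for illustration only; tune them per CPU. */
#define CACHE_LINE     64
#define PREFETCH_AHEAD (4 * CACHE_LINE)

/* Hypothetical helper: copy n bytes from src to dst, hinting the CPU to
 * fetch source data PREFETCH_AHEAD bytes before it is actually copied. */
static void *copy_with_prefetch(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i = 0;

    /* Copy one cache-line-sized block per iteration. */
    for (; i + CACHE_LINE <= n; i += CACHE_LINE) {
        /* Only prefetch addresses that are still inside the source buffer. */
        if (i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char *)(s + i + PREFETCH_AHEAD), _MM_HINT_NTA);
        memcpy(d + i, s + i, CACHE_LINE);
    }

    /* Copy the remaining tail (fewer than CACHE_LINE bytes). */
    memcpy(d + i, s + i, n - i);
    return dst;
}
```

A call such as copy_with_prefetch(dst, src, len) behaves like memcpy for non-overlapping buffers. Whether it is actually faster than the library memcpy has to be checked by timing both on the target machine.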
Note that the CPU is entirely free in how it handles data. A prefetch instruction is only a hint, based on our own assumptions, that supplements the CPU's own handling of the data. If the CPU does not actually need that data in the cache at that moment, the prefetch can have the opposite effect. In a multitasking system, for example, we may evict cache contents that were still useful. On the other hand, in a multitasking system a switch between threads or processes takes far longer than a prefetch operation (by comparison it feels like a century), so the effect of thread or process switching on cache prefetching can be ignored.