Reading files can become the speed bottleneck in large-scale data processing. Whether your CPU has 4 or 8 cores and runs at 2 GHz or 3 GHz, disk I/O speed is always limited. In a recent job of mine, processing an 11 GB text file took 34.8 seconds, of which 30.2 seconds went to I/O, about 87% of the total time.
Disk I/O has an upper limit, but do the functions available in C++ actually let us reach it? To find out, I read the 11 GB text file with fread. On Linux, iostat showed a disk read speed of about 380 MB/s. I then read the same file with the dd command and found that it reached 460 MB/s, so single-threaded fread does not hit the disk's read limit. My first thought was that fread carries some fixed overhead on each disk access, and that reading from several threads might hide that overhead and raise throughput. The results showed otherwise: the multithreaded version also read at only about 380 MB/s.
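For reference, the kind of single-threaded fread loop measured above might look like the sketch below. This is not the code I actually ran. The function names `read_with_fread` and `measure_mb_per_s` and the chunk-size parameter are made up for illustration. Note also that a second run over the same file may be served from the page cache and report a rate far above what the disk can deliver.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Reads the whole file at `path` with fread, `chunk_size` bytes at a time.
// Returns the number of bytes read, or -1 if the file cannot be opened.
// (Hypothetical helper, for illustration only.)
long long read_with_fread(const char* path, std::size_t chunk_size) {
    assert(path != nullptr && chunk_size > 0);
    std::FILE* fp = std::fopen(path, "rb");
    if (!fp) return -1;

    std::vector<char> buf(chunk_size);
    long long total = 0;
    std::size_t n;
    // fread returns the number of items read; 0 means EOF or an error.
    while ((n = std::fread(buf.data(), 1, buf.size(), fp)) > 0) {
        total += static_cast<long long>(n);
    }
    std::fclose(fp);
    return total;
}

// Times one pass over the file and returns the rate in MB/s
// (1 MB = 1,000,000 bytes here), or -1.0 on failure.
double measure_mb_per_s(const char* path, std::size_t chunk_size) {
    auto start = std::chrono::steady_clock::now();
    long long bytes = read_with_fread(path, chunk_size);
    auto stop = std::chrono::steady_clock::now();
    if (bytes < 0) return -1.0;
    double seconds = std::chrono::duration<double>(stop - start).count();
    return seconds > 0.0 ? (bytes / 1e6) / seconds : -1.0;
}
```

Calling `measure_mb_per_s("big.txt", 1 << 20)` (a placeholder file name and a 1 MiB chunk) gives a figure that can be compared with what iostat reports while the program runs.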
Why is fread less efficient? From what I have read, fread/fwrite go through the kernel: you tell the kernel how much data to read, the kernel reads it into its own buffer pool, and the content is then copied from the kernel buffer into user space. Writing follows the same path in reverse. Every I/O request therefore passes through the kernel buffer and pays for an extra copy, which limits the speed. One solution is mmap. mmap maps part of a file directly into the process's address space, so you read and write the pages in the kernel buffer pool in place. This avoids copying back and forth between kernel and user space and is usually faster.
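As a rough sketch of the mmap approach on Linux, the function below maps a whole file read-only and scans it in place, counting newline characters as a stand-in for real processing. The name `count_newlines_mmap` is hypothetical. The sketch assumes a 64-bit process, so that a file of 11 GB fits in the address space, and it treats the madvise call as an optional hint only.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Maps the file at `path` read-only and counts '\n' bytes directly in the
// mapped pages. Returns -1 on any error. (Hypothetical helper.)
long long count_newlines_mmap(const char* path) {
    assert(path != nullptr);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  // mmap rejects length 0

    std::size_t len = static_cast<std::size_t>(st.st_size);
    void* addr = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed
    if (addr == MAP_FAILED) return -1;

    // Optional hint that the pages will be read sequentially; failure ignored.
    madvise(addr, len, MADV_SEQUENTIAL);

    const char* p = static_cast<const char*>(addr);
    long long lines = 0;
    for (std::size_t i = 0; i < len; ++i) {
        if (p[i] == '\n') ++lines;  // no read() call, no user-space buffer
    }
    munmap(addr, len);
    return lines;
}
```

The loop touches the file contents through ordinary pointer reads. The kernel pages the data in on demand, and there is no separate buffer for the program to fill and drain.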
To back this up with data, here are benchmark results quoted from another user; I hope they are useful.
Method / platform / time (sec)    Linux gcc    Windows mingw    Windows VC2008
scanf                             2.010        3.704            3.425
cin                               6.380        64.003           19.208
cin (synchronization disabled)    2.050        6.004            19.616
fread                             0.290        0.241            0.304
read                              0.290        0.398            not supported
mmap                              0.250        not supported    not supported
Pascal read                       2.160        4.668
Author: jiang1st2010