From: http://simpleframework.net/blog/v/8486.html
1. Completely random write vs. skip write: a 5x performance gap!
Completely random writes are undoubtedly the slowest way to write. In the logic dump test, we were surprised to find that dumping the in-memory data to disk took as long as two hours on the test machine. The reason is that although the total data volume is not large, it is issued as a huge number of tiny random writes. According to the test, for such completely random writes the disk r/s is only somewhere between roughly 150 and 350 on one test machine, and hard to reach even 180 on another, so a run of two to three hours is no surprise.
How can we improve the slow random-write speed of a single thread? One way is to convert completely random writes into ordered, skipping writes. The implementation can be as simple as caching the data in memory for a while and sorting it, so that when it is finally written to disk the pattern is no longer completely random and the disk head only has to move in one direction. According to the test, I was shocked again: simply sorting in memory cut the disk-write time to 1,645 seconds, and the disk r/s rose to more than 1,000. The write speed suddenly improved by a factor of five.
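A minimal sketch of this cache-and-sort idea, not the original test code; the record size and batch handling here are assumptions made purely for illustration:

```c
/*
 * Sketch of the cache-and-sort idea: buffer small random writes in memory,
 * sort them by file offset, then issue them in ascending-offset order so the
 * disk head only has to move in one direction.  Record size and batch
 * handling are illustrative assumptions.
 */
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <unistd.h>

struct pending_write {
    off_t  offset;              /* where in the block file this record goes */
    size_t len;
    char   data[512];           /* assumed fixed-size record */
};

static int cmp_offset(const void *a, const void *b)
{
    const struct pending_write *x = a, *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/* Flush one batch of buffered writes in offset order. */
int flush_sorted(int fd, struct pending_write *batch, size_t n)
{
    qsort(batch, n, sizeof batch[0], cmp_offset);
    for (size_t i = 0; i < n; i++)
        if (pwrite(fd, batch[i].data, batch[i].len, batch[i].offset) < 0)
            return -1;
    return 0;
}
```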
Note that this skip-write improvement relies on the head moving in a single direction, which makes it very vulnerable to interference. The test above only writes the block file, but suppose a small index file is also written once per tid. Writing the index file alone takes almost no time, yet if those writes are interleaved with the block-file writes they can have a huge impact on overall write performance, because the disk head may have to shuttle back and forth between the two locations. According to the test, writing only the index file for all tids takes a negligible amount of time, but when the index writes and block writes are mixed together, the total time is far greater than the sum of the two parts measured separately. One solution in this case is not to flush the small index data to disk in real time but to keep it in an application-layer cache; this eliminates the impact on block-file writing.
In principle, the explanation is as follows. Reading data from a hard disk generally works like this: first the head is moved to the area of the disk where the data resides, and only then is the data read. Moving the head can be divided into two steps. The first is moving the head to the specified track, i.e., seeking; this moves the head radially across the disk, and the time it takes is called the "seek time". The second is rotating the platter until the corresponding sector is under the head; the time this takes is called the "rotational latency". In other words, before data can be read from the disk, the preparation time is essentially the sum of the seek time and the rotational latency. The actual data-transfer time is a fixed value determined by the data size, recording density, and rotational speed, and cannot be changed at the application layer; the application layer can, however, reduce the seek time and rotational latency by changing the way it accesses the disk.
The application-layer method of caching and sorting is, without doubt, a way of shortening the disk addressing time. Because the head is a physical device, it is also easy to understand why reading or writing other small files in between slows things down so much.
Suggestion: avoid completely random writes whenever possible. When multi-threaded writing is not an option, use an application-layer cache as much as possible so that writes reach the disk as sequentially as possible. Other files with small data volumes can be kept entirely in the application-layer cache to avoid interfering with the writing of large-volume data.
2. Multi-threaded random read: processing speed and response time
Multi-threaded random reads can achieve more than 10 times the processing speed of single-threaded random reads, but at the same time the response time increases. The test conclusion is as follows (each thread issues reads as aggressively as possible):
The conclusion shows that increasing the number of threads can effectively improve the program's overall I/O throughput, but at the same time the response time of each I/O request increases significantly.
The underlying implementation explains this phenomenon: I/O requests from the application layer are added to the I/O request queue in the kernel. When handling these requests, the kernel does not simply serve them in arrival order; instead it uses an elevator algorithm suited to the characteristics of the disk, so that after finishing one I/O request it serves the nearest one next. This effectively reduces disk seek time and thus improves the system's overall I/O throughput, but for an individual I/O request the response time may get worse because it may have to wait in the queue.
The response time increased mainly because, during the test, every thread issued reads as fast as it could; in real applications our program is not under that kind of pressure. Therefore, in programs where I/O is the bottleneck, we should try to use multiple threads to handle different requests in parallel, and the right number of threads also needs to be determined through performance testing.
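A rough sketch of the kind of test program this section describes, under the assumption of 8 threads each issuing 4 KB preads at random offsets against one shared file (all of these numbers are illustrative, not the original test parameters):

```c
/*
 * Sketch: several threads issue random 4 KB preads against one shared fd, so
 * the kernel I/O scheduler has many outstanding requests to reorder and merge.
 * Thread count, read size, and read count are illustrative assumptions.
 */
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS          8
#define READ_SIZE         4096
#define READS_PER_THREAD  1000

static int   g_fd;
static off_t g_blocks;          /* number of READ_SIZE-sized blocks in the file */

static void *reader(void *arg)
{
    char buf[READ_SIZE];
    unsigned seed = (unsigned)(long)arg + 1;
    for (int i = 0; i < READS_PER_THREAD; i++) {
        off_t off = (off_t)(rand_r(&seed) % g_blocks) * READ_SIZE;
        if (pread(g_fd, buf, READ_SIZE, off) < 0)   /* no shared file position, no lock */
            perror("pread");
    }
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    g_fd = open(argv[1], O_RDONLY);
    if (g_fd < 0) { perror("open"); return 1; }
    g_blocks = lseek(g_fd, 0, SEEK_END) / READ_SIZE;
    if (g_blocks <= 0) { fprintf(stderr, "file too small\n"); return 1; }

    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, reader, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    close(g_fd);
    return 0;
}
```

Link with -pthread. The single-threaded baseline is the same loop run in one thread, where each request must wait for the previous one to complete before the next can be issued.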
3. Whether to use direct I/O
First, let's look at the test conclusion:
It can be seen that the non-DIO (buffered) mode is faster for small data volumes, but as the data volume grows the DIO mode becomes faster, with the crossover around 50 GB. (Note: the test used 50 threads, reading 4 KB per read, on a Dell 180 machine with 8 GB of total memory, of which other programs occupied 3 GB, leaving about 5 GB; results may differ under other conditions.)
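For reference, a minimal sketch of what the DIO mode means at the code level on Linux: the file is opened with O_DIRECT, and the buffer, offset, and length must be suitably aligned (4096 bytes is assumed here), which is why posix_memalign is used.

```c
/*
 * Sketch of direct I/O on Linux: open with O_DIRECT and use a
 * buffer/offset/length aligned to the device block size (4096 bytes is
 * assumed here), bypassing the page cache for this fd.
 */
#define _GNU_SOURCE                     /* exposes O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ALIGN 4096

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

    void *buf;
    if (posix_memalign(&buf, ALIGN, ALIGN) != 0) { close(fd); return 1; }

    /* Offset and length are both multiples of ALIGN, as O_DIRECT requires. */
    ssize_t n = pread(fd, buf, ALIGN, 0);
    if (n < 0)
        perror("pread");
    else
        printf("read %zd bytes without going through the page cache\n", n);

    free(buf);
    close(fd);
    return 0;
}
```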
4. System cache
4.1. Several kernel parameters related to system cache:
1. /proc/sys/vm/dirty_background_ratio
This file gives the percentage of total system memory that dirty data may occupy before the pdflush process is triggered to start writing the dirty data back to disk.
Default setting: 10
2. /proc/sys/vm/dirty_expire_centisecs
This file gives how long dirty data may reside in memory; once its age exceeds this value, the pdflush process will write it back to disk on its next run.
Default setting: 3000 (in units of 1/100 second)
3. /proc/sys/vm/dirty_ratio
This file gives the percentage of total system memory that dirty data generated by a process may occupy before the process itself is forced to write dirty data back to disk.
Default setting: 40
4. /proc/sys/vm/dirty_writeback_centisecs
This file gives the interval at which the pdflush process periodically wakes up to write dirty data back to disk.
Default setting: 500 (in units of 1/100 second)
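The current values of these knobs can be read straight from the paths listed above; a tiny sketch:

```c
/* Sketch: print the current values of the four write-back knobs listed above. */
#include <stdio.h>

int main(void)
{
    const char *knobs[] = {
        "/proc/sys/vm/dirty_background_ratio",
        "/proc/sys/vm/dirty_expire_centisecs",
        "/proc/sys/vm/dirty_ratio",
        "/proc/sys/vm/dirty_writeback_centisecs",
    };
    char buf[64];
    for (int i = 0; i < 4; i++) {
        FILE *f = fopen(knobs[i], "r");
        if (!f) { perror(knobs[i]); continue; }
        if (fgets(buf, sizeof buf, f))
            printf("%s = %s", knobs[i], buf);   /* value already ends with '\n' */
        fclose(f);
    }
    return 0;
}
```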
4.2. The system generally writes dirty pages back to disk in the following three cases:
1. Timed write-back: a write-back thread is started periodically, at the interval given by /proc/sys/vm/dirty_writeback_centisecs. The thread started by this timer only writes back pages that have been dirty for longer than /proc/sys/vm/dirty_expire_centisecs / 100 seconds (default 3000, i.e. 30 seconds). The usual value of dirty_writeback_centisecs is 500, i.e. 5 seconds, so by default the system starts a write-back thread every 5 seconds to write back pages that have been dirty for more than 30 seconds. Note that a thread started in this mode writes back only dirty pages that have timed out; dirty pages that have not yet timed out are left alone. These values can be changed under /proc; for the details, see the kernel function wb_kupdate.
2. When memory is insufficient: in this case not all dirty pages are written to disk; roughly 1024 pages are written at a time, until enough free pages are available.
3. When a write finds that dirty pages exceed a certain proportion: when the proportion of dirty pages in system memory exceeds /proc/sys/vm/dirty_background_ratio, the write system call wakes up pdflush to write dirty pages back until the proportion drops below dirty_background_ratio; the write call itself is not blocked and returns immediately. When the proportion of dirty pages in system memory exceeds /proc/sys/vm/dirty_ratio, however, the write system call itself is blocked and writes dirty pages back until the proportion drops below dirty_ratio.
4.3. Lessons learned in the Pb project:
1. If the write volume is huge, you cannot count on the system cache's automatic flush mechanism; it is better to call fsync or sync from the application layer at appropriate times (a sketch follows after this list). If the write volume is large, even exceeding the rate at which the system cache flushes itself, the system's dirty-page ratio may climb past /proc/sys/vm/dirty_ratio, at which point the system blocks subsequent write operations. Such a stall can last on the order of 5 minutes, which is unacceptable for our application. Therefore, the recommended approach is to call fsync at suitable points in the application layer.
2. For performance-critical data, it is best not to rely on the system cache. If the performance requirements are high, implement the cache at the application layer, because the system cache is affected too much by outside activity; the data you care about may simply be evicted from it.
3. In the logic design we also found cases where the system cache fits very well. For the "tall" posts (those with a very large number of floors) handled by the logic layer, implementing an application-layer cache would be very complicated, and such posts are few, so these requests can rely on the system cache; but this must be combined with the application-layer cache. The application-layer cache holds the great majority of requests for ordinary, non-tall posts, so the program's I/O pressure on the system falls mainly on the tall posts, and in that situation the system cache works quite well.
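A minimal sketch of the periodic-fsync idea from point 1 above, assuming a plain write() path; the 64 MB flush interval is an illustrative value, not one taken from the original project:

```c
/*
 * Sketch of periodic fsync: flush explicitly after every fixed amount of
 * written data so the system-wide dirty-page ratio never climbs to the
 * blocking threshold.  FLUSH_BYTES is an illustrative, untuned value.
 */
#include <unistd.h>

#define FLUSH_BYTES (64UL * 1024 * 1024)    /* fsync after every 64 MB written */

int write_with_periodic_fsync(int fd, const char *buf, size_t len)
{
    static size_t since_last_flush = 0;     /* per-process counter, for simplicity */

    ssize_t n = write(fd, buf, len);
    if (n < 0)
        return -1;

    since_last_flush += (size_t)n;
    if (since_last_flush >= FLUSH_BYTES) {
        if (fsync(fd) < 0)                  /* push accumulated dirty pages out now */
            return -1;
        since_last_flush = 0;
    }
    return 0;
}
```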
5. Disk read-ahead
On read-ahead, the following passages are excerpted from material found on the Internet:
Overview of the read-ahead algorithm
1. Sequential Detection
To ensure a high read-ahead hit rate, Linux performs read-ahead only for sequential reads. The kernel checks the following two conditions to decide whether a read() is sequential:
- it is the first read after the file was opened, and it reads the beginning of the file; or
- the current read request and the previous (recorded) read request are at contiguous positions in the file.
If neither of these conditions is met, the read is judged to be random. Any random read terminates the current sequential sequence and thereby terminates read-ahead (rather than merely shrinking the read-ahead size). Note that spatial contiguity here refers to offsets within the file, not to contiguity of physical disk sectors. Linux simplifies matters here; the basic premise for its effectiveness is that files are stored contiguously on disk without serious fragmentation.
2. Pipelined read-ahead
While a program is processing one batch of data, we would like the kernel to prepare the next batch in the background, so that the CPU and the disk can work as a pipeline. Linux uses two read-ahead windows to track the read-ahead state of the current sequential stream: the current window and the ahead window. The ahead window exists precisely for pipelining: while the application is working inside the current window, the kernel may already be doing asynchronous read-ahead in the ahead window; as soon as the program crosses into the ahead window, the kernel immediately advances both windows and starts read-ahead I/O in the new ahead window.
3. Read-ahead size
Once sequential read-ahead has been decided on, an appropriate read-ahead size must be chosen. If the read-ahead granularity is too small, the hoped-for performance gain is not achieved; if it is too large, too many pages that the program does not need may be loaded, wasting resources. Linux therefore uses a rapid window-expansion process:
First read-ahead: readahead_size = read_size * 2; // or * 4
The initial read-ahead window is two to four times the read size. This means that using a larger read granularity (such as 32 KB) in your program can slightly improve I/O efficiency.
Subsequent read-ahead: readahead_size *= 2;
After that the read-ahead window doubles each time until it reaches the maximum read-ahead size configured for the system, which defaults to 128 KB. This default has been in place for at least five years and is too conservative for today's faster disks and larger memories. It can be raised with, for example:
# blockdev --setra 2048 /dev/sda
Of course, a larger read-ahead size is not always better; in many cases I/O latency must also be taken into account.
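As a small illustration of the behavior described above (not code from the original article), a strictly sequential read loop at the 32 KB granularity suggested earlier is the access pattern that lets the kernel's sequential detection keep the read-ahead window growing:

```c
/*
 * Sketch: a strictly sequential read loop at 32 KB granularity.  Consecutive
 * reads at increasing offsets are what the kernel's sequential detection looks
 * for, letting the read-ahead window ramp up; seeking around instead would
 * terminate read-ahead, as described above.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (32 * 1024)

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(CHUNK);
    if (!buf) { close(fd); return 1; }

    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, CHUNK)) > 0)   /* each read is contiguous with the last */
        total += n;

    printf("read %lld bytes sequentially\n", total);
    free(buf);
    close(fd);
    return 0;
}
```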
6. Other details:
6.1. pread and pwrite
For multi-threaded I/O on the same file, use pread and pwrite whenever possible. Otherwise, with lseek + read/write, the seek-and-access pair must be protected by a lock, and that lock effectively serializes the threads' operations on the file at the application layer, wiping out the benefit of multithreading.
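A sketch of the two patterns, to make the contrast concrete (illustrative, with error handling trimmed): the mutex in the first variant is exactly the application-layer serialization described above, while pread carries the offset in the call itself and leaves the shared file position untouched.

```c
/*
 * Sketch of the two patterns (error handling trimmed).
 */
#define _XOPEN_SOURCE 700
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t fd_lock = PTHREAD_MUTEX_INITIALIZER;

/* Variant 1: lseek + read share the fd's file position, so without the lock
 * another thread could move the offset between the two calls. */
ssize_t locked_read_at(int fd, void *buf, size_t len, off_t off)
{
    pthread_mutex_lock(&fd_lock);
    lseek(fd, off, SEEK_SET);
    ssize_t n = read(fd, buf, len);
    pthread_mutex_unlock(&fd_lock);
    return n;
}

/* Variant 2: pread is atomic with respect to the offset, so threads sharing
 * the same fd need no application-level lock at all. */
ssize_t read_at(int fd, void *buf, size_t len, off_t off)
{
    return pread(fd, buf, len, off);
}
```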
With pread, multiple threads are much faster than a single thread, which shows that pread system calls on the same file descriptor do not block one another. How do pread and pwrite manage, in their underlying implementation, to work on the same file descriptor without interfering with each other? The main reason the multi-threaded IOPS exceed the single-threaded IOPS is the kernel's I/O scheduling; that pread calls hardly contend with each other is a secondary factor.
When executing the pread system call, the kernel does not take the inode semaphore, which prevents one thread that is reading a file from blocking the others. The pwrite system call, however, does take the inode semaphore, so multiple threads contend on it; but pwrite merely copies the data into the cache and returns, which takes very little time, so the contention is not severe.
6.2. Are multiple sets of file descriptors needed?
When using pread/pwrite, can I/O performance be improved further by giving each reading/writing thread its own file descriptor?
In the kernel, each file descriptor corresponds to a file object, and each file corresponds to a single inode object. Suppose a process opens the same file twice and obtains two file descriptors: in the kernel there are then two file objects but still only one inode object, and it is the inode object that actually carries out the reads and writes. So even if the reading and writing threads each open the file and use their own exclusive file descriptors, they ultimately act on the same inode object, and I/O performance is not improved.