Impact of Storage Methods and Media on Performance

Source: Internet
Author: User

From: http://stblog.baidu-tech.com/?p=851

Summary

The way data is stored has a great impact on overall application performance. Is data access sequential or random? Is the data stored on disk or on a flash card? Does multi-threaded reading and writing affect performance? How should we choose among the various storage methods? This article presents performance test data for different storage methods, so that you can use it to pick an appropriate storage method during future program development.

Tag

Storage performance, InnoDB performance, storage medium

Directory

Introduction
Storage Performance Analysis
Test procedure description
Storage test data
MySQL InnoDB Performance Testing
MySQL (InnoDB) disk flushing policy
C/S mode communication performance
Direct file storage
File I/O methods
Completely random write vs. sorted skip write: a 5x performance gap
Multi-threaded random read: processing speed and response time
System cache
Kernel parameters related to the system cache
Dirty page write-back
Summary

Introduction

The way data is stored has a great impact on overall application performance. Is data access sequential or random? Is the data stored on disk or on a flash card? Does multi-threaded reading and writing affect performance? How should we choose among the various storage methods?

This article runs detailed performance tests on different storage methods, presents the resulting test data, and briefly discusses the performance differences between the storage methods.

Storage Performance Analysis

Many factors affect storage speed, including the storage medium, the way the disk is read and written, and the hardware environment in which the reads and writes take place. Here we mainly share some findings on storage speed.

The hardware environment is as follows:

CPU: Intel Nehalem E5620 2.4 GHz x 2

Memory: PC-8500 4 GB x 8

Hard disk: 300 GB 10K RPM x 2, RAID 1

Flash: Intel X25-M G2 SSD, 160 GB MLC x 6

NIC: Gigabit

Data volume: 117 GB

Test procedure description:

The test is divided into two sets of programs:

  1. Storage test

a) The storage test program uses pread/pwrite for the read and write tests. The block-chain traversal speed is measured with the block-chain (linked block list) library developed by Fr.

b) To reduce the impact of the system cache on random reads and writes:

i. the data volume is increased to 117 GB;

ii. each piece of data is tested only once;

iii. memory is cleared at program start.

c) The sequential read/write tests read or write all of the data in one pass.

d) The random read/write tests read or write 4 KB at a time, 381 MB in total (a minimal sketch of this random-read test appears after this list).

  2. Network performance test

a) The server and client of the stress tool are implemented with UB + ubrpc.

b) ubsvr_nodelay is used.

c) A common IDL specification is used.

d) Requests of two different packet sizes are tested.
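As a concrete illustration of the storage test described in item 1, here is a minimal sketch of a 4 KB random-read benchmark in the same spirit. It is not the original test program: the file name is a placeholder, and the cache-clearing step mentioned above (for example, dropping the page cache before the run) is omitted.

```c
/* Minimal sketch of a 4 KB random-read benchmark in the spirit of the storage
 * test above. Not the original test program; "data.bin" is a placeholder.
 * Build: gcc -O2 randread.c -o randread */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

#define BLOCK_SIZE  4096                /* 4 KB per read, as in the test */
#define TOTAL_BYTES (381ULL << 20)      /* 381 MB read in total          */

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "data.bin";   /* placeholder file */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    off_t file_size = lseek(fd, 0, SEEK_END);
    long long reads = TOTAL_BYTES / BLOCK_SIZE;
    char buf[BLOCK_SIZE];

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (long long i = 0; i < reads; i++) {
        /* pick a random 4 KB-aligned offset inside the file */
        off_t off = (off_t)(drand48() * (file_size / BLOCK_SIZE)) * BLOCK_SIZE;
        if (pread(fd, buf, BLOCK_SIZE, off) < 0) { perror("pread"); return 1; }
    }
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("random read: %.2f MB/s\n", TOTAL_BYTES / (double)(1 << 20) / secs);
    close(fd);
    return 0;
}
```

A sequential-read variant would simply read the file from start to end with large reads, which is what produces the much higher sequential figures below.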

Storage test data:

Disk

Sequential read: 145.59 MB/s

Random read: 0.91 MB/s (4 KB per read, 381 MB read in total)

Sequential write: 83.1 MB/s

Random write: 0.34 MB/s (4 KB per write, 381 MB written in total)

Flash

Sequential read: 61.5 MB/s

Random read: 14.9 MB/s (4 KB per read, 381 MB read in total)

Sequential write: 59.8 MB/s

Random write: 1.93 MB/s (4 KB per write, 381 MB written in total)

Memory

Sequential write: 1655 MB/s

Random write: 1496 MB/s

e.g. block-chain traversal speed: 10 million entries in 565,582 us

Sequential read/write performance comparison between disk and flash (unit: MB/s)

Random read/write performance comparison between disk and flash (unit: MB/s)

Comparing the random read/write performance of disk and flash, we can see that for write operations the gap between disk and flash is small. In practice the gap depends on factors such as how much data each random write touches and the flash block size. Moreover, on flash with write optimization, data written to the card goes into a buffer first; when certain conditions are met (for example, the buffer fills up), the buffered data is flushed into flash and writes block, which causes performance jitter. So when most of an application's operations are writes and no flash card is available, it is perfectly acceptable to keep the data on disk.

From the test results, however, the random read performance of disk and flash differs by a factor of 8 or more, so when a program performs a large number of random disk reads, putting the data on flash is a good choice. For example, for database applications with a large volume of random queries, we can consider placing the database storage files on a flash card.

On the other hand, the test data also shows that for both sequential reads and sequential writes the disk is clearly faster than the flash card. So if a program loads all of its data from disk in one pass, performs subsequent modifications purely in memory without writing to disk directly, and only occasionally dumps the in-memory data back to disk in one pass, then the data should be placed on disk rather than on a flash card.

MySQL InnoDB Performance Testing

MySQL Test 1: memory read

Hardware environment and configuration:

innodb_buffer_pool_size = 5120M

innodb_flush_log_at_trx_commit = 0

Machine memory: 32 GB

CPU: 4-core Intel(R) Xeon(R) 5150 @ 2.66 GHz

Flash: 256 GB SLC, no write optimization, no RAID

Pressure:

mysqlab args: -uroot -proot -h127.0.0.1 -p3306 -DFR -Fr.sql -c1 -t40 -s1000

50 threads, with a pressure of 1000 req/s per thread:

Conclusion:

Requests processed per second: about 4700
Max response time: 139,374 us
Average response time: 8,535 us

Judging from iostat and the overall system status, flash performance has reached its limit.

MySQL Test 2: read/write conflict test

Read pressure   | Real read pressure | Write pressure    | Real write pressure
0               | 0                  | 1 thread * 1000   | ~300 (very fast at first, then drops off)
0               | 0                  | 10 threads * 1000 | ~300 (very fast at first, then drops off and becomes very unstable)
40 threads * 80 | ~2000              | 1 thread * 300    | average 60+
40 threads * 50 | 1000+, fluctuating | 1 thread * 300    | average 80+
40 threads * 40 | 1500               | 1 thread * 100    | 80+

It can be seen that read/write conflicts on flash are severe: if the pressure is too high, throughput collapses. By contrast, with the write pressure held constant at 100 req/s, the card can sustain more than 1500 concurrent read requests per second, and relatively stably.

In production, write-optimized MLC flash is used most often, precisely to reduce read/write conflicts.

MySQL Test 3: MLC + write-optimized flash

Read pressure     | Real read pressure | Write pressure | Real write pressure
40 threads * 1000 | 8300+ (read only)  | 0              | 0
40 threads * 80   | 2900               | 1 thread * 200 | 160-190
40 threads * 125  | 4000+              | 1 thread * 200 | 160-190

It can be seen that with MLC plus write optimization, performance improves greatly compared with the SLC card used above.

It can sustain roughly 4000 read requests/s together with 160 write requests/s.

MySQL (InnoDB) disk flushing policy

The sections above covered some MySQL InnoDB performance tests; here we briefly describe InnoDB's disk flushing policy. The InnoDB storage engine has a buffer pool (its size is set by the innodb_buffer_pool_size configuration parameter). When InnoDB reads from the database files, it reads page by page into the buffer pool and keeps pages there according to an LRU algorithm. When data needs to be modified, the pages in the buffer pool are modified first, so the buffer pool can contain dirty pages; InnoDB then flushes the dirty pages in the buffer pool to the files at a certain frequency.

InnoDB decides when to flush based on the system's I/O pressure and other factors. There is a flush triggered every second and a flush triggered every ten seconds. The per-second flush runs when the ratio of dirty pages in the current buffer pool exceeds the innodb_max_dirty_pages_pct parameter in the configuration file. In the per-ten-seconds flush, InnoDB flushes at least 10 dirty pages (if there are any) to disk, and the engine also checks the proportion of dirty pages in the buffer pool; if it exceeds a certain threshold, 100 dirty pages are flushed to disk.

C/S mode communication performance

In this section we use Baidu's UB framework to write two network performance test programs, mainly to test the impact of packet size on a network service and the network latency between different data centers.

Program & hardware:

NIC bandwidth: 1000 Mb/s (gigabit)

Communication Protocol: ubrpc

Configuration: persistent connections, epoll, 40 server threads

Same data center:

Request packet size: 13000 bytes per packet

Single-thread pressure: 1600 req/s, NIC traffic: 20.6 MB/s

8-thread pressure: 9000 req/s, NIC traffic: 122 MB/s (essentially the NIC's limit)

Request packet size: 1700 bytes per packet

Single-thread pressure: 2900 req/s, NIC traffic: 4.7 MB/s

36-thread pressure: a stable 28000 req/s, NIC traffic: 50.3 MB/s (for a gigabit NIC there is still plenty of bandwidth headroom, but the CPU r value (run queue) is already around 10 and the pressure on the server becomes unstable)

Across data centers (cross-IDC latency: 817 us):

Request packet size: 1700 bytes per packet

Single-thread pressure: 860 req/s, NIC traffic: 1.4 MB/s

With the same program, single-thread stress performance across data centers drops markedly, to less than 1/3 of the same-data-center figure, because each request carries the roughly 817 us of cross-IDC latency noted above.
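To make the measurement concrete, here is a minimal round-trip latency sketch using plain TCP sockets rather than UB/ubrpc (which is Baidu-internal); the host, port, packet size, and the assumption of an echo-style server are placeholders, not part of the original test setup.

```c
/* Minimal request/response latency sketch over plain TCP (not UB/ubrpc).
 * Assumes an echo-style server at 127.0.0.1:8000 that returns each packet. */
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define PKT_SIZE 1700          /* bytes per request, as in the small-packet test */
#define REQUESTS 10000

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8000);                        /* placeholder port   */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);    /* placeholder server */
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect"); return 1;
    }

    char buf[PKT_SIZE];
    memset(buf, 'x', sizeof(buf));

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < REQUESTS; i++) {
        /* one round trip: send a fixed-size packet, read the full echo back */
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) { perror("write"); return 1; }
        size_t got = 0;
        while (got < sizeof(buf)) {
            ssize_t n = read(fd, buf + got, sizeof(buf) - got);
            if (n <= 0) { perror("read"); return 1; }
            got += (size_t)n;
        }
    }
    gettimeofday(&t1, NULL);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("avg round trip: %.1f us, single-thread throughput: %.0f req/s\n",
           us / REQUESTS, REQUESTS / (us / 1e6));
    return 0;
}
```

The inverse relationship between per-request round-trip time and single-thread throughput is exactly why the cross-IDC figure above drops to under 1/3 of the same-data-center figure.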

Direct file storage

Direct file storage here means writing in-memory data straight to files on disk. We usually write data with munmap (for mapped regions), fwrite, pwrite, and write, and conversely load data from disk files into memory with mmap, fread, pread, and read.

File I/O methods

mmap/munmap

First, mmap/munmap. This is a memory-mapping technique that maps a disk file into memory, so that modifications to the mapped memory are reflected back to the disk file. The Linux kernel maintains a data structure that associates a region of the virtual address space with the backing data; the association between a file range and the address-space region it is mapped to is maintained through a priority tree, as shown in Figure 1.1. mmap performs very well in terms of both speed and ease of use.

Figure 1.1: struct file is the handle of a file opened with open(); its f_mapping member points to the address_space structure, which in turn points to the inode and the priority tree and is used for the memory mapping.
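As a minimal illustration of the mmap path (the file name and the five-byte edit are placeholders, not taken from the original article), the sketch below maps a file, modifies it in memory, and forces the change back to disk:

```c
/* Minimal mmap/munmap sketch: map a file, modify it in memory, write it back.
 * Assumes "data.bin" exists and is at least 5 bytes long (placeholder file). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDWR);          /* placeholder file            */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    /* map the whole file; writes through 'p' become writes to the file */
    char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    memcpy(p, "hello", 5);                      /* modify the mapped memory    */

    msync(p, st.st_size, MS_SYNC);              /* force the change to disk    */
    munmap(p, st.st_size);                      /* release the mapping         */
    close(fd);
    return 0;
}
```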

pwrite/write

Next, pwrite and write. Both belong to file I/O, with the data flowing from process => fd => file, and both are direct system calls. The differences are that pwrite is equivalent to calling lseek and then write as one atomic operation, so no other thread can slip in between the positioning and the write (pread behaves the same way on the read side), and that pwrite does not update the file offset.

In multi-threaded I/O, use pread and pwrite for I/O wherever possible. Otherwise, with lseek + read/write, the pair of operations must be protected by a lock, and that lock effectively serializes multi-threaded access to the same file at the application layer, wiping out the benefit of multithreading.
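The contrast looks roughly like the sketch below (the function names are illustrative, not from the original article):

```c
/* Illustrative contrast: lseek + read shares the file offset across threads
 * and therefore needs a lock, while pread takes the offset explicitly and
 * needs no lock at all. */
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t fd_lock = PTHREAD_MUTEX_INITIALIZER;

/* lseek + read: must be serialized across threads sharing 'fd' */
ssize_t locked_read_at(int fd, void *buf, size_t len, off_t off)
{
    pthread_mutex_lock(&fd_lock);
    lseek(fd, off, SEEK_SET);
    ssize_t n = read(fd, buf, len);
    pthread_mutex_unlock(&fd_lock);
    return n;
}

/* pread: positioning and reading happen in one atomic call, no lock needed */
ssize_t unlocked_read_at(int fd, void *buf, size_t len, off_t off)
{
    return pread(fd, buf, len, off);
}
```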

Using pread, multithreading is much faster than a single thread, which shows that pread system calls on the same file descriptor do not block one another. How do pread and pwrite manage, in their underlying implementation, to avoid interfering with each other on the same file descriptor? Note also that the main reason multi-threaded IOPS exceed single-threaded IOPS is the kernel's I/O scheduling algorithm; pread itself is only a secondary factor.

The kernel does not take the inode semaphore when executing the pread system call, which prevents one thread from blocking others while reading the file. The pwrite system call does take the inode semaphore, so multiple threads contend on it; but since pwrite only writes the data into the cache before returning, the semaphore is held very briefly and the contention is not severe.

When pread/pwrite is used, can I/O performance be improved further by giving each read/write thread its own set of file descriptors?

Each file descriptor corresponds to a file object in the kernel, and each file corresponds to a single inode object. Suppose a process opens the same file twice and obtains two file descriptors: in the kernel there are two file objects but only one inode object, and it is the inode object that ultimately carries out the reads and writes. So even if the read and write threads each use their own descriptors for the same file, they still end up operating on the same inode object, and I/O performance does not improve.

pwrite/fwrite

Finally, pwrite versus fwrite. Both ultimately store in-memory data into a file, but the principle and the path differ. As noted above, pwrite is file I/O: the data flows from process => fd => file. fwrite is stream (standard) I/O: the data flows from process => FILE object (fp) => stream buffer => file. What used to be a direct operation on the file becomes, with fwrite, an operation on a stream object, and the library function takes care of the stream => file step. The logical representation of the stream is the FILE object; its substance is the buffers the stream uses, which stand in for the file from the application process's point of view.
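The difference in data path can be seen in a small sketch (file names are placeholders): fwrite lands in a user-space stream buffer and only reaches the kernel on fflush/fclose or when the buffer fills, whereas pwrite goes straight to the kernel at the given offset.

```c
/* Sketch contrasting stream I/O (fwrite, user-space buffered) with file I/O
 * (pwrite, straight into the kernel page cache). File names are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "hello, storage\n";

    /* Stream I/O: data first lands in the FILE object's user-space buffer;
     * it reaches the kernel only on fflush()/fclose() or when the buffer fills. */
    FILE *fp = fopen("stream.out", "w");
    if (!fp) { perror("fopen"); return 1; }
    fwrite(msg, 1, sizeof(msg) - 1, fp);
    fflush(fp);                     /* push the stream buffer to the kernel */
    fclose(fp);

    /* File I/O: pwrite goes directly to the kernel at the given offset,
     * without touching any user-space stream buffer or the file offset. */
    int fd = open("file.out", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    pwrite(fd, msg, sizeof(msg) - 1, 0);
    close(fd);
    return 0;
}
```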

Completely random write vs. sorted skip write: a 5x performance gap

Completely random writing is without doubt the slowest way to write. During a logic dump test we were surprised to find that randomly writing the in-memory data into the much larger on-disk data set took as long as two hours. The reason is that, although the amount of data itself is modest, it translates into a huge number of individual random writes; according to the tests, such completely random writes only achieve a disk r/s somewhere in the range of roughly 180 to 350 depending on the machine, so 2-3 hours is no surprise.

How, then, can the slow single-threaded random writes be sped up? One way is to turn completely random writes into ordered, skipping random writes: simply cache the writes in memory for a while and sort them, so that when they reach the disk the pattern is no longer completely random and the head only has to move in one direction. The test surprised us again: simply sorting in memory cut the disk-writing time to 1645 seconds and raised the disk r/s to more than 1000; write speed jumped by a factor of five.
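A minimal sketch of this cache-and-sort idea (the record layout, block size and batch handling are illustrative, not the original dump code):

```c
/* Minimal sketch of turning random writes into sorted, one-direction writes:
 * buffer (offset, block) pairs in memory, sort by offset, then pwrite in order.
 * Structure sizes and the record layout are illustrative. */
#include <stdlib.h>
#include <unistd.h>

#define BLOCK 4096

struct pending_write {
    off_t offset;           /* where the block goes in the file */
    char  data[BLOCK];      /* the block to write               */
};

static int cmp_offset(const void *a, const void *b)
{
    off_t x = ((const struct pending_write *)a)->offset;
    off_t y = ((const struct pending_write *)b)->offset;
    return (x > y) - (x < y);
}

/* Flush a batch of buffered writes in ascending offset order, so the head
 * sweeps in a single direction instead of seeking back and forth. */
int flush_sorted(int fd, struct pending_write *batch, size_t n)
{
    qsort(batch, n, sizeof(batch[0]), cmp_offset);
    for (size_t i = 0; i < n; i++)
        if (pwrite(fd, batch[i].data, BLOCK, batch[i].offset) != BLOCK)
            return -1;
    return 0;
}
```

Sorting by offset is what keeps the head sweeping in one direction, which is where the 5x gain observed in the test comes from.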

Note that this skipping write owes its performance to the head moving in a single direction, which makes it very sensitive to interference. The test above only writes the block file; but if, while processing each tid, a small index file is also written, then even though writing the small index file alone costs almost no time, interleaving it with the block-file writes can hurt overall write performance badly, because the disk head may have to travel back and forth between the two locations. According to the tests, writing only the index files for all tids takes very little time, but doing the index writes and the block writes together takes far longer than the sum of the two done separately. One solution in this situation is not to flush the small-volume data to disk in real time, but to hold the indexes in an application-layer cache, which removes the impact on the block-file writes.

The underlying principle explains this behavior. Reading data from a hard disk generally proceeds as follows: the head is first moved to the area of the platter where the data lives, and only then is the data read. Moving the head has two parts: moving it to the specified track (seeking), a radial movement whose cost is called the seek time; and waiting for the platter to rotate until the target sector is under the head, whose cost is called the rotational latency. In other words, the preparation time before any data is read is essentially the sum of the seek time and the rotational latency. The actual data transfer time is determined by the data size, the recording density and the rotational speed of the disk and cannot be changed at the application layer; what the application layer can do is reduce the seek time and rotational latency by changing its access pattern to the disk. The cache-and-sort method used above is precisely a way of shortening the disk addressing time, and since the head is a physical device, it is easy to see why reading or writing other small files in between slows everything down so much.
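As a rough back-of-the-envelope check (typical figures, not measurements from this test): for the 10K RPM disks in the test machine, the average rotational latency alone is

$$ t_{\mathrm{rot}} = \frac{1}{2}\cdot\frac{60\ \mathrm{s}}{10000} = 3\ \mathrm{ms}, $$

and adding a typical seek time of a few milliseconds gives several milliseconds of positioning per completely random access, i.e. on the order of 100-200 random I/Os per second for a single outstanding request. That is consistent with the random-write figure measured earlier: 0.34 MB/s divided by 4 KB per write is roughly 87 writes per second.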

Suggestion: avoid completely random writes as much as possible. When multi-threaded writing is not an option, use an application-layer cache as much as possible so that writes hit the disk as sequentially as possible. Other files with small data volumes can be kept in the application-layer cache the whole time, to avoid disturbing the writes of the large-volume data.

Multi-threaded random read: processing speed and response time

Multi-threaded random reads can be processed more than 10 times faster than single-threaded random reads, but the same mechanism also increases the response time of each request. The test results are as follows (each thread issues reads as fast as it can):

Read threads | Time for 100 reads (us) | Average time per read (us)
1            | 1,329,574               | 13,291
5            | 251,765                 | 12,976
10           | 149,206                 | 15,987
20           | 126,755                 | 25,450
50           | 96,595                  | 48,351

The conclusion: increasing the number of threads effectively raises the program's overall I/O processing speed, but at the same time the response time of each individual I/O request grows noticeably.

The underlying implementation explains this: I/O requests issued by the application are queued in the kernel's I/O request queue, and the kernel does not simply serve them in arrival order. Instead it uses an elevator-style algorithm suited to disk characteristics, serving the nearest request after finishing the current one. This effectively reduces seek time and so raises the system's overall I/O throughput, but an individual request may have to wait in the queue, so its response time can grow.

The response time grows here mainly because every thread in the test reads as fast as it can; real programs are rarely under that much pressure. So in programs where I/O is the bottleneck, we should use multiple threads to handle different requests in parallel, and choose the number of threads through performance testing.
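A minimal sketch of such a multi-threaded random read test (the file name and thread count are placeholders, and the timing/reporting code is left out for brevity):

```c
/* Minimal sketch of a multi-threaded random read test: THREADS threads each
 * issue 100 random 4 KB preads against the same descriptor, with no locking.
 * "data.bin" and the thread count are placeholders.
 * Build: gcc -O2 mtread.c -o mtread -lpthread */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK   4096
#define READS   100
#define THREADS 10              /* vary this to reproduce the table above */

static int g_fd;
static off_t g_blocks;          /* number of 4 KB blocks in the file */

static void *reader(void *arg)
{
    unsigned int seed = (unsigned int)(long)arg;   /* per-thread RNG seed */
    char buf[BLOCK];
    for (int i = 0; i < READS; i++) {
        off_t off = (off_t)(rand_r(&seed) % g_blocks) * BLOCK;
        if (pread(g_fd, buf, BLOCK, off) < 0)      /* no lock: pread is atomic */
            perror("pread");
    }
    return NULL;
}

int main(void)
{
    g_fd = open("data.bin", O_RDONLY);             /* placeholder data file */
    if (g_fd < 0) { perror("open"); return 1; }
    g_blocks = lseek(g_fd, 0, SEEK_END) / BLOCK;

    pthread_t tid[THREADS];
    for (long i = 0; i < THREADS; i++)
        pthread_create(&tid[i], NULL, reader, (void *)(i + 1));
    for (int i = 0; i < THREADS; i++)
        pthread_join(tid[i], NULL);

    close(g_fd);
    return 0;
}
```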

System cache

Kernel parameters related to the system cache
  1. /proc/sys/vm/dirty_background_ratio
    This file specifies the percentage of total system memory that dirty data may occupy before the pdflush process is triggered to write the dirty data back to disk.
    Default: 10
  2. /proc/sys/vm/dirty_expire_centisecs
    This file specifies how long dirty data may reside in memory before the next run of the pdflush process writes it back to disk.
    Default: 3000 (in hundredths of a second)
  3. /proc/sys/vm/dirty_ratio
    This file specifies the percentage of total system memory that dirty data generated by a process may occupy before the process itself is made to write the dirty data back to disk.
    Default: 40
  4. /proc/sys/vm/dirty_writeback_centisecs
    This file specifies the interval at which the pdflush process periodically writes dirty data back to disk.
    Default: 500 (in hundredths of a second)
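These parameters can be inspected (and, as root, changed) simply by reading and writing the corresponding files; a minimal sketch that prints the current values:

```c
/* Minimal sketch: print the current values of the cache-related kernel
 * parameters listed above by reading them from /proc/sys/vm. */
#include <stdio.h>

int main(void)
{
    const char *params[] = {
        "/proc/sys/vm/dirty_background_ratio",
        "/proc/sys/vm/dirty_expire_centisecs",
        "/proc/sys/vm/dirty_ratio",
        "/proc/sys/vm/dirty_writeback_centisecs",
    };

    for (int i = 0; i < 4; i++) {
        FILE *fp = fopen(params[i], "r");
        if (!fp) { perror(params[i]); continue; }
        char value[64];
        if (fgets(value, sizeof(value), fp))
            printf("%-42s %s", params[i], value);   /* value already ends in '\n' */
        fclose(fp);
    }
    return 0;
}
```

Writing a new value into one of these files (or using sysctl) changes the corresponding write-back behavior at runtime.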
Dirty page write-back

The system generally writes dirty pages back to disk in the following three situations:

  1. Periodic write-back: the value of /proc/sys/vm/dirty_writeback_centisecs determines how often a write-back thread is started; with the usual value of 500 (5 seconds), the system starts a write-back thread every 5 seconds by default. A write-back thread started by this timer only writes back pages that have been dirty for longer than /proc/sys/vm/dirty_expire_centisecs / 100 seconds (default 3000, i.e. 30 seconds); it does not touch dirty pages that have not yet expired. The behavior can be tuned by modifying these values under /proc; see the kernel function wb_kupdate.
  2. When memory is insufficient: in this case not all dirty pages are written to disk; roughly 1024 pages are written back at a time, until enough free pages are available.
  3. When a write finds that dirty pages exceed a certain proportion: when the proportion of dirty pages in system memory exceeds /proc/sys/vm/dirty_background_ratio, the write system call wakes up pdflush to write dirty pages back until the dirty-page proportion falls below dirty_background_ratio, but the write call itself is not blocked and returns immediately; when the proportion of dirty pages exceeds /proc/sys/vm/dirty_ratio, the write system call itself blocks and writes dirty pages back until the proportion drops below dirty_ratio.
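When an application cannot afford to be throttled at the dirty_ratio threshold, one common technique (not from the original article) is to bound the amount of dirty data it creates by flushing explicitly, for example with sync_file_range/fdatasync; a minimal sketch:

```c
/* Sketch (not from the original article): bound how many dirty pages the
 * application accumulates by flushing explicitly after every batch, instead
 * of waiting for pdflush or being blocked at dirty_ratio. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BATCH_BYTES (64 << 20)      /* flush after every 64 MB written (tunable) */

int main(void)
{
    int fd = open("big.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);  /* placeholder */
    if (fd < 0) { perror("open"); return 1; }

    char block[4096];
    memset(block, 'x', sizeof(block));

    off_t written = 0, batch_start = 0;
    for (int i = 0; i < 262144; i++) {              /* write 1 GB in 4 KB blocks */
        if (write(fd, block, sizeof(block)) != (ssize_t)sizeof(block)) {
            perror("write"); return 1;
        }
        written += sizeof(block);
        if (written - batch_start >= BATCH_BYTES) {
            /* start write-back of this batch without blocking on completion */
            sync_file_range(fd, batch_start, written - batch_start,
                            SYNC_FILE_RANGE_WRITE);
            batch_start = written;
        }
    }
    fdatasync(fd);                                  /* ensure everything is on disk */
    close(fd);
    return 0;
}
```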
Summary

This article has presented performance test data for different storage methods, so that you can use it to choose an appropriate storage method in future program development. It has also described the file I/O read/write methods and the system cache issues that go with them.
