3.4 Disk bottlenecks
The disk subsystem is often the most important aspect of server performance and is usually the most common bottleneck.
However, problems can be hidden by other factors, such as lack of memory.
Applications are considered to be I/O-bound when CPU cycles are wasted simply waiting for I/O tasks to finish.
The most common disk bottleneck is having too few disks.
Most disk configurations are based on capacity requirements, not performance.
The least expensive solution is to purchase the smallest number of the largest capacity disks possible.
However, this places more user data on each disk, causing greater I/O rates to the physical disk and allowing disk bottlenecks to occur.
The second most common problem is having too many logical disks on the same array.
This increases seek time and significantly lowers performance.
The disk subsystem is discussed in 4.6, "Tuning the disk subsystem" on page 112.
3.4.1 Finding disk bottlenecks
A server exhibiting the following symptoms might be suffering from a disk bottleneck (or a hidden memory problem):
- Slow disks will result in:
  - Memory buffers filling with write data (or waiting for read data), which will delay all requests because free memory buffers are unavailable for write requests (or the response is waiting for read data in the disk queue).
  - Insufficient memory, as in the case of not enough memory buffers for network requests, which will cause synchronous disk I/O.
- Disk utilization, controller utilization, or both will typically be very high.
- Most LAN transfers will happen only after disk I/O has completed, causing very long response times and low network utilization.
- Disk I/O can take a relatively long time and disk queues will become full, so the CPUs will be idle or have low utilization because they wait long periods of time before processing the next request (a quick check for this symptom follows this list).
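Following up on that last symptom, a minimal check (the two-second interval here is an arbitrary choice) is to watch CPU I/O wait alongside block traffic:

# Report system statistics every 2 seconds until interrupted.
# The wa column shows the percentage of CPU time spent idle while
# waiting for outstanding disk I/O; sustained high values together
# with heavy bi/bo block traffic point toward a disk bottleneck.
vmstat 2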
The disk subsystem is perhaps the most challenging subsystem to properly configure.
Besides looking at raw disk interface speed and disk capacity, it is also important to understand the workload.
Is disk access random or sequential? Is there large I/O or small I/O?
Answering these questions provides the necessary information to make sure the disk subsystem is adequately tuned.
Disk manufacturers tend to showcase the upper limits of their drive technology's throughput.
However, taking the time to understand the throughput of your workload will help you understand what true expectations to have of your underlying disk subsystem.
Random read/write workloads usually require several disks to scale.
The bus bandwidths of SCSI or Fibre Channel are of lesser concern.
Larger databases with random access workload will benefit from having more disks.
Larger SMP servers will scale better with more disks.
Given the I/O profile of 70% reads and 30% writes of the average commercial workload, a RAID-10 implementation will perform 50% to 60% better than a RAID-5.
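As a rough sanity check of that figure (using the standard write-penalty rule of thumb, which this section does not state explicitly: a mirrored write costs two physical I/Os, while a RAID-5 write costs four, because it must read the old data and parity and then write both back), with 70% reads and 30% writes each logical I/O costs about 0.7 + (0.3 x 2) = 1.3 physical I/Os on RAID-10, versus 0.7 + (0.3 x 4) = 1.9 on RAID-5. The same set of disks can therefore sustain roughly 1.9 / 1.3, or about 1.46 times, the logical I/O rate on RAID-10, consistent with the 50% to 60% figure.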
Sequential workloads tend to stress the bus bandwidth of disk subsystems.
Pay special attention to the number of SCSI buses and Fibre Channel controllers when maximum throughput is desired.
Given the same number of drives in an array, RAID-10, RAID-0, and RAID-5 all have similar streaming read and write throughput.
There are two ways to approach disk bottleneck analysis: real-time monitoring and tracing.
- Real-time monitoring must be done while the problem is occurring. This might not be practical in cases where system workload is dynamic and the problem is not repeatable. However, if the problem is repeatable, this method is flexible because of the ability to add objects and counters as the problem becomes clear.
- Tracing is the collecting of performance data over time to diagnose a problem. This is a good way to perform remote performance analysis. Some of the drawbacks include the potential for having to analyze large files when performance problems are not repeatable, and the potential for not having all key objects and parameters in the trace and having to wait for the next time the problem occurs for the additional data (a sketch of a simple collection setup follows this list).
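As a minimal sketch of the tracing approach (the interval, sample count, and log path are illustrative assumptions, not values from this chapter):

# Append extended per-device disk statistics to a log every 60 seconds
# for 24 hours (1440 samples), for later offline analysis.
iostat -dxk 60 1440 >> /var/log/disk-trace.log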
vmstat command
One way to track disk usage on a Linux system is by using the vmstat tool. The important columns in vmstat with respect to I/O are the bi and bo fields. These fields monitor the movement of blocks in and out of the disk subsystem. Having a baseline is key to being able to identify any changes over time.
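For example, a baseline can be captured with a short sampling run (the interval and count here are arbitrary choices):

# Report statistics every 2 seconds, 5 times. The bi column shows
# blocks received from block devices (reads) and bo shows blocks
# sent to block devices (writes), both per second.
vmstat 2 5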
iostat command
Performance problems can be encountered when too many files are opened, read and written to, then closed repeatedly. This could become apparent as seek times (the time it takes to move to the exact track where the data is stored) start to increase. Using the iostat tool, you can monitor the I/O device loading in real time. Different options enable you to drill down even deeper to gather the necessary data.
Example 3-3 shows a potential I/O bottleneck on the device /dev/sdb1. This output shows average wait times (await) of about 2.7 seconds and service times (svctm) of 270 ms. For a more detailed explanation of the fields, see the man page for iostat(1).
Changes made to the elevator algorithm as described in 4.6.2, "I/O elevator tuning and selection" on page 115 will be seen in avgrq-sz (average size of request) and avgqu-sz (average queue length). As the latencies are lowered by manipulating the elevator settings, avgrq-sz will decrease. You can also monitor the rrqm/s and wrqm/s to see the effect on the number of merged reads and writes that the disks can manage.
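As a minimal sketch of that kind of real-time drill-down (the device name and interval are illustrative):

# Extended statistics for /dev/sdb only, refreshed every 2 seconds.
# Watch await, svctm, avgqu-sz, avgrq-sz, and the rrqm/s and wrqm/s
# merge counters discussed above.
iostat -dx /dev/sdb 2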
3.4.2 Performance tuning options
After verifying that the disk subsystem is a system bottleneck, several solutions are possible.
These solutions include the following:
- If the workload is of a sequential nature and it is stressing the controller bandwidth, the solution is to add a faster disk controller. However, if the workload is more random in nature, then the bottleneck is likely to involve the disk drives, and adding more drives will improve performance.
- Add more disk drives in a RAID environment. This spreads the data across multiple physical disks and improves performance for both reads and writes. This will increase the number of I/Os per second. Also, use hardware RAID instead of the software implementation provided by Linux. If hardware RAID is being used, the RAID level is hidden from the OS.
- Consider using Linux logical volumes with striping instead of large single disks or logical volumes without striping (a sketch follows this list).
- Offload processing to another system in the network (users, applications, or services).
- Add more RAM. Adding memory increases system memory disk cache, which in effect improves disk response times.
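As a minimal sketch of the striped-logical-volume option mentioned above (the volume group name vg0, the stripe count, the stripe size, and the volume size are all hypothetical):

# Create a 100 GB logical volume striped across 2 physical volumes
# in volume group vg0, using a 64 KB stripe size.
lvcreate --stripes 2 --stripesize 64 --size 100G --name lv_data vg0

# Verify the stripe layout of the new volume.
lvs --segments vg0/lv_data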