In the previous article we saw how vSAN handles failures of capacity devices and cache devices. So what happens when a vSAN host fails? Consider the following picture:
This situation differs slightly from a disk failure. When a disk fails, vSAN knows what has happened and knows that the disk will not come back, so it triggers a component rebuild right away. When a host fails, however, vSAN cannot tell what has happened or whether the host will return. This failure state is called "absent". Once vSAN notices that a component (the VMDK in the example above) is absent, it starts a 60-minute timer. If the component comes back within 60 minutes, vSAN resynchronizes the mirror copy. If it does not come back, vSAN creates a new mirror copy. Note that you can reduce this timeout by changing the advanced setting VSAN.ClomRepairDelay.
If the failed host recovers and rejoins the cluster, vSAN checks the rebuild state of each object. If the object has already been rebuilt on one or more other nodes, no further action is taken. If the rebuild is still in progress, the components on the original host are resynchronized as well, as a safeguard in case something goes wrong with the new components. Once everything is synchronized, the original host's components are discarded and the newly created copy is used. If the new components fail to finish synchronizing for any reason, the original components on the original host continue to be used.
Note: When a host fails, all virtual machines running on it are restarted by vSphere HA. vSphere HA can restart the virtual machines on any available host in the cluster, whether or not that host holds their vSAN components.
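The repair delay timer described above can be inspected and changed per host from the ESXi shell. The sketch below assumes the `esxcli system settings advanced` namespace and the `clomd` init script found on vSAN 6.x hosts. The `run` helper and the `DRY_RUN` switch are illustrative names, not part of any VMware tool. By default the functions only print the commands so they can be reviewed first.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 on an ESXi host to execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Show the current repair delay (in minutes) on this host.
show_repair_delay() {
    run esxcli system settings advanced list -o /VSAN/ClomRepairDelay
}

# Set the repair delay to $1 minutes, then restart clomd so it takes effect.
set_repair_delay() {
    minutes="$1"
    case "$minutes" in
        ''|*[!0-9]*)
            echo "minutes must be a non-negative integer" >&2
            return 1
            ;;
    esac
    run esxcli system settings advanced set -o /VSAN/ClomRepairDelay -i "$minutes"
    run /etc/init.d/clomd restart
}
```

Calling `set_repair_delay 30` prints the two commands that would lower the timer to 30 minutes. The value should be kept identical on every host in the cluster.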
Supplementary optimization information: (from http://blog.51cto.com/roberthu/2049330)
vSAN 6.2 advanced parameter tuning
esxcfg-advcfg -s 1024 /LSOM/heapSize
esxcfg-advcfg -s 180 /VSAN/ClomMaxComponentSizeGB
esxcfg-advcfg -s 512 /LSOM/blPLOGCacheLines   # default value is … K, increased to …
esxcfg-advcfg -s 32 /LSOM/blLLOGCacheLines   # default value is 128, increased to … K
- This parameter must be changed before virtual machines are deployed on the host.
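Before changing these values it can help to record what each host currently uses. The sketch below assumes the `-g` (get) flag of `esxcfg-advcfg` and the four options listed above, written with the capitalization used in VMware's documentation (worth verifying on your build). `run` and `DRY_RUN` are illustrative helper names, and by default the commands are only printed.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 on an ESXi host to execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Query the current value of each option tuned in this section.
show_vsan_tuning() {
    for opt in /LSOM/heapSize /VSAN/ClomMaxComponentSizeGB \
               /LSOM/blPLOGCacheLines /LSOM/blLLOGCacheLines; do
        run esxcfg-advcfg -g "$opt"
    done
}
```

Saving the output of `show_vsan_tuning` per host gives a baseline to return to if the tuned values cause problems.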
Appendix: further study
What congestion means
Congestion is a feedback mechanism that reduces the rate of incoming IO requests from the vSAN DOM client layer to a level that the vSAN disk groups can service. The incoming rate is reduced by introducing an IO delay at the ingress that is equivalent to the delay the IO would have incurred at the bottleneck in the lower layers. This is an effective way to shift latency from the lower layers to the ingress without changing the overall throughput of the system. It avoids unnecessary queuing and tail drops in the vSAN LSOM layer, and so avoids wasting CPU cycles on IO requests that might eventually be dropped. For that reason, whatever the type of congestion, small and temporary congestion values are usually fine and do not harm system performance. Sustained and large congestion values, however, can lead to higher latency and lower throughput than expected, so they deserve attention and should be resolved to improve benchmark performance.
How congestion is reported
vSAN measures and reports congestion as a scalar value between 0 and 255. The IO delay that is introduced increases exponentially with the congestion value.
Possible ways to handle congestion
Check whether congestion is sustained and high (> 50). In most cases, high congestion is caused by a misconfigured or poorly performing system. If congestion stays high, check the following:
1. The maximum queue depth supported by the IO controller and devices. A maximum supported queue depth below 100 may cause problems. Check that the controller is certified and listed on the vSAN HCL.
2. Incorrect firmware or device driver versions. Refer to the VMware HCL for vSAN-compatible versions.
3. Incorrect sizing. Incorrect sizing of the cache-tier disks and memory can lead to high congestion values.
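The "sustained and high" test can be sketched as a small shell function. It assumes that "sustained" means every sample in an observation window exceeds the threshold, which is one possible reading and not a definition taken from VMware. How the samples are collected is left out, and the function name is illustrative.

```shell
#!/bin/sh
# Return 0 (true) when every sample exceeds the threshold.
# Usage: is_sustained_high THRESHOLD SAMPLE...
is_sustained_high() {
    threshold="$1"
    shift
    [ "$#" -gt 0 ] || return 1          # no samples, nothing to conclude
    for value in "$@"; do
        # Congestion is reported as an integer in the range 0..255.
        if [ "$value" -lt 0 ] || [ "$value" -gt 255 ]; then
            echo "invalid congestion value: $value" >&2
            return 2
        fi
        # A single sample at or below the threshold breaks the streak.
        [ "$value" -gt "$threshold" ] || return 1
    done
    return 0
}
```

For example, `is_sustained_high 50 60 72 55` succeeds, while `is_sustained_high 50 60 10 55` fails because one sample dropped back below the threshold.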
If none of these is the cause, you need to investigate whether the benchmark can be tuned to reduce congestion. Note whether:
1. All disk groups are congested, or
2. One or two disk groups show abnormally higher congestion than the others.
For case (1), it is very likely that the vSAN cluster back end cannot handle the IO workload. If possible, tune the benchmark by:
- Powering off some virtual machines, or
- Reducing the number of outstanding IOs or threads in each virtual machine, or
- For write workloads, reducing the size of the working set.
For case (2), where congestion on one disk group is much higher than on the others in the system, write IO activity is unbalanced across the disk groups. If this happens consistently, try increasing the number of disk stripes per object in the vSAN storage policy used to create the virtual machine disks.
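The two cases can be told apart with a simple count over the per-disk-group congestion values. The sketch below is a simplification: it treats a disk group as congested when its value exceeds a threshold (for example the 50 used earlier), which only approximates "abnormally higher than the others". The function name and output labels are illustrative.

```shell
#!/bin/sh
# Print which case applies to a set of per-disk-group congestion values.
# Usage: congestion_case THRESHOLD VALUE...
congestion_case() {
    threshold="$1"
    shift
    total="$#"
    high=0
    for value in "$@"; do
        if [ "$value" -gt "$threshold" ]; then
            high=$((high + 1))
        fi
    done
    if [ "$total" -eq 0 ] || [ "$high" -eq 0 ]; then
        echo "none"       # no disk group is above the threshold
    elif [ "$high" -eq "$total" ]; then
        echo "case1"      # every disk group is congested
    else
        echo "case2"      # only some disk groups are congested
    fi
}
```

With three disk groups, `congestion_case 50 80 90 75` prints `case1` and `congestion_case 50 5 120 8` prints `case2`.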
Common congestion types and how to troubleshoot them
The types of congestion and the remedy for each are listed below:
1. SSD congestion: SSD congestion is typically raised when the active working set of write IOs for a specific disk group is much larger than the cache tier of that disk group. In both hybrid and all-flash vSAN clusters, data is first written to the write cache (also known as the write buffer). A process known as destaging moves the data from the write buffer to the capacity disks. The write cache absorbs a high write rate, so that write performance is not limited by the capacity disks. However, if a benchmark fills the write cache very quickly, the destaging process may not keep pace with the arriving IO rate. In that case, SSD congestion is raised to signal the vSAN DOM client layer to slow IOs down to a rate that the vSAN disk group can handle.
Remedy: To avoid SSD congestion, tune the size of the virtual machine disks that the benchmark uses. For the best results, we recommend that the size of the virtual machine disks (the active working set) be no larger than 40% of the cumulative size of the write caches across all disk groups. Note that in a hybrid vSAN cluster, the write cache size is 30% of the size of the cache-tier disk. In an all-flash cluster, the write cache size is the size of the cache-tier disk, but no more than … GB.
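The sizing rule for a hybrid cluster can be worked through with integer arithmetic. The sketch below assumes every disk group has one cache-tier disk of the same size. It does not cover all-flash clusters, and the function name is illustrative.

```shell
#!/bin/sh
# Recommended upper bound for the benchmark's active working set on a
# hybrid cluster, in GB (integer arithmetic, rounded down).
# Usage: max_working_set_gb DISK_GROUPS CACHE_DISK_GB
max_working_set_gb() {
    disk_groups="$1"
    cache_disk_gb="$2"
    # Hybrid: the write cache is 30% of each cache-tier disk.
    total_write_cache_gb=$((disk_groups * cache_disk_gb * 30 / 100))
    # Keep the working set within 40% of the cumulative write cache.
    echo $((total_write_cache_gb * 40 / 100))
}
```

For four disk groups with a 400 GB cache disk each, the cumulative write cache is 480 GB, so `max_working_set_gb 4 400` prints `192`.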
2. Log congestion: Log congestion is typically raised when the vSAN LSOM logs (which store metadata for IO operations that have not yet been destaged) consume a large amount of space in the write cache.
Typically, a large volume of small writes on a small working set generates a large number of vSAN LSOM log entries and causes this type of congestion. Additionally, if the benchmark does not issue 4K-aligned IOs, the number of IOs on the vSAN stack is inflated to account for 4K alignment. The increased number of IOs can lead to log congestion.
Remedy: Check whether the benchmark aligns IO requests on 4K boundaries. If alignment is not the problem, check whether the benchmark uses a very small working set (a working set is considered small when the total size of the accessed virtual machine disks is less than 10% of the cache tier size; see above for how to calculate the cache tier size). If so, increase the working set to 40% of the cache tier size. If neither condition applies, reduce write traffic in one of two ways: reduce the number of outstanding IOs that the benchmark issues, or reduce the number of virtual machines that the benchmark creates.
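The two checks in this remedy are simple arithmetic and can be sketched as follows. The function names are illustrative, the IO offset and length are assumed to be in bytes, and the working set and cache tier size must be given in the same unit.

```shell
#!/bin/sh
# Return 0 when both the byte offset and the byte length of an IO are
# multiples of 4096.
is_4k_aligned() {
    offset="$1"
    length="$2"
    [ $((offset % 4096)) -eq 0 ] && [ $((length % 4096)) -eq 0 ]
}

# Return 0 when the working set is below 10% of the cache tier size.
# Both arguments must use the same unit (for example GB).
is_small_working_set() {
    working_set="$1"
    cache_tier="$2"
    [ $((working_set * 10)) -lt "$cache_tier" ]
}
```

An IO at offset 8192 with length 4096 passes `is_4k_aligned`, while one at offset 512 does not. A 30 GB working set against a 400 GB cache tier counts as small, whereas 40 GB sits exactly at the 10% boundary and does not.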
3. Component congestion: This congestion indicates that there is a large volume of outstanding commit operations for some components, so IO requests to those components are being queued. This can lead to higher latency. Typically, a heavy volume of writes to a few virtual machine disks causes this congestion.
Remedy: Increase the number of virtual machine disks that the benchmark uses. Make sure the benchmark does not issue IOs to only a few virtual machine disks.
4. Memory and slab congestion: Memory and slab congestion usually means that the vSAN LSOM layer is running out of heap memory space or slab space for its internal data structures. vSAN provisions a fixed amount of system memory for its internal operations. However, if a benchmark issues IOs aggressively without any throttling, vSAN can use up all of the memory space allocated to it.
Remedy: Reduce the working set of the benchmark. Alternatively, while experimenting with benchmarks, increase the following settings to raise the amount of memory reserved for the vSAN LSOM layer. Note that these settings apply per disk group. In addition, we do not recommend using these settings on a production cluster. You can change them through ESXCLI (see Knowledge Base article 1038578) as follows:
/LSOM/blPLOGCacheLines, default value is … K, increase to …
/LSOM/blPLOGLsnCacheLines, default value is 4 K, increase to … K
/LSOM/blLLOGCacheLines, default value is 128, increase to … K
Host failure and optimization for vSAN