Service systems generally run under strict timeouts, and helping business teams troubleshoot occasional slow responses (latency spikes) is one of the services that the infrastructure department's resident experts provide.
Sometimes, even when your code is doing its best, it still responds slowly, because it's a tough world out there. This article gives some examples from three directions.
The first aspect is mostly a warm-up; the two more interesting aspects are covered in the second installment.
The first aspect: the operating system
Background reading: the swap and page cache sections of "Revisiting Efficient File Reads and Writes through Apache Kafka".
1. Disable Swap
Linux has a strange habit: when memory runs short, depending on its mood, there is a good chance it will not reclaim the page cache used for IO caching, but will instead swap cold application memory pages out to disk (see the background reading for the exact algorithm). When that memory is accessed again, the process stalls while the page is brought back into memory (a so-called major page fault). A slowly growing old generation or an off-heap memory pool may both be treated as cold memory. Use cat /proc/[pid]/status to check the size of VmSwap, and use dstat to monitor when paging occurs.
Adding the following line to /etc/sysctl.conf largely eliminates swapping. Setting the value to 0 can lead to OOM (there is a reported case of this), so some people set it to 1 instead; that works too.
vm.swappiness = 10
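To check whether a given process has already had memory swapped out, the VmSwap field mentioned above can be read programmatically. The following is a minimal sketch, assuming a Linux /proc filesystem; the function names parse_vmswap_kb and read_vmswap_kb are made up for this illustration.

```python
from typing import Optional


def parse_vmswap_kb(status_text: str) -> Optional[int]:
    """Extract the VmSwap value (in kB) from the text of /proc/[pid]/status.

    Returns None when the field is absent (for example, kernel threads).
    """
    for line in status_text.splitlines():
        if line.startswith("VmSwap:"):
            # The line looks like: "VmSwap:      1234 kB"
            return int(line.split()[1])
    return None


def read_vmswap_kb(pid: int) -> Optional[int]:
    """Read VmSwap for a live process; requires Linux and read permission."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vmswap_kb(status_file.read())


if __name__ == "__main__":
    sample = "Name:\tjava\nVmRSS:\t  512000 kB\nVmSwap:\t    2048 kB\n"
    print(parse_vmswap_kb(sample))
```

A non-zero value means part of the process currently lives in swap, so touching that memory again will cost a major page fault.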
2. Increase the Page Cache Flush Frequency
This is another odd Linux default. The page cache mechanism is a long story (again, see the background reading), but in short: by default, IO is not written straight to disk but into page cache memory, and the flusher threads only start writing to disk once an inode has been dirty for 30 seconds, or once dirty data reaches 10% of available memory (free + page cache - mmap).
Our production machines have at least 20 GB of memory. Imagine how long writing 2 GB takes at the roughly 100 MB/s of an ordinary hard disk .... Fortunately, that condition is rarely reached. Usually the flush is triggered when a few log files hit the 30-second limit, writing a few hundred MB at a time and taking three seconds or so.
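The arithmetic behind those numbers can be spelled out. This is a rough sketch using decimal units and ignoring seek time and competing IO; the 300 MB figure is an assumed stand-in for "a few hundred MB", and the function names are made up for this illustration.

```python
GB = 1000 ** 3
MB = 1000 ** 2


def dirty_threshold_bytes(available_bytes: int, ratio_percent: int = 10) -> int:
    """Dirty-data threshold when it is a percentage of available memory."""
    return available_bytes * ratio_percent // 100


def flush_seconds(dirty_bytes: int, disk_bytes_per_second: int) -> float:
    """Best-case time to write the dirty data at a sustained disk rate."""
    return dirty_bytes / disk_bytes_per_second


if __name__ == "__main__":
    threshold = dirty_threshold_bytes(20 * GB)      # 10% of 20 GB is 2 GB
    print(flush_seconds(threshold, 100 * MB))       # full 2 GB flush
    print(flush_seconds(300 * MB, 100 * MB))        # assumed 300 MB log flush
```

Under these assumptions a full 2 GB flush takes about 20 seconds, while the more typical few-hundred-MB flush takes about 3 seconds, which matches the figure quoted above.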
The background reading says that the background flusher thread does not block the application's write(2). But
the application's write path looks like this:
lock inode -> lock page -> write page -> unlock page -> unlock inode -> lock inode page -> write inode page -> unlock inode page
And the flusher's path looks like this:
lock page -> put the page on the IO queue and wait for the IO scheduler to complete the disk write and return -> unlock page
As you can see, locks are still involved, and the IO scheduler is not perfectly fair, so when IO is busy the application can still block.
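The contention between the two paths can be illustrated with a toy model. This is only a user-space sketch, not kernel code: a threading.Lock stands in for the page lock, time.sleep stands in for the disk write, and measure_writer_stall is a name made up for this illustration.

```python
import threading
import time


def measure_writer_stall(flush_seconds: float) -> float:
    """Return how long a writer waits for a page lock held by a flusher."""
    page_lock = threading.Lock()          # stands in for the kernel's page lock
    flusher_has_lock = threading.Event()  # writer starts only after the flusher locked

    def flusher() -> None:
        with page_lock:                   # lock page
            flusher_has_lock.set()
            time.sleep(flush_seconds)     # page waits in the IO queue for the write
        # leaving the with-block is the "unlock page" step

    thread = threading.Thread(target=flusher)
    thread.start()
    flusher_has_lock.wait()

    started = time.monotonic()
    with page_lock:                       # the application's "lock page" step
        waited = time.monotonic() - started
    thread.join()
    return waited


if __name__ == "__main__":
    print(f"writer stalled for {measure_writer_stall(0.2):.3f}s")
```

The writer's stall tracks the time the flusher spends waiting on IO, which is why a slow or busy disk shows up as application latency even though the flush is nominally asynchronous.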
Our approach is to set the threshold as an absolute value of 100 MB instead of a percentage of available memory.
Add this line to /etc/sysctl.conf:
vm.dirty_background_bytes = 104857600
The JVM pause examples in the second installment are all IO-related; even without tuning the JVM, simply lowering this threshold alleviates them considerably.
Of course, the optimal value depends on the machine configuration and the application's characteristics, and has to be worked out case by case.
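When choosing a value, it helps to watch how much dirty data actually accumulates on the machine. The following is a minimal sketch that reads the Dirty field of /proc/meminfo and compares it with the 100 MB threshold used above; the function names are made up for this illustration, and the kernel begins flushing when the threshold is crossed, so brief readings above it are expected.

```python
def parse_meminfo_kb(meminfo_text: str) -> dict:
    """Parse /proc/meminfo text into a {field name: numeric value} dict."""
    values = {}
    for line in meminfo_text.splitlines():
        name, separator, rest = line.partition(":")
        fields = rest.split()
        if separator and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def dirty_over_threshold(meminfo_text: str, threshold_bytes: int = 104857600) -> bool:
    """True when the Dirty field (reported in kB) exceeds the byte threshold."""
    dirty_kb = parse_meminfo_kb(meminfo_text).get("Dirty", 0)
    return dirty_kb * 1024 > threshold_bytes


if __name__ == "__main__":
    with open("/proc/meminfo") as meminfo_file:   # Linux only
        text = meminfo_file.read()
    print(parse_meminfo_kb(text).get("Dirty"), dirty_over_threshold(text))
```

Sampling this periodically alongside response-time logs gives a rough idea of whether slow responses line up with large flushes.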
3. Network Parameters
There are too many tunables to list here; a good reference is the Aliyun team's article summarizing the parameters related to Linux TCP queues. The same caveat applies: do not start changing settings just because an article says so. Base them on your own situation.
For example, we set the CPU affinity of the NIC's soft interrupt queues:
Normally, NIC interrupts may be handled by a single core, and under heavy traffic that core gets saturated.
Even with irqbalance running, only one CPU (12 cores) was used.
In the end we manually mapped the 24 NIC interrupt queues to the 24 cores, which gave the best results ... but your situation will not necessarily be the same.
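A one-queue-per-core mapping comes down to writing a single-bit hexadecimal CPU mask into /proc/irq/<irq>/smp_affinity for each queue's IRQ. The sketch below only computes the masks. The IRQ numbers 64 to 87 are hypothetical; the real ones have to be looked up in /proc/interrupts for your NIC, and the function names are made up for this illustration.

```python
def affinity_mask(core: int) -> str:
    """Hex bitmask selecting exactly one core, as used by smp_affinity.

    Limited to cores 0-31 here; larger systems use comma-separated
    32-bit groups, which this sketch does not produce.
    """
    if not 0 <= core < 32:
        raise ValueError("this sketch only handles cores 0-31")
    return format(1 << core, "x")


def one_queue_per_core(irqs) -> dict:
    """Map the i-th IRQ in the sequence to a mask selecting core i."""
    return {irq: affinity_mask(core) for core, irq in enumerate(irqs)}


if __name__ == "__main__":
    hypothetical_irqs = range(64, 88)   # 24 queues; placeholder IRQ numbers
    for irq, mask in one_queue_per_core(hypothetical_irqs).items():
        # Applying a mask means writing it to this file as root.
        print(f"/proc/irq/{irq}/smp_affinity <- {mask}")
```

Actually applying the masks requires root, and a running irqbalance may rewrite them afterwards, so this is best treated as a starting point for experiments rather than a recipe.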