Troubleshoot low Erlang scheduler CPU utilization

Source: Internet
Author: User
-Cause

In a recently launched group of services, CPU usage on some node servers was very low, only about 1/4 of that on the other servers. After ruling out an uneven service load, we suspected that the system's top statistics were wrong. Checking scheduler utilization with erlang:statistics(scheduler_wall_time) showed that on the machines with low server CPU usage the schedulers' actual utilization was close to 100%, while on the other machines it was below 30%.
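As an illustration of how scheduler utilization is derived from scheduler_wall_time, the sketch below applies the documented calculation: take two snapshots and, for each scheduler, divide the change in active time by the change in total time. It is written in Python purely to show the arithmetic. The snapshot values are made up, and the function name is hypothetical. On a real node the snapshots come from erlang:statistics(scheduler_wall_time) after enabling collection with erlang:system_flag(scheduler_wall_time, true).

```python
from typing import Dict, List, Tuple

# One entry per scheduler: (scheduler_id, active_time, total_time),
# mirroring the tuples returned by erlang:statistics(scheduler_wall_time).
Sample = List[Tuple[int, int, int]]


def scheduler_utilization(first: Sample, second: Sample) -> Dict[int, float]:
    """Per-scheduler utilization between two snapshots, as a 0..1 fraction."""
    before = {sid: (active, total) for sid, active, total in first}
    result = {}
    for sid, active, total in second:
        active0, total0 = before[sid]
        elapsed = total - total0
        # Guard against identical snapshots (no elapsed time).
        result[sid] = (active - active0) / elapsed if elapsed > 0 else 0.0
    return result


# Hypothetical snapshots taken some time apart (made-up numbers).
t0 = [(1, 1_000, 10_000), (2, 2_000, 10_000)]
t1 = [(1, 10_500, 20_000), (2, 4_500, 20_000)]

for sid, util in sorted(scheduler_utilization(t0, t1).items()):
    print(f"scheduler {sid}: {util:.1%}")
```

Note that time spent busy waiting may be counted as active time by this statistic, which would be consistent with schedulers reporting near-100% utilization while doing little useful work.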

Comparing the different services, we found that scheduler CPU utilization was low only on nodes running a small number of processes.

 

-WhatsApp case

Few cases of this kind are documented for Erlang. Fortunately, WhatsApp has provided a detailed analysis of a similar one:

http://highscalability.com/blog/2014/2/26/the-whatsapp-architecture-facebook-bought-for-19-billion.html
First bottleneck showed up at 425K. System ran into a lot of contention. Work stopped. Instrumented the scheduler to measure how much useful work is being done, or sleeping, or spinning. Under load it started to hit sleeping locks so 35-45% CPU was being used across the system but the schedulers are at 95% utilization.
WhatsApp hit a bottleneck at 425K connections on a single host: system-wide CPU usage was only 35-45%, yet the schedulers were at 95% utilization. The article gives no further details. Its scheduler-related tuning notes are:

1. +swt low: set the scheduler wake-up threshold to low, because schedulers would go to sleep and would never wake up.
2. Set the process priority to real-time: run beam at real-time priority so that other things like cron jobs don't interrupt scheduling. This prevents glitches that would cause backlogs of important user traffic.
3. Disable spin (patch beam): patch to dial down spin counts so the scheduler wouldn't spin, +ssct 1 (via patch; scheduler spin count).

 

-Tool Analysis

Via a private message on Weibo I consulted Zheng Siyao, who recommended analyzing the node with VTune. The working assumption was that the scheduler itself consumes too much CPU when handling a large number of processes.

Register on the official Intel website; an email arrives immediately and grants a 30-day trial.

The download is very slow, so downloading through a VPN is recommended. The command-line mode of the Linux version of VTune is easy to use:

    tar -zxf vtune_amplifier_xe_2015.tar.gz
    cd vtune_amplifier_xe_2015
    ./install.sh
    cd /opt/intel/vtune_amplifier_xe_2015.1.0.367959/
    source amplxe-vars.sh
    amplxe-cl -collect lightweight-hotspots -run-pass-thru=--no-altstack -target-pid=1575
    amplxe-cl -report hotspots

These commands can be run against the live service without affecting its normal operation. The result looks like this:

    Summary
    -------
    Elapsed Time:       19.345
    CPU Time:           182.023
    Average CPU Usage:  9.155
    CPI Rate:           1.501

    Function                                     Module              CPU Time:Self
    -------------------------------------------  ------------------  -------------
    sched_spin_wait                              beam.smp                   72.754
    raw_local_irq_enable                         vmlinux                    19.282
    process_main                                 beam.smp                   10.476
    ethr_native_atomic32_read                    beam.smp                    8.337
    [email protected]0xffffffff8100af60           vmlinux                     3.007
    __pthread_mutex_lock                         libpthread-2.12.so          2.342
    raw_local_irq_restore                        vmlinux                     1.973
    __sched_yield                                libc-2.12.so                1.913
    pthread_mutex_unlock                         libpthread-2.12.so          1.553
    __audit_syscall_exit                         vmlinux                     1.192
    system_call                                  vmlinux                     1.156
    erts_thr_yield                               beam.smp                    1.114
    handle_delayed_dealloc                       beam.smp                    0.977
    update                                       beam.smp                    0.828
    raw_local_irq_enable                         vmlinux                     0.780

 

We can see that sched_spin_wait accounts for about 40% of the CPU time (72.754 of 182.023).
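The 40% figure can be checked directly from the hotspot report. The short sketch below divides each function's self time by the total "CPU Time" from the summary. Only the top four rows are copied in, and the helper name is a hypothetical one chosen for illustration.

```python
# Total "CPU Time" from the VTune summary in the report above.
TOTAL_CPU_TIME = 182.023

# Self CPU time of the four most expensive functions, copied from the report.
self_time = {
    "sched_spin_wait": 72.754,
    "raw_local_irq_enable": 19.282,
    "process_main": 10.476,
    "ethr_native_atomic32_read": 8.337,
}


def share(seconds: float, total: float = TOTAL_CPU_TIME) -> float:
    """Percentage of the total CPU time spent in one function."""
    return 100.0 * seconds / total


for name, seconds in self_time.items():
    print(f"{name:<28} {share(seconds):5.1f}%")
```

By this measure the spin-wait loop alone costs roughly seven times as much CPU time as process_main, the function that actually executes Erlang code.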

    #define ERTS_SCHED_SPIN_UNTIL_YIELD 100

    static erts_aint32_t
    sched_spin_wait(ErtsSchedulerSleepInfo *ssi, int spincount)
    {
        int until_yield = ERTS_SCHED_SPIN_UNTIL_YIELD;
        int sc = spincount;
        erts_aint32_t flgs;

        do {
            flgs = erts_smp_atomic32_read_acqb(&ssi->flags);
            if ((flgs & (ERTS_SSI_FLG_SLEEPING|ERTS_SSI_FLG_WAITING))
                != (ERTS_SSI_FLG_SLEEPING|ERTS_SSI_FLG_WAITING)) {
                break;
            }
            ERTS_SPIN_BODY;
            if (--until_yield == 0) {
                until_yield = ERTS_SCHED_SPIN_UNTIL_YIELD;
                erts_thr_yield();
            }
        } while (--sc > 0);
        return flgs;
    }

The default spincount is 10000, but every iteration performs an atomic read, and an atomic operation generally takes dozens to hundreds of CPU cycles, so each spin wait actually runs for a long time before the scheduler goes to sleep.
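To get a feel for the magnitude, here is a rough back-of-the-envelope sketch of one full spin wait. The spincount of 10000 and the yield interval of 100 come from the text and code above. The cost of 100 cycles per iteration and the 2.5 GHz clock are assumptions picked for illustration, and the cost of the yield calls themselves is not included.

```python
def spin_wait_estimate(
    spincount: int = 10_000,          # default spin count mentioned in the text
    cycles_per_iteration: int = 100,  # assumed cost of one loop iteration
    clock_hz: float = 2.5e9,          # assumed CPU clock frequency
    spin_until_yield: int = 100,      # ERTS_SCHED_SPIN_UNTIL_YIELD
) -> dict:
    """Rough cost of one full spin wait that never finds new work."""
    cycles = spincount * cycles_per_iteration
    return {
        "cycles": cycles,
        "milliseconds": cycles / clock_hz * 1e3,
        # The loop calls erts_thr_yield() once every spin_until_yield iterations.
        "yields": spincount // spin_until_yield,
    }


estimate = spin_wait_estimate()
print(estimate["cycles"], "cycles,",
      round(estimate["milliseconds"], 3), "ms,",
      estimate["yields"], "yields")
```

Under these assumptions an idle scheduler burns about a million cycles each time it runs out of work, so a node whose schedulers run out of work frequently can spend much of its CPU time spinning.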

The corresponding configuration option is:

  +sbwt none|very_short|short|medium|long|very_long
  Set scheduler busy wait threshold. Default is medium. The threshold determines how long schedulers should busy wait when running out of work before going to sleep.

Starting the VM with +sbwt none disables spinning completely, so there is no need to patch beam as WhatsApp did.
