Load balancing of multi-threaded routines


This article discusses load balancing, specifically the balancing of load among the worker threads (or processes) within a multi-threaded (or multi-process) server program.


Consider a common server model: a receiver thread is responsible for accepting requests, and behind it sits a thread pool containing a number of worker threads; each received request is handed to one of these workers for processing. The receiver and the workers communicate through a pthread_cond plus a request queue. The usual practice is that the receiver puts the request into the queue and then signals the condition variable; which worker gets woken up is the kernel's business (in fact the kernel wakes waiters in arrival order, waking the process that started waiting earliest; see "Linux Futex Analysis"). Usually this is enough, and waking a worker from the receiver does not need any load-balancing logic. But sometimes we can do some load-balancing work of our own to improve the server's performance.
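For concreteness, here is a minimal sketch of that baseline pattern; the names (request_queue, handle_request) are illustrative only and are not taken from any particular server:

// Minimal sketch of the receiver/worker pattern described above.
// request_queue and handle_request are illustrative names, not real server code.
#include <pthread.h>
#include <queue>

static std::queue<int> request_queue;                 // pending requests
static pthread_mutex_t mtx  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void handle_request(int /*req*/) { /* application logic goes here */ }

void receiver_post(int req)        // called by the receiver thread for each request
{
    pthread_mutex_lock(&mtx);
    request_queue.push(req);
    pthread_mutex_unlock(&mtx);
    pthread_cond_signal(&cond);    // which waiting worker wakes up is the kernel's choice
}

void* worker(void*)                // every worker thread runs this loop
{
    for (;;) {
        pthread_mutex_lock(&mtx);
        while (request_queue.empty())
            pthread_cond_wait(&cond, &mtx);
        int req = request_queue.front();
        request_queue.pop();
        pthread_mutex_unlock(&mtx);
        handle_request(req);       // process outside the lock
    }
    return NULL;
}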


Kernel Load Balancing Overview





Since the load balancing discussed here is closely related to the kernel's load balancing, let us first look at what the kernel's load balancer does. For details see "Linux Kernel SMP Load Balancing"; here are just a few brief generalizations.





Plainly, kernel load balancing does one thing: it spreads the runnable processes in the system so that the load looks balanced in every scheduling domain. How should this be understood? A modern CPU generally has several levels: physical package, core, and hyper-thread. "Balanced in every scheduling domain" means balanced at every level: the total load on each physical CPU is equal, the total load on each core is equal, and the load on each hyper-thread is equivalent.





The lowest-level "CPU" that the system sees is a hyper-thread, so we might intuitively think it is enough to spread the runnable processes across these "CPUs". In fact kernel load balancing has higher requirements. Suppose our machine has 2 physical CPUs, 2 cores per physical CPU, and 2 hyper-threads per core, for a total of 8 "CPUs". If there are 8 runnable processes of equal priority, then giving each "CPU" one process is naturally balanced. But if there are only 4 runnable processes (again of equal priority), real balance is not merely that each process lands on its own "CPU"; it further requires that each physical CPU runs two of them and each core runs one.


Why such a strong constraint? Because although the "CPUs" are logically independent (there is no master-slave relationship), they are not isolated. "CPUs" under the same physical package share the cache, and "CPUs" under the same core share execution resources (hyper-threading is essentially one pipeline running two hardware threads), and sharing means contention. So when the runnable processes cannot be spread exactly evenly over the "CPUs", we need to consider whether the higher CPU levels are balanced, to avoid contention for the cache and the pipeline (and besides performance, this also reflects the kernel's fairness).





Finally, one more point: the kernel's load balancing is asynchronous. To avoid consuming too many resources, the kernel cannot monitor every "CPU" in real time and react to changes immediately (except for real-time processes, which are outside the scope of this discussion).


Load Balancing Considerations in the Server





With kernel load balancing as the foundation, let us look at what the receiver thread in our server can do.





The first issue is the number of worker threads. What happens if there are too many workers? Assume again the machine with 8 "CPUs" above, assume we start 80 workers, and assume these 80 threads end up evenly spread, 10 per "CPU", all waiting for tasks. When a batch of requests comes in, because our receiver has no load-balancing strategy, which "CPU" a woken worker sits on is essentially random. What is the probability that 8 requests arriving "simultaneously" fall on 8 different "CPUs"? It is (70*60*50*40*30*20*10)/(79*78*77*76*75*74*73) ≈ 0.34%. In other words, almost certainly some "CPUs" will have to handle several requests while others sit idle doing nothing, and the resulting performance can be imagined. By the time kernel load balancing notices and migrates the work onto separate "CPUs", the requests may already be nearly finished; and when the next batch arrives, the load is a mess again, because the workers that were just migrated have been put back at the tail of the cond wait queue, so the first responders to the new requests are the not-yet-balanced workers at the head of the queue.
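For reference, the 0.34% figure can be reproduced with a few lines; this is only a back-of-the-envelope check under the stated assumptions (80 idle workers, 10 per "CPU", each request waking a uniformly random idle worker):

// Probability that 8 simultaneous requests land on 8 distinct "CPUs".
#include <stdio.h>

int main()
{
    double p = 1.0;
    for (int k = 1; k < 8; k++) {              // the 2nd through 8th request
        double favorable = (8 - k) * 10.0;     // idle workers on still-unused "CPUs"
        double remaining = 80.0 - k;           // idle workers left in total
        p *= favorable / remaining;
    }
    printf("%.3f%%\n", p * 100.0);             // prints 0.345%, i.e. roughly 0.34%
    return 0;
}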


Would balance be reached after several rounds of requests? If requests really did arrive in neat rounds and every request took exactly the same time to process, balance might eventually be reached, but real workloads are far from that.


What is the solution? Changing the cond's first-in-first-out wait queue into LIFO stack logic may mitigate the problem, but the better way is to limit the number of workers to equal, or be slightly less than, the number of "CPUs", so that balance comes naturally.





The second question: since we recognize the value of kernel load balancing across every scheduling domain, can the receiver thread in our server derive a benefit from a similar approach? Having learned our lesson, we now start only 8 worker threads. Thanks to kernel load balancing, these 8 threads are basically fixed, one per "CPU". Suppose there are now 4 requests and they fall on 4 different "CPUs". If we are lucky, these 4 "CPUs" belong to different cores, and processing the requests involves no contention for CPU resources; if not, we may end up with 2 very busy cores and 2 idle ones.


To solve this problem we need to do two things; let us continue with the server program above as the example. First, the receiver thread has to know which "CPU" each worker thread sits on; second, it needs a balance-aware view when assigning tasks. For the first point, it is best to use the CPU-affinity call (sched_setaffinity, or pthread_setaffinity_np for threads) to pin each worker to one "CPU", so that kernel load balancing does not complicate matters; since we already concluded that the number of workers should equal (or be slightly less than) the number of "CPUs", pinning one thread per "CPU" is feasible. For the second point, we make some improvements on top of the existing pthread_cond by assigning a priority to each worker thread that enters the wait state, for example the first hyper-thread of each core as first priority and the second hyper-thread as second priority. When the cond wakes a worker, we then try to avoid putting two woken workers on the same core. The implementation can use the futex bitset operations, with the bitset encoding the priority, so that a worker of the desired priority is woken. (See "Linux Futex Analysis".)
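A minimal sketch of the pinning part is below; it assumes Linux/glibc, and the example program at the end of this article does the same thing in its force_cpu helper (the priority-wake part corresponds to the LayeredCond class there):

// Pin a thread onto one "CPU" so the kernel's balancer will not move it.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                 // needed for pthread_setaffinity_np; g++ defines it by default
#endif
#include <pthread.h>
#include <sched.h>

bool pin_to_cpu(pthread_t t, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // returns 0 on success; non-portable (_np), available on Linux/glibc
    return pthread_setaffinity_np(t, sizeof(set), &set) == 0;
}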


Example





So much for theory; let us verify it with some practical examples. To keep things simple, no full server program is written: there is just one producer thread and several consumer threads. The producer generates tasks that are passed to the consumers through a cond plus a queue. To observe the program's behaviour under different loads, we need to control the task load: after finishing a task, a consumer replies to the producer through another cond plus queue, so the producer knows how many tasks are currently in flight and can pace the generation of new ones. Finally, we evaluate performance by measuring the time it takes to complete a batch of tasks under different conditions.





The key is the processing logic of the task itself. Since we are discussing CPU load, the tasks must be CPU-intensive. The processing time of a single task should not be too short, otherwise scheduling overhead becomes the bottleneck and the CPU-load issue is not reflected; on the other hand it should not be too long either, otherwise kernel load balancing can fix things once it notices, and the benefit of balancing proactively does not show (for example, if a task takes 10 seconds, the few dozen milliseconds the kernel spends rebalancing are harmless).





The code is pasted at the end of the article; the compiled binary is used like this:





$ g++ cond.cpp -pthread -O2


$ ./a.out
Usage: ./a.out -j job_kind=shm|calc [-t thread_count=1] [-o job_load=1] [-c job_count=10] [-a affinity=0] [-l] [-f filename="./test" -n filelength=128M]





The code contains two kinds of task logic: "-j shm" mmaps a file and does some operations on the data read from it (the file and its length are given by the -f and -n parameters); "-j calc" does some arithmetic operations;


The "-T" parameter specifies the number of threads for the worker thread;


"-O" specifies the task load;


"-C" specifies the number of tasks that a single thread handles;


"-a" specifies whether to set sched_affinity and indicates that several "CPUs" are skipped to put a worker thread. For example, "-a 1″ indicates that the worker thread order is fixed to 1, 2, 3 、...... Number "CPU", and "-a 2″ is fixed at 2, 4, 6 、...... Number "CPU", and so on. It should be noted that the adjacent "CPU" number does not mean that the "CPU" is physically contiguous, such as on my test machine, a total of 24 "CPUs", the 0~11 number is the first hyper-threading of each core, 12~23 is the second hyper-thread. This detail needs to be read/proc/cpuinfo to determine.


The "-L" parameter specifies that the rating cond of our enhanced version is enabled, and the word 0~11 is used as the first priority, 12~23 as the second priority (of course, the need to match the "-a" parameter is practical, otherwise it is not sure what the "CPU" these worker all fall on);





First, the problem of too many worker threads (each case below is run 5 times and the minimum is taken).





Case-1: 240 worker threads, task load 24:

$ ./a.out -j calc -t 240 -o 24
Total cost: 23790
$ ./a.out -j shm -t 240 -o 24
Total cost: 16827





Case-2: 24 worker threads, task load 24:

$ ./a.out -j calc -t 24 -o 24
Total cost: 23210
$ ./a.out -j shm -t 24 -o 24
Total cost: 16121





Case-2 is clearly better. And if you watch during the run, you will find that case-1 can only push the CPU to about 2200%, while case-2 reaches almost 2400%.





Starting from case-1, what if kernel load balancing is forbidden? Let us add affinity and try:





Case-3: 240 worker threads, task load 24, with affinity:

$ ./a.out -j calc -t 240 -o 24 -a 1
Total cost: 27170
$ ./a.out -j shm -t 240 -o 24 -a 1
Total cost: 15351





The calc task is in line with expectations: without kernel load balancing, performance drops further.


But the shm task is a surprise: performance actually improved! In fact this task depends heavily on memory as well as CPU, because all tasks work on the same mmap'ed file, so "CPUs" that are close together make better use of the cache. (So in this case, kernel load balancing was actually doing a disservice.)


So, would it be better still to turn the worker count back down to 24?





Case-3':

$ ./a.out -j shm -t 24 -o 24 -a 1
Total cost: 15133





Now for the second question: the impact of uneven worker thread placement.





Case-4: 24 worker threads, task load 12:

$ ./a.out -j calc -t 24 -o 12
Total cost: 14686
$ ./a.out -j shm -t 24 -o 12
Total cost: 13265





Case-5: 24 worker threads, task load 12, with affinity and the layered cond enabled:

$ ./a.out -j calc -t 24 -o 12 -a 1 -l
Total cost: 12206
$ ./a.out -j shm -t 24 -o 12 -a 1 -l
Total cost: 12376





The effect is quite good. What if we change the "-a" argument so that the two hyper-threads of the same core end up with the same priority?





Case-5':

$ ./a.out -j calc -t 24 -o 12 -a 2 -l
Total cost: 23510
$ ./a.out -j shm -t 24 -o 12 -a 2 -l
Total cost: 15063





Because of contention for CPU execution resources, the calc task's performance became very poor, nearly halved. The shm task, thanks to cache reuse, is still in good shape (slightly better than case-3, in fact).





The tasks here are just the two examples, calc and shm; real situations can be much more complex. Although the load-balancing problem certainly exists, will a given task benefit from sharing the cache, or suffer from contending for it? How much does contending for the CPU pipeline cost? These can only be analysed case by case. The kernel's load balancing spreads the load onto "CPUs" as far apart as possible, which in most cases is fine. But we have also seen that cache sharing matters a lot for the shm task; with a more extreme example there would certainly be cases where packing the load onto nearby "CPUs" gives better results.


On the other hand, how much is lost by contending for the CPU pipeline can also be analysed simply. Hyper-threading is essentially two hardware threads sharing one set of execution pipelines. If a single thread's code has serious data dependencies, its instructions can only proceed serially and cannot fill the pipeline, so the spare capacity is left for the second thread to use. Conversely, if one thread can already fill the pipeline, running two threads on the core necessarily gives each only about 50% (as is the case with calc).
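Concretely, the two loop bodies of the calc task (excerpted here from CalcJobRunner::run in the listing below) differ only in their data dependencies:

// The two calc loop bodies, excerpted from the full listing at the end.
long calc(long n)
{
    long v1 = 1, v2 = 1, v3 = 1;
    for (long i = 0; i < n; i++) {
#ifndef SERIAL_CALC
        v1 += v2 + v3;              // three independent dependency chains:
        v2 *= 3;                    // one thread already keeps the pipeline busy,
        v3 *= 5;                    // so a hyper-thread sibling mostly just contends
#else
        v1 += v2 + v3;              // every statement depends on the previous ones:
        v2 = v1 * 5 + v2 * v3;      // the pipeline stalls on dependencies, leaving
        v3 = v1 * 3 + v1 * v2;      // headroom for the second hyper-thread
#endif
    }
    return v1;
}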


To illustrate this, the SERIAL_CALC macro switch selects the strongly dependent variant. Re-running the two calc commands from case-5 and case-5' with it enabled, we see that in this situation packing the load onto "CPUs" that are close together is much less of a problem:





Case-6: with the SERIAL_CALC logic, re-running the calc commands from case-5 and case-5':

$ g++ cond.cpp -pthread -O2 -DSERIAL_CALC
$ ./a.out -j calc -t 24 -o 12 -a 1 -l
Total cost: 51269
$ ./a.out -j calc -t 24 -o 12 -a 2 -l
Total cost: 56753





Finally, the code. Interested readers can try more cases. Have fun!





#include <pthread.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <math.h>
#include <sched.h>
#include <fcntl.h>
#include <linux/futex.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/syscall.h>

#define CPUS 24

/* defined by <linux/futex.h> on recent systems; kept here for older headers */
#ifndef FUTEX_WAIT_BITSET
#define FUTEX_WAIT_BITSET 9
#endif
#ifndef FUTEX_WAKE_BITSET
#define FUTEX_WAKE_BITSET 10
#endif

struct Job
{
    long _input;
    long _output;
};

class JobRunner
{
public:
    virtual void run(Job* job) = 0;
    virtual ~JobRunner() {}
};

/* "shm" job: walk the data of an mmap'ed file (memory/cache sensitive) */
class ShmJobRunner : public JobRunner
{
public:
    ShmJobRunner(const char* filepath, size_t length)
        : _length(length) {
        int fd = open(filepath, O_RDONLY);
        _base = (const long*)mmap(NULL, _length*sizeof(long),
                PROT_READ, MAP_SHARED|MAP_POPULATE, fd, 0);
        if (_base == (const long*)MAP_FAILED) {
            printf("fatal: mmap %s (%lu) failed!\n",
                    filepath, _length*sizeof(long));
            abort();
        }
        close(fd);
    }
    virtual void run(Job* job) {
        long i = job->_input % _length;
        long j = i + _length - 1;
        const int step = 4;
        while (i + step < j) {
            if (_base[i%_length] * _base[j%_length] > 0) {
                j -= step;
            } else {
                i += step;
            }
        }
        job->_output = _base[i%_length];
    }
private:
    const long* _base;
    size_t _length;
};

/* "calc" job: pure arithmetic (CPU pipeline sensitive) */
class CalcJobRunner : public JobRunner
{
public:
    virtual void run(Job* job) {
        long v1 = 1;
        long v2 = 1;
        long v3 = 1;
        for (long i = 0; i < job->_input; i++) {
#ifndef SERIAL_CALC
            /* three independent dependency chains */
            v1 += v2 + v3;
            v2 *= 3;
            v3 *= 5;
#else
            /* strongly serial: every statement depends on the previous ones */
            v1 += v2 + v3;
            v2 = v1 * 5 + v2 * v3;
            v3 = v1 * 3 + v1 * v2;
#endif
        }
        job->_output = v1;
    }
};

class JobRunnerCreator
{
public:
    static JobRunner* create(const char* name,
            const char* filepath, size_t filelength) {
        if (strcmp(name, "shm") == 0) {
            printf("share memory job\n");
            return new ShmJobRunner(filepath, filelength);
        } else if (strcmp(name, "calc") == 0) {
            printf("calculation job\n");
            return new CalcJobRunner();
        }
        printf("unknown job '%s'\n", name);
        return NULL;
    }
};

/* condition-variable interface shared by the normal and the layered version */
class Cond
{
public:
    virtual void lock() = 0;
    virtual void unlock() = 0;
    virtual void wait(size_t layer) = 0;
    virtual void wake() = 0;
    virtual ~Cond() {}
};

/* plain pthread_cond: the kernel decides which waiter gets woken */
class NormalCond : public Cond
{
public:
    NormalCond() {
        pthread_mutex_init(&_mutex, NULL);
        pthread_cond_init(&_cond, NULL);
    }
    ~NormalCond() {
        pthread_mutex_destroy(&_mutex);
        pthread_cond_destroy(&_cond);
    }
    void lock() { pthread_mutex_lock(&_mutex); }
    void unlock() { pthread_mutex_unlock(&_mutex); }
    void wait(size_t) { pthread_cond_wait(&_cond, &_mutex); }
    void wake() { pthread_cond_signal(&_cond); }
private:
    pthread_mutex_t _mutex;
    pthread_cond_t _cond;
};

/* layered cond: waiters register a priority (layer), and the waker uses
 * FUTEX_WAKE_BITSET to wake a waiter of the first non-empty layer */
class LayeredCond : public Cond
{
public:
    LayeredCond(size_t layers = 1) : _value(0), _layers(layers) {
        pthread_mutex_init(&_mutex, NULL);
        if (_layers > sizeof(int)*8) {
            printf("fatal: cannot support %u layers (max %u)\n",
                    (unsigned)_layers, (unsigned)(sizeof(int)*8));
            abort();
        }
        _waiters = new size_t[_layers];
        memset(_waiters, 0, sizeof(size_t)*_layers);
    }
    ~LayeredCond() {
        pthread_mutex_destroy(&_mutex);
        delete [] _waiters;
        _waiters = NULL;
    }
    void lock() {
        pthread_mutex_lock(&_mutex);
    }
    void unlock() {
        pthread_mutex_unlock(&_mutex);
    }
    void wait(size_t layer) {
        if (layer >= _layers) {
            printf("fatal: layer overflow (%u/%u)\n",
                    (unsigned)layer, (unsigned)_layers);
            abort();
        }
        _waiters[layer]++;
        while (_value == 0) {
            int value = _value;
            unlock();
            /* sleep with only this layer's bit set in the futex bitset */
            syscall(__NR_futex, &_value, FUTEX_WAIT_BITSET, value,
                    NULL, NULL, layer2mask(layer));
            lock();
        }
        _waiters[layer]--;
        _value--;
    }
    void wake() {
        int mask = ~0;
        lock();
        /* pick the first (highest-priority) layer that has waiters */
        for (size_t i = 0; i < _layers; i++) {
            if (_waiters[i] > 0) {
                mask = layer2mask(i);
                break;
            }
        }
        _value++;
        unlock();
        syscall(__NR_futex, &_value, FUTEX_WAKE_BITSET, 1,
                NULL, NULL, mask);
    }
private:
    int layer2mask(size_t layer) {
        return 1 << layer;
    }
private:
    pthread_mutex_t _mutex;
    int _value;
    size_t* _waiters;
    size_t _layers;
};

/* LIFO task queue protected by one of the Cond implementations above */
template<class T>
class Stack
{
public:
    Stack(size_t size, size_t cond_layers = 0) : _size(size), _sp(0) {
        _buf = new T*[_size];
        _cond = (cond_layers > 0) ?
            (Cond*)new LayeredCond(cond_layers) : (Cond*)new NormalCond();
    }
    ~Stack() {
        delete [] _buf;
        delete _cond;
    }
    T* pop(size_t layer = 0) {
        T* ret = NULL;
        _cond->lock();
        do {
            if (_sp > 0) {
                ret = _buf[--_sp];
            } else {
                _cond->wait(layer);
            }
        } while (ret == NULL);
        _cond->unlock();
        return ret;
    }
    void push(T* obj) {
        _cond->lock();
        if (_sp >= _size) {
            printf("fatal: stack overflow\n");
            abort();
        }
        _buf[_sp++] = obj;
        _cond->unlock();
        _cond->wake();
    }
private:
    const size_t _size;
    size_t _sp;
    T** _buf;
    Cond* _cond;
};

inline struct timeval cost_begin()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv;
}

inline long cost_end(struct timeval &tv)
{
    struct timeval tv2;
    gettimeofday(&tv2, NULL);
    tv2.tv_sec -= tv.tv_sec;
    tv2.tv_usec -= tv.tv_usec;
    return tv2.tv_sec*1000 + tv2.tv_usec/1000;
}

struct ThreadParam
{
    size_t layer;
    Stack<Job>* inputQ;
    Stack<Job>* outputQ;
    JobRunner* runner;
};

/* worker thread: take a job, run it, report completion back to the producer */
void* thread_func(void* data)
{
    size_t layer = ((ThreadParam*)data)->layer;
    Stack<Job>* inputQ = ((ThreadParam*)data)->inputQ;
    Stack<Job>* outputQ = ((ThreadParam*)data)->outputQ;
    JobRunner* runner = ((ThreadParam*)data)->runner;

    while (1) {
        Job* job = inputQ->pop(layer);
        runner->run(job);
        outputQ->push(job);
    }
    return NULL;
}

/* pin a thread onto one "CPU" so kernel load balancing will not move it */
void force_cpu(pthread_t t, int n)
{
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    CPU_SET(n, &cpus);
    if (pthread_setaffinity_np(t, sizeof(cpus), &cpus) != 0) {
        printf("fatal: force cpu %d failed: %s\n", n, strerror(errno));
        abort();
    }
}

void usage(const char* bin)
{
    printf("Usage: %s -j job_kind=shm|calc"
            " [-t thread_count=1] [-o job_load=1] [-c job_count=10]"
            " [-a affinity=0] [-l]"
            " [-f filename=\"./test\" -n filelength=128M]\n", bin);
    abort();
}

int main(int argc, char* const* argv)
{
    int thread_count = 1;
    int job_load = 1;
    int job_count = 10;
    int affinity = 0;
    int layers = 0;
    char job_kind[16] = "";
    char filepath[1024] = "./test";
    size_t length = 128*1024*1024;

    for (int i = EOF;
            (i = getopt(argc, argv, "t:o:c:a:j:lf:n:")) != EOF; ) {
        switch (i) {
        case 't': thread_count = atoi(optarg); break;
        case 'o': job_load = atoi(optarg); break;
        case 'c': job_count = atoi(optarg); break;
        case 'a': affinity = atoi(optarg); break;
        case 'l': layers = 2; break;
        case 'j': strncpy(job_kind, optarg, sizeof(job_kind)-1); break;
        case 'f': strncpy(filepath, optarg, sizeof(filepath)-1); break;
        case 'n': length = atoi(optarg); break;
        default: usage(argv[0]); break;
        }
    }
    JobRunner* runner = JobRunnerCreator::create(job_kind, filepath, length);
    if (!runner) {
        usage(argv[0]);
    }

    srand(0);
    Job jobs[job_load];

#ifdef TEST_LOAD
    /* measure single-task cost without any threading */
    for (int i = 0; i < job_load; i++) {
        jobs[i]._input = rand();
        struct timeval tv = cost_begin();
        runner->run(&jobs[i]);
        long cost = cost_end(tv);
        printf("job[%d] (%ld)=(%ld) costs: %ld\n",
                i, jobs[i]._input, jobs[i]._output, cost);
    }
    delete runner;
    return 0;
#endif

    printf("use layer %d\n", layers);
    Stack<Job> inputQ(job_load, layers);
    Stack<Job> outputQ(job_load, layers);

    pthread_t t;
    ThreadParam param[thread_count];

    printf("thread init: ");
    for (int i = 0; i < thread_count; i++) {
        /* -a N places worker i on "CPU" (i/N + (i%N)*CPUS/2) % CPUS; -a 0 lets the kernel decide */
        int cpu = affinity ? (i/affinity + i%affinity*CPUS/2)%CPUS : -1;
        /* with -l, workers on the upper half of the "CPU" numbers get layer 1 (lower priority) */
        size_t layer = !!(layers && i % CPUS >= CPUS/2);
        param[i].inputQ = &inputQ;
        param[i].outputQ = &outputQ;
        param[i].runner = runner;
        param[i].layer = layer;
        pthread_create(&t, NULL, thread_func, (void*)&param[i]);
        if (cpu >= 0) {
            printf("%d(%d|%d), ", i, cpu, (int)layer);
            force_cpu(t, cpu);
        } else {
            printf("%d(*|%d), ", i, (int)layer);
        }
        usleep(1000);
    }
    printf("\n");

    struct timeval tv = cost_begin();
    /* prime the pipeline with job_load tasks ... */
    for (int i = 0; i < job_load; i++) {
        jobs[i]._input = rand();
        inputQ.push(&jobs[i]);
    }
    /* ... then recycle each finished task to keep the in-flight load constant */
    for (int i = 0; i < job_load*job_count; i++) {
        Job* job = outputQ.pop();
        job->_input = rand();
        inputQ.push(job);
    }
    /* drain the remaining tasks */
    for (int i = 0; i < job_load; i++) {
        outputQ.pop();
    }
    long cost = cost_end(tv);
    printf("Total cost: %ld\n", cost);

    delete runner;
    return 0;
}




