speedup-example.cpp
/* Copyright (c) 2013, Regents of the Columbia University
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
 *
 * 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other
 * materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
 * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
 * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
 * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
 * IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <assert.h>
//#include "tern/user.h"

#define N 4
//#define M 30000
#define M 30000*10

//int nwait = 0;
volatile long long sum;
volatile long long sum_2;
long loops = 6e3;
pthread_mutex_t mutex;
pthread_cond_t cond;
pthread_barrier_t bar;

// Pin the calling thread to the given core.
void set_affinity(int core_id) {
  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(core_id, &cpuset);
  assert(pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset) == 0);
}

void* thread_func(void *arg) {
  set_affinity((int)(long)arg);
  for (int j = 0; j < M; j++) {
    //pthread_mutex_lock(&mutex);
    //nwait++;
    for (long i = 0; i < loops; i++) // This is the key of speedup for parrot: the mutex needs to be a little bit congested.
    {
      sum += i;
      sum_2 += i;
    }
    //pthread_cond_wait(&cond, &mutex);
    //printf("being in the lock is: %lu\n", pthread_self());
    //pthread_mutex_unlock(&mutex);
    //soba_wait(0);
    //pthread_barrier_wait(&bar);
    for (long i = 0; i < loops; i++) {
      sum += i*i*i*i*i*i;
    }
    //fprintf(stderr, "compute thread %u %d\n", (unsigned)thread, sched_getcpu());
  }
  return NULL; // a void* function must return a value
}

int main(int argc, char *argv[]) {
  set_affinity(23);
  //soba_init(0, N, 20);
  pthread_t th[N];
  int ret;
  //pthread_cond_init(&cond, NULL);
  //pthread_barrier_init(&bar, NULL, N);
  for(unsigned i=0; i<N; ++i) {
    ret = pthread_create(&th[i], NULL, thread_func, (void*)(long)i);
    assert(!ret && "pthread_create() failed!");
  }
  /*for (int j = 0; j < M; j++) {
    while (nwait < N) {
      sched_yield();
    }
    pthread_mutex_lock(&mutex);
    nwait = 0;
    //fprintf(stderr, "broadcast %u %d\n", (unsigned)pthread_self(), sched_getcpu());
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&mutex);
  }
  */
  for(unsigned i=0; i<N; ++i)
    pthread_join(th[i], NULL);
  // test the validity of results
  printf("sum_2 = %lld\n", sum_2);
  //printf("sum = %lld\n", sum);
  exit(0);
}
The program above is the route of our entire journey, and the function below is where the journey starts.
xtern/lib/runtime/record-scheduler.cpp
void RRScheduler::getTurn()
{
  int tid = self();
  assert(tid>=0 && tid < Scheduler::nthread);
  waits[tid].wait();
  dprintf("RRScheduler: %d gets turn\n", self());
  SELFCHECK;
}
self() returns the tid of the current thread. It comes from struct TidMap in xtern/include/tern/runtime/scheduler.h:
/// tern tid for current thread
static int self() { return self_tid; }
RRScheduler::getTurn is the only entry point to RRScheduler::wait_t::wait.
getTurn makes a thread wait until the other threads hand it the turn; this is how execution is serialized.
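To make "waiting for the turn" concrete, here is a minimal sketch of round-robin turn passing with one semaphore per thread, which is what the call waits[tid].wait() in getTurn suggests. It is an illustration under assumptions, not the xtern implementation: MiniRR, putTurn(int) and run_round_robin are hypothetical names, each waits[tid] is assumed to behave like a counting semaphore, and the fixed ring of thread ids is a simplification.

```cpp
#include <semaphore.h>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical miniature of a round-robin turn scheduler (not the xtern code).
struct MiniRR {
  std::vector<sem_t> waits;  // one semaphore per logical thread id

  explicit MiniRR(int n) : waits(n) {
    for (int i = 0; i < n; ++i)
      sem_init(&waits[i], 0, i == 0 ? 1 : 0);  // thread 0 holds the first turn
  }
  ~MiniRR() {
    for (auto &s : waits) sem_destroy(&s);
  }

  // Block until the turn is handed to thread `tid`.
  void getTurn(int tid) { sem_wait(&waits[tid]); }

  // Hand the turn to the next thread id in the ring.
  void putTurn(int tid) { sem_post(&waits[(tid + 1) % waits.size()]); }
};

// Each of n threads records its id `rounds` times, taking the turn each time.
std::vector<int> run_round_robin(int n, int rounds) {
  MiniRR sched(n);
  std::vector<int> order;
  std::vector<std::thread> threads;
  for (int tid = 0; tid < n; ++tid) {
    threads.emplace_back([&, tid] {
      for (int r = 0; r < rounds; ++r) {
        sched.getTurn(tid);
        order.push_back(tid);  // safe: only the turn holder runs this line
        sched.putTurn(tid);
      }
    });
  }
  for (auto &t : threads) t.join();
  return order;
}
```

Calling run_round_robin(4, 3) returns 0, 1, 2, 3 repeated three times regardless of how the operating system schedules the threads, because a thread can only proceed once its predecessor has posted its semaphore.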
In fact, getTurn first appears in struct Serializer, so the inheritance relationships among several important classes come into play here.
Next, we need to understand where getTurn is called and what role it plays in RR scheduling.
getTurn() is called 34 times in total, from five direct call sites:
- xtern/lib/runtime/record-runtime.cpp: Line 337: SCHED_TIMER_START (30 calls in total)
- xtern/lib/runtime/record-runtime.cpp: Line 272: RecorderRT<_S>::idle_sleep (1 call in total)
- xtern/lib/runtime/record-scheduler.cpp: Line 372: RRScheduler::wait (1 call in total)
- xtern/lib/runtime/record-runtime.cpp: Line 376: RecorderRT<_S>::printStat (1 call in total)
- xtern/lib/runtime/record-runtime.cpp: Line 286: RecorderRT<_S>::idle_cond_wait (1 call in total)
The functions that invoke SCHED_TIMER_START (all in xtern/lib/runtime/record-runtime.cpp) are:
- RecorderRT<_S>::threadBegin(void)
- RecorderRT<_S>::threadEnd(unsigned ins)
- RecorderRT<_S>::pthreadCreate
- RecorderRT<_S>::pthreadJoin
- RecorderRT<_S>::pthreadMutexInit
- RecorderRT<_S>::pthreadMutexDestroy
- RecorderRT<_S>::pthreadMutexLock
- RecorderRT<_S>::__pthread_rwlock_rdlock
- RecorderRT<_S>::__pthread_rwlock_wrlock
- RecorderRT<_S>::__pthread_rwlock_tryrdlock
- RecorderRT<_S>::__pthread_rwlock_trywrlock
- RecorderRT<_S>::__pthread_rwlock_unlock
- RecorderRT<_S>::__pthread_rwlock_destroy
- RecorderRT<_S>::__pthread_rwlock_init
- RecorderRT<_S>::pthreadMutexTryLock
- RecorderRT<_S>::pthreadMutexTimedLock
- RecorderRT<_S>::pthreadMutexUnlock
- RecorderRT<_S>::pthreadBarrierInit
- RecorderRT<_S>::pthreadBarrierWait
- RecorderRT<_S>::pthreadBarrierDestroy
- RecorderRT<_S>::pthreadCondWait
- RecorderRT<_S>::pthreadCondTimedWait
- RecorderRT<_S>::pthreadCondSignal
- RecorderRT<_S>::pthreadCondBroadcast
- RecorderRT<_S>::semWait
- RecorderRT<_S>::semTryWait
- RecorderRT<_S>::semTimedWait
- RecorderRT<_S>::semPost
- RecorderRT<_S>::semInit
- RecorderRT<_S>::lineupInit
- RecorderRT<_S>::lineupDestroy
- RecorderRT<_S>::lineupStart
- RecorderRT<_S>::lineupEnd
- RecorderRT<_S>::nondetStart()
- RecorderRT<_S>::symbolic
- RecorderRT<RecordSerializer>::pthreadBarrierWait
- RecorderRT<RecordSerializer>::pthreadCondWait
- RecorderRT<RecordSerializer>::pthreadCondTimedWait
- RecorderRT<RecordSerializer>::pthreadCondSignal
- RecorderRT<RecordSerializer>::pthreadCondBroadcast
- RecorderRT<_S>::__fork
- RecorderRT<_S>::__execv
- RecorderRT<_S>::schedYield
- RecorderRT<_S>::__sleep
- RecorderRT<_S>::__usleep
- RecorderRT<_S>::__nanosleep
Now we need to determine which of these functions are actually called. Testing them one by one would be tedious and error-prone, so I plan to follow the entire call chain from the point where the program starts running.
Round One
First, our voyage (think of it as a treasure hunt ^-^) sets out from Reverse Mountain in the East Blue: __libc_start_main (xtern/dync_hook/spec_hooks.cpp: Line 73). Following our "Eternal Pose", printf, we reach the island __tern_prog_begin (xtern/lib/runtime/helper.cpp: Line 123) on the Grand Line. Let's look for what we want there. Sure enough, there is a treasure: tern::InstallRuntime(). Along with it we find a crucial partner for our voyage: Runtime::the = new RecorderRT<RRScheduler> (xtern/lib/runtime/record-runtime.cpp: Line 183). This partner is a swordsman (RecorderRT) who studied under Runtime of the East Blue and RRScheduler of the Seven Warlords of the Sea. Next, guided by Runtime::the, we come to the territory of his teacher RRScheduler to ask him for something. RRScheduler does not have it himself, but with his help we pin down the treasure's location through TidMap, which belongs to the master (Serializer) of his master (Scheduler). It is hidden in init(), called from the constructor TidMap(pthread_t main_th) { init(main_th); } in xtern/include/tern/runtime/scheduler.h. Sure enough, we finally find the treasure create() inside init().
The treasure create() tells us that the treasure tern::InstallRuntime() we found earlier has to be paired with another treasure, which should lie near the place where we found tern::InstallRuntime(). There is nothing for it but to return to the island __tern_prog_begin (xtern/lib/runtime/helper.cpp: Line 123). Between the markers printf("liuqiushan8019\n") and printf("liuqiushan8020\n"), we pin the location down to tern_thread_begin(); // main thread begins. At this point the system makes a big move: Runtime::the->threadBegin(). After defeating this enemy, we finally win the treasure SCHED_TIMER_START.
Round Two
Last time, we obtained a treasure named create(), and it turns out to have a brother. We followed this route: __libc_start_main -> __tern_prog_begin -> tern_pthread_create. Inside tern_pthread_create there is an island named Runtime::the->pthreadCreate, and on it sits another create(). For now we do not know what it does, only that it is labeled idle. This makes me want to explore create()'s family, to see how many siblings it has and what mission each create() carries.

What we need to understand now is the execution order of the four child threads. Does one thread finish its whole task before the next thread runs, or do the threads cycle continuously in A-B-C-D order until the end? With LD_PRELOAD in effect (that is, with Parrot), four child threads, M = 30000, and loops = 6e3, we see a curious phenomenon. When pthread_mutex_lock(&mutex) and pthread_mutex_unlock(&mutex) are present, the four child threads run as M rounds of A-B-C-D. When there is no mutex, thread A runs all of its own work first, then B starts, then C and D; and even without the mutex, the result is correct. Let's ask the great master VTune to help us analyze the difference between these two orders and what causes it. This is the first strong enemy of our voyage. As captain, I hereby give the order: all hands, prepare for battle!
With mutex:
Without mutex:
Next, let's look at how these four child threads are created, from the source code's point of view. We came across tern_pthread_create while looking for the create() labeled idle, and I guess it is the entry point for creating both the idle thread and the four child threads. What is strange about tern_pthread_create? Within Parrot there are only two call sites of tern_pthread_create, and neither of them looks like the one that creates our four child threads. One is in xtern/lib/runtime/record-runtime.cpp: Line 2183, which is never executed during this run; the other is in xtern/lib/runtime/helper.cpp: Line 152, which creates the first idle thread. We simply note the latter, since it does not affect the running of speedup-example. So we make a bold guess based on how LD_PRELOAD works: speedup-example.cpp calls pthread_create, and although Parrot contains a function with the same name, pthread_create, in xtern/eval/rand-intercept/rand-intercept.c: Line 251, that one is not used, so we assume that tern_pthread_create replaces the pthread_create called in speedup-example.cpp. With this, we have a rough picture of what Parrot does from the start of the program to the creation of the four child threads:
(A) From the start of the program until just before the four child threads are created.
In short, Runtime::the is initialized, the main thread begins (tern_thread_begin), and an idle thread is created.
(B) tern_pthread_create is used to create the four child threads.
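The LD_PRELOAD guess above relies on symbol interposition: a preloaded library defines a function with the same name as a library function, so the program's calls reach the wrapper first, and the wrapper forwards to the real function. The sketch below shows this generic pattern with dlsym(RTLD_NEXT, ...). It is not Parrot's hook code: hooked_pthread_create, created_threads and spawn_through_hook are hypothetical names, and the wrapper is deliberately not named pthread_create so that the sketch can be compiled into an ordinary program.

```cpp
#include <dlfcn.h>
#include <pthread.h>
#include <cassert>

// Pointer type matching the real pthread_create, used for the dlsym() result.
using create_fn = int (*)(pthread_t *, const pthread_attr_t *,
                          void *(*)(void *), void *);

// Hypothetical bookkeeping performed by the wrapper.
static int created_threads = 0;

// In a preloaded library this function would be exported as `pthread_create`.
// It carries a different, hypothetical name here so it can live in a normal
// executable without shadowing the real symbol.
extern "C" int hooked_pthread_create(pthread_t *th, const pthread_attr_t *attr,
                                     void *(*fn)(void *), void *arg) {
  // Find the next definition of pthread_create in library search order,
  // i.e. the one provided by the C library.
  static create_fn real_create =
      reinterpret_cast<create_fn>(dlsym(RTLD_NEXT, "pthread_create"));
  assert(real_create != nullptr);
  ++created_threads;  // a runtime would do its own scheduling work here
  return real_create(th, attr, fn, arg);
}

// Thread body for the demonstration: store a marker value.
static void *child(void *arg) {
  *static_cast<int *>(arg) = 42;
  return nullptr;
}

// Create one thread through the wrapper and return what the child stored.
int spawn_through_hook() {
  int value = 0;
  pthread_t th;
  if (hooked_pthread_create(&th, nullptr, child, &value) != 0) return -1;
  pthread_join(th, nullptr);
  return value;
}
```

On older glibc versions this needs to be linked with -ldl and -pthread. The design point is that the application is not recompiled: only the symbol lookup order changes, which is why speedup-example.cpp can keep calling pthread_create while a different implementation runs.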
In the third round, let's explore what tern_pthread_create actually does.
Round Three
Last time, we met our first strong opponent, "the two orders". In the first clash he made a big move: "use tern_pthread_create to create the child threads". Next, let's see how we can break it.
This is a breakdown diagram of that big move. While analyzing RecorderRT<_S>::pthreadCreate, we come across a helpful comment:
/// The pthread_create wrapper solves three problems.
///
/// Problem 1. We must assign a logical tern tid to the new thread while
/// holding turn, or multiple newly created thread could get their logical
/// tids nondeterministically. To do so, we assign a logical tid to a new
/// thread in the thread that creates the new thread.
///
/// If we were to assign this logical id in the new thread itself, we may
/// run into nondeterministic runs, as illustrated by the following
/// example
///
///       t0        t1           t2            t3
///    getTurn();
///    create t2
///    putTurn();
///               getTurn();
///               create t3
///               putTurn();
///                                         getTurn();
///                                         get turn tid 2
///                                         putTurn();
///                            getTurn();
///                            get turn tid 3
///                            putTurn();
///
/// in a different run, t1 may run first and get turn tid 2.
///
/// Problem 2. When a child thread is created, the child thread may run
/// into a getTurn() before the parent thread has assigned a logical tid
/// to the child thread. This causes getTurn to refer to self_tid, which
/// is undefined. To solve this problem, we use @thread_begin_sem to
/// create a thread in suspended mode, until the parent thread has
/// assigned a logical tid for the child thread.
///
/// Problem 3. We can use one semaphore to create a child thread
/// suspended. However, when there are two pthread_create() calls, we may
/// still run into cases where a child thread tries to get its
/// thread-local tid but gets -1. consider
///
///       t0        t1           t2            t3
///    getTurn();
///    create t2
///    sem_post(&thread_begin_sem)
///    putTurn();
///               getTurn();
///               create t3
///                                         sem_wait(&thread_begin_sem);
///                                         self_tid = TidMap[pthread_self()]
///               sem_post(&thread_begin_sem)
///               putTurn();
///                                         getTurn();
///                                         get turn tid 2
///                                         putTurn();
///                            sem_wait(&thread_begin_sem);
///                            self_tid = TidMap[pthread_self()]
///                            getTurn();
///                            get turn tid 3
///                            putTurn();
///
/// The crux of the problem is that multiple sem_post can pair up with
/// multiple sem_down in different ways. We solve this problem using
/// another semaphore, thread_begin_done_sem.
///
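The handshake described in Problems 2 and 3 can be sketched as follows. This is a simplified illustration, not the xtern code: a single parent creates the children one after another, there is no turn scheduler, and tid_map, create_children and seen are hypothetical stand-ins. Only the names thread_begin_sem, thread_begin_done_sem and self_tid are taken from the comment above.

```cpp
#include <semaphore.h>
#include <cassert>
#include <map>
#include <thread>
#include <vector>

static sem_t thread_begin_sem;       // parent -> child: "your tid is assigned"
static sem_t thread_begin_done_sem;  // child -> parent: "I have read my tid"
static std::map<std::thread::id, int> tid_map;  // hypothetical stand-in for TidMap
static thread_local int self_tid = -1;          // -1 means "not assigned yet"

// Create n children; child i reports the logical tid it observed in seen[i].
std::vector<int> create_children(int n) {
  sem_init(&thread_begin_sem, 0, 0);
  sem_init(&thread_begin_done_sem, 0, 0);
  tid_map.clear();

  std::vector<int> seen(n, -1);
  std::vector<std::thread> children;
  for (int i = 0; i < n; ++i) {
    children.emplace_back([&seen, i] {
      // The child starts "suspended" until the parent has assigned its tid.
      sem_wait(&thread_begin_sem);
      self_tid = tid_map.at(std::this_thread::get_id());
      // Tell the parent the tid has been read, so the next sem_post cannot
      // pair up with the wrong child.
      sem_post(&thread_begin_done_sem);
      seen[i] = self_tid;
    });
    // The parent, not the child, assigns the logical tid (main thread is 0).
    tid_map[children.back().get_id()] = i + 1;
    sem_post(&thread_begin_sem);
    sem_wait(&thread_begin_done_sem);  // do not create the next child yet
  }
  for (auto &t : children) t.join();

  sem_destroy(&thread_begin_sem);
  sem_destroy(&thread_begin_done_sem);
  return seen;
}
```

Because the parent blocks on thread_begin_done_sem before creating the next child, each sem_post can only be consumed by the child it was meant for, and tids follow creation order: create_children(4) returns 1, 2, 3, 4.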
Several key techniques of this swordsmanship leap out at once: the pthread_create wrapper, assigning a logical tid to a new thread, holding the turn, getTurn...
It looks like a fine sword style.
With only the breakdown diagram and a few pages of the sword manual, it is still hard to understand. Let's look at the intelligence sent back by our partner, the chef top:
Scenario 1: ./speedup-example (with mutex, four child threads, M = 30000*10, without Parrot)
Scenario 2: ./run_example_tern (with mutex, four child threads, M = 30000*10, with Parrot)
Scenario 3: ./run_example_tern (no mutex, four child threads, M = 30000*10, with Parrot)
Through top, we found that:
- Compared with scenario 1, the id (idle) column (green part) is 0.0%, whereas in scenario 1 the id column takes up a large share.
- Compared with scenario 2, the sy (system) column (orange part) is 0.0%, whereas in scenario 2 the sy column takes up almost half of the total.
- In scenario 1 and scenario 2, the four cores (cpu0, cpu1, cpu2, cpu3) are busy one after another (cpu0 -> cpu1 -> cpu2 -> cpu3).
This prompts us to explore the order in which the four child threads are created in scenario 2 and scenario 3. A first guess: in scenario 2 the four child threads are created almost at the same time, whereas in scenario 3 two child threads are created together at the start, and then a new child thread is created each time one finishes its own task.
Scenario 4: ./run_example_tern (with mutex, eight child threads, M = 30000*10, with Parrot)
Scenario 5: ./run_example_tern (no mutex, eight child threads, M = 30000*10, with Parrot)
One piece for parrot source code analysis