8.2 Programming Models and Multithreading
Parallelism is the most important concept in designing a multi-threaded application and achieving optimal performance scaling with multiple processors. An optimized multi-threaded application is characterized by large degrees of concurrency or minimal dependencies in the following areas:
● Workload
● Thread Interaction
● Hardware Utilization
The key to maximizing workload concurrency is to identify multiple tasks with minimal mutual dependencies within an application and to create separate threads for parallel execution of those tasks.
Concurrent execution of independent threads is the essence of deploying a multi-threaded application on a multi-processor system. Managing the interaction between threads to minimize the cost of thread synchronization is also critical to achieving optimal performance scaling with multiple processors.
Efficient use of hardware resources between concurrent threads requires optimization techniques in specific areas to prevent contention for those resources. Coding techniques for optimizing thread synchronization and managing other hardware resources are discussed in the subsequent sections.
The parallel programming models are discussed next.
8.2.1 Parallel Programming Model
There are two common programming models for transforming independent task requirements into application threads:
● Domain Decomposition
● Functional Decomposition
8.2.1.1 Domain Decomposition
Usually, large compute-intensive tasks use data sets that can be divided into a number of small subsets, each of which has a large degree of computational independence. Examples include:
● Computation of a discrete cosine transform (DCT) on two-dimensional data: the two-dimensional data is divided into several subsets, and a thread is created for each subset to compute the transform.
● Matrix multiplication: here, a thread can be created to handle the multiplication of half of the matrix with the multiplier matrix.
Domain decomposition is a programming model based on creating identical or similar threads to process smaller pieces of data independently. [Translator's note: "identical" here means that the threads all execute the same routine, i.e. the same function.] This model can take advantage of the duplicated execution resources present in a traditional multi-processor system. It can also take advantage of the shared execution resources between the two logical processors in HT technology, because a data-domain thread typically consumes only a fraction of the available on-chip execution resources.
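To make the domain-decomposition idea concrete, here is a minimal sketch (not part of the original manual) using POSIX threads: two identical worker threads each process an independent subset of an array. The identifiers worker, work_t, DATA_SIZE, and NUM_THREADS are invented for this illustration.

#include <pthread.h>
#include <stdio.h>

#define DATA_SIZE   1000000
#define NUM_THREADS 2

static double data[DATA_SIZE];

typedef struct {
    int    start;     /* first index of this thread's subset        */
    int    end;       /* one past the last index of this subset     */
    double partial;   /* partial result computed by this thread     */
} work_t;

/* Identical routine executed by every thread; each instance works on
   an independent subset of the data (domain decomposition). */
static void *worker(void *arg)
{
    work_t *w = (work_t *)arg;
    double sum = 0.0;
    for (int i = w->start; i < w->end; i++)
        sum += data[i];
    w->partial = sum;
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_THREADS];
    work_t    work[NUM_THREADS];

    for (int i = 0; i < DATA_SIZE; i++)
        data[i] = 1.0;

    /* Split the data domain into equal, independent subsets. */
    for (int t = 0; t < NUM_THREADS; t++) {
        work[t].start = t * (DATA_SIZE / NUM_THREADS);
        work[t].end   = (t + 1) * (DATA_SIZE / NUM_THREADS);
        pthread_create(&tid[t], NULL, worker, &work[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(tid[t], NULL);
        total += work[t].partial;
    }
    printf("total = %f\n", total);
    return 0;
}

Because the two threads execute the same function on disjoint data, they need no synchronization until the partial results are combined after the joins.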
8.2.1.2 Functional Decomposition
Applications usually process many different tasks, using different functions and many unrelated data sets. For example, a video codec requires several different processing functions, including DCT, motion estimation, and color conversion. Using a functional threading model, an application can program separate threads for motion estimation, color conversion, and other functional tasks.
Functional decomposition achieves flexible thread-level parallelism if it is less dependent on the duplication of hardware resources. For example, a thread executing a sorting algorithm and a thread executing a matrix multiplication routine are unlikely to require the same execution units at the same time. A design that recognizes this can take advantage of traditional multi-processor systems as well as multi-processor systems using processors that support HT technology.
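As a hedged illustration of functional decomposition (again not from the original text), the sketch below creates one POSIX thread that sorts an integer array and another that multiplies two matrices; the two tasks use different functions and unrelated data sets. All identifiers (sort_task, matmul_task, N) are invented for this example.

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>

#define N 256

static int    keys[N];                    /* data set for the sorting task       */
static double a[N][N], b[N][N], c[N][N];  /* data sets for the matrix multiply   */

static int cmp_int(const void *x, const void *y)
{
    return (*(const int *)x - *(const int *)y);
}

/* Functional task 1: sorting, dominated by integer and branch work. */
static void *sort_task(void *arg)
{
    (void)arg;
    qsort(keys, N, sizeof(int), cmp_int);
    return NULL;
}

/* Functional task 2: matrix multiply, dominated by floating-point work. */
static void *matmul_task(void *arg)
{
    (void)arg;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    for (int i = 0; i < N; i++)
        keys[i] = rand();

    /* Two different functions run concurrently on unrelated data sets. */
    pthread_create(&t1, NULL, sort_task, NULL);
    pthread_create(&t2, NULL, matmul_task, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("smallest key = %d, c[0][0] = %f\n", keys[0], c[0][0]);
    return 0;
}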
8.2.2 Specialized Programming Models
The Intel Core Duo processor and processors based on Intel Core microarchitecture offer an L2 cache shared by the two processor cores in the same physical package. This provides opportunities for two application threads to access some application data while minimizing the overhead of bus traffic.
Multi-threaded applications may need to employ specialized programming models to take advantage of this type of hardware feature. One such scenario is the producer-consumer model. In this scenario, one thread writes data into some destination (ideally residing in the L2 cache), and another thread executing on the other core of the same physical package subsequently reads the data produced by the first thread.
The basic approach for implementing a producer-consumer model is to create two threads: one thread is the producer and the other is the consumer. Typically, the producer and consumer take turns working on a buffer and inform each other when they are ready to exchange buffers. In a producer-consumer model, there is some thread synchronization overhead when buffers are exchanged between the producer and the consumer. To achieve optimal scaling with a given number of cores, the synchronization overhead must be kept low. This can be done by ensuring that the producer and consumer threads have comparable time constants for completing each incremental task before exchanging buffers.
Example 8-1 illustrates the coding structure of a sequence of task units executed in a single thread, where each task unit (either a producer or a consumer) executes serially (shown in Figure 8-2). In the equivalent multi-threaded scenario, each producer-consumer pair is wrapped in a thread function, and the two threads can be scheduled simultaneously on the available processor resources.
Example 8-1. Serial Execution of Producer and Consumer Work Items
for (i = 0; i < number_of_iterations; i++) {
    producer(i, buff);   // pass the buffer index and buffer address
    consumer(i, buff);
}
8.2.2.1 Producer-Consumer Thread Model
Figure 8-3 illustrates the basic scheme of interaction between a pair of producer and consumer threads. The horizontal direction represents time. Each block represents a task unit, processing the buffer assigned to a thread.
The gap between tasks represents synchronization overhead. The decimal number in parentheses represents a buffer index. On an Intel Core Duo processor, the producer thread can store data in the L2 cache, allowing the consumer thread to continue its work with minimal bus traffic.
The basic structure for implementing the producer and consumer thread functions, with synchronization to communicate the buffer index, is shown in Example 8-2.
Example 8-2. Basic Structure of Implementing Producer and Consumer Threads
// (a) Basic structure of a producer thread function
void producer_thread()
{
    int iter_num = workamount - 1;     // make a local copy
    int mode1 = 1;                     // track usage of the two buffers via 0 and 1
    produce(buffs[0], count);          // placeholder function
    while (iter_num--) {
        signal(&signal1, 1);           // tell the other thread it can start work
        produce(buffs[mode1], count);  // placeholder function
        waitforsignal(&end);
        mode1 = 1 - mode1;             // switch to the other buffer
    }
}

// (b) Basic structure of a consumer thread
void consume_thread()
{
    int mode2 = 0;                     // first iteration starts with buffer 0, then alternate
    int iter_num = workamount - 1;
    while (iter_num--) {
        waitforsignal(&signal1);
        consume(buffs[mode2], count);  // placeholder function
        signal(&end, 1);
        mode2 = 1 - mode2;
    }
    consume(buffs[mode2], count);
}
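The signal and waitforsignal calls above are placeholders; the text does not define them. One possible minimal sketch, using POSIX mutexes and condition variables (an assumption of this translation, not the original code; the type name signal_t and macro SIGNAL_INIT are invented here), is shown below.

#include <pthread.h>

/* Hypothetical signaling object backing the signal()/waitforsignal()
   placeholders used in Example 8-2 and Example 8-3. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             value;
} signal_t;

#define SIGNAL_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

/* Set the signal and wake the thread waiting on it.
   Named to match the placeholder in Example 8-2; rename it if it
   clashes with the C library's signal() in your environment.        */
static void signal(signal_t *s, int value)
{
    pthread_mutex_lock(&s->lock);
    s->value = value;
    pthread_cond_signal(&s->cond);
    pthread_mutex_unlock(&s->lock);
}

/* Block until the signal is set, then clear it for the next round. */
static void waitforsignal(signal_t *s)
{
    pthread_mutex_lock(&s->lock);
    while (s->value == 0)
        pthread_cond_wait(&s->cond, &s->lock);
    s->value = 0;
    pthread_mutex_unlock(&s->lock);
}

Under this sketch, signal1 and end in Example 8-2 would be declared as signal_t signal1 = SIGNAL_INIT, end = SIGNAL_INIT; and the sigp/sigc arrays in Example 8-3 would hold pointers to such objects.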
It is also possible to structure the producer-consumer model in a staggered manner, which minimizes bus traffic and is also effective on multi-core processors that do not share an L2 cache.
In this staggered version of the producer-consumer model, each scheduling quantum of an application thread comprises a producer task and a consumer task. Two identical threads are created to execute concurrently. During the scheduling quantum of a thread, the producer task starts first and the consumer task follows the completion of the producer task; both tasks work on the same buffer. As each task completes, one thread signals the other thread, notifying its corresponding task to use its designated buffer. Thus, the producer and consumer tasks execute in parallel in the two threads. As long as the data generated by the producer resides in either the L1 or L2 cache of the same core, the consumer can access it without incurring bus traffic. The scheduling of this staggered producer-consumer model is illustrated in Figure 8-4.
Example 8-3 shows the basic structure of a thread function that can be used in this staggered producer-consumer model.
Example 8-3. Thread Function for the Staggered Producer-Consumer Model
// The master thread starts the first iteration; the other thread must wait one iteration
void producer_consumer_thread(int master)
{
    int mode = 1 - master;           // track which thread this is and its designated buffer index
    unsigned int iter_num = workamount >> 1;
    unsigned int i = 0;

    iter_num += master & workamount & 1;

    if (master)                      // the master thread starts the first iteration
    {
        produce(buffs[mode], count);
        signal(sigp[1 - mode], 1);   // notify the producer task in the follower thread that it can proceed
        consume(buffs[mode], count);
        signal(sigc[1 - mode], 1);
        i = 1;
    }
    for (; i < iter_num; i++)
    {
        waitforsignal(sigp[mode]);
        produce(buffs[mode], count);
        signal(sigp[1 - mode], 1);   // notify the producer task in the other thread
        waitforsignal(sigc[mode]);
        consume(buffs[mode], count);
        signal(sigc[1 - mode], 1);
    }
}
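For completeness, a hedged sketch (not from the original text) of how the two symmetric threads of Example 8-3 might be launched with POSIX threads; the adapter thread_entry and the function run_staggered_pair are invented names, and the sigp/sigc signaling objects are assumed to be initialized beforehand.

#include <pthread.h>

void producer_consumer_thread(int master);   // from Example 8-3

/* Adapter so producer_consumer_thread(int) can serve as a pthread entry point. */
static void *thread_entry(void *arg)
{
    producer_consumer_thread((int)(long)arg);
    return NULL;
}

/* Launch one master thread (argument 1) and one follower thread (argument 0). */
void run_staggered_pair(void)
{
    pthread_t t_master, t_follower;

    pthread_create(&t_master,   NULL, thread_entry, (void *)1L);
    pthread_create(&t_follower, NULL, thread_entry, (void *)0L);
    pthread_join(t_master, NULL);
    pthread_join(t_follower, NULL);
}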
8.2.3 Tools Used to Create Multi-threaded Applications
-- Translator's note: this subsection is essentially Intel promotional material (an advertorial). ~