Intel 64 and IA-32 Architectures Optimization Reference Manual, Chapter 8 (Multi-Core and Hyper-Threading Technology): Programming Models and Multithreading


8.2 Programming Models and Multithreading

Parallel design is the most important concept in designing a multithreaded application and achieving optimal performance scaling on multiprocessor systems. An optimized multithreaded application exhibits large-scale concurrency, or minimal dependencies, in the following areas:

● Workload

● Thread Interaction

● Hardware Utilization

The key to maximizing workload concurrency is to identify multiple tasks with minimal mutual dependencies within an application and create independent threads for these tasks to execute in parallel.

Concurrent execution of independent threads is the essence of deploying a multi-threaded application in a multi-processor system. Managing the interaction between threads to minimize the thread synchronization cost is also critical to achieving optimal performance growth with multiple processors.

Efficient use of hardware resources by concurrent threads requires optimization techniques in specific areas to prevent contention for those resources. Programming techniques for optimizing thread synchronization and managing other hardware resources are discussed in the subsequent sections.

Next we will discuss the parallel programming model.

8.2.1 Parallel Programming Model

There are two common programming models for converting an application's independent tasks into threads:

● Domain Decomposition

● Functional Decomposition

8.2.1.1 Domain Decomposition

Generally, large-scale compute-intensive tasks use data sets that can be divided into smaller subsets, each of which can be processed largely independently. Examples:

● Computing a discrete cosine transform (DCT) on two-dimensional data: divide the data into several subsets and create a thread for each subset to compute its transform.

● Matrix multiplication: here, a thread can be created to compute the product of one half of the matrix with the multiplier matrix.

Domain decomposition is a programming model based on creating identical or similar threads that independently process smaller pieces of data. [Translator's note: "identical" here means that the threads all execute the same routine.] This model can take advantage of the duplicated execution resources in a traditional multiprocessor system. It can also use the execution resources shared between the two logical processors under HT Technology, because a single data-domain thread typically consumes only part of the execution resources available on the chip.

8.2.2 Functional Decomposition

Applications usually process many different tasks using different functions and many unrelated data sets. For example, a video codec requires several different processing functions, including DCT, motion estimation, and color conversion. Using a functional threading model, an application can create independent threads for motion estimation, color conversion, and other functional tasks.

Functional decomposition affords more flexible thread-level parallelism because it relies less on the duplication of hardware resources. For example, a thread executing a sorting algorithm and a thread executing a matrix multiplication routine are unlikely to need the same execution units at the same time. A design that recognizes this can take advantage of traditional multiprocessor systems as well as multiprocessor systems that support HT Technology.

8.2.3 Specialized Programming Models

The Intel Core Duo processor and processors based on the Intel Core microarchitecture provide an L2 cache shared by the two processor cores in the same physical package. This gives two application threads the opportunity to share data in cache while minimizing the bus traffic load.

Multithreaded applications may need specialized programming models to take advantage of this kind of hardware feature. One such scenario is producer-consumer: one thread writes data to a destination buffer (ideally resident in the L2 cache), and another thread, running on the other core of the same physical package, then reads the data produced by the first thread.

The basic approach to implementing a producer-consumer model is to create two threads: one producer and one consumer. Typically, the producer and consumer take turns working on a buffer and notify each other when they are ready to exchange buffers. In a producer-consumer model, there is some thread-synchronization overhead when buffers are exchanged between producer and consumer. To achieve optimal scaling for a given number of cores, this synchronization overhead must be kept to a minimum. This ensures that the producer and consumer threads have comparable time constants for completing each incremental task before exchanging buffers.

Example 8-1 shows the code structure of a sequence of task units executed in a single thread. Each task unit (either a producer or a consumer) executes serially (see Figure 8-2). In the equivalent multithreaded scenario, each producer-consumer pair is wrapped in a thread function, and two threads can be scheduled concurrently on the available processor resources.

Example 8-1: Serial execution of producer and consumer work items

    for (i = 0; i < number_of_iterations; i++) {
        producer(i, buff);   // pass buffer index and buffer address
        consumer(i, buff);
    }

8.2.3.1 Producer-Consumer Threading Model

Figure 8-3 illustrates the basic interaction pattern between a pair of producer and consumer threads. The horizontal axis represents time. Each block represents one task unit, processing the buffer assigned to a thread.

The gaps between tasks represent synchronization overhead. The numbers in parentheses are buffer indices. On an Intel Core Duo processor, the producer thread can store data in the L2 cache and let the consumer thread continue working from there, minimizing the need for bus traffic.

The basic structure of the producer and consumer thread functions, synchronizing via a buffer index, is shown in Example 8-2.

Example 8-2: Basic structure of implementing producer-consumer threads

    // (a) Basic structure of a producer thread function
    void producer_thread()
    {
        int iter_num = workamount - 1;     // make a local copy
        int mode1 = 1;                     // track the two buffers with 0 and 1
        produce(buffs[0], count);          // placeholder function
        while (iter_num--) {
            signal(&signal1, 1);           // tell the other thread it can start work
            produce(buffs[mode1], count);  // placeholder function
            waitforsignal(&end);
            mode1 = 1 - mode1;             // switch to the other buffer
        }
    }

    // (b) Basic structure of a consumer thread
    void consumer_thread()
    {
        int mode2 = 0;                     // start with buffer 0 on the first iteration, then alternate
        int iter_num = workamount - 1;
        while (iter_num--) {
            waitforsignal(&signal1);
            consume(buffs[mode2], count);  // placeholder function
            signal(&end, 1);
            mode2 = 1 - mode2;
        }
        consume(buffs[mode2], count);
    }

It is also possible to construct the producer-consumer model in an interleaved fashion, which minimizes bus traffic and is also effective on multi-core processors that do not share an L2 cache.

In this interleaved variant of the producer-consumer model, each scheduling quantum of an application thread consists of a producer task and a consumer task. Two identical threads are created and executed concurrently. Within a thread's scheduling quantum, the producer task starts first and the consumer task follows on completion of the producer task; both tasks operate on the same buffer. As each task completes, one thread signals the other thread that the corresponding task may use its assigned buffer. In this way, producer and consumer tasks execute in parallel across the two threads. As long as the data produced by the producer resides in the L1 or L2 cache of the same core, the consumer can access it directly without incurring bus traffic. The scheduling of the interleaved producer-consumer model is illustrated in Figure 8-4.

Example 8-3 shows the basic structure of a thread function that can be used in this interleaved producer-consumer model.

Example 8-3: Thread function for the interleaved producer-consumer model

    // The master thread starts the first iteration; the other thread must wait one iteration
    void producer_consumer_thread(int master)
    {
        int mode = 1 - master;             // track which thread this is and its assigned buffer index
        unsigned int iter_num = workamount >> 1;
        unsigned int i = 0;
        iter_num += master & workamount & 1;
        if (master) {                      // the master thread starts the first iteration
            produce(buffs[mode], count);
            signal(sigp[1 - mode], 1);     // notify the other thread's consumer task that it can proceed
            consume(buffs[mode], count);
            signal(sigc[1 - mode], 1);
            i = 1;
        }
        for (; i < iter_num; i++) {
            waitforsignal(sigp[mode]);
            produce(buffs[mode], count);
            signal(sigp[1 - mode], 1);     // notify the consumer task
            waitforsignal(sigc[mode]);
            consume(buffs[mode], count);
            signal(sigc[1 - mode], 1);
        }
    }

8.2.4 Tools for Creating Multithreaded Applications

[Translator's note: this section is essentially Intel promotional material.]
