Intel 64 and IA-32 Architectures Optimization Guide, Chapter 8: Multicore and Hyper-Threading Technology. Performance and Usage Models


This chapter describes software optimization techniques for multithreaded applications running in a multiprocessor (MP) system or in a processor with hardware-based multithreading support. A multiprocessor system is a system with two or more sockets, each containing a physical processor package. Intel 64 and IA-32 processors with hardware multithreading support include dual-core processors, quad-core processors, and processors supporting HT Technology. [Note: hardware multithreading support in Intel 64 and IA-32 processors can be detected by checking the feature flag CPUID leaf 01H: EDX[28]. A return value of 1 in bit 28 indicates that at least one form of hardware multithreading is present in the physical processor package. The number of logical processors per package can also be obtained from CPUID. The application must use an appropriate OS call to determine how many logical processors are enabled and available to the application.]

Computing throughput can increase in a multithreaded environment because additional hardware resources are available to exploit thread-level or task-level parallelism. Hardware resources can be added in several forms: physical processors, processor cores per package, and/or logical processors per core. Multithreaded optimization therefore applies across MP systems, multi-core processors, and HT Technology. Some microarchitectural resources are implemented differently across hardware multithreading configurations (for example, execution resources are not shared between different cores, but are shared between two logical processors in the same core when HT Technology is enabled). This chapter covers guidelines that apply to these scenarios.

This chapter covers:

● Performance characteristics and usage models

● Programming model for multi-threaded applications

● Software optimization techniques in five specific areas

8.1 Performance and Usage Models

The performance gains of using multiple processors, multi-core processors, or HT Technology are greatly affected by the usage model and the amount of concurrency in the control flow of the workload. Two common usage models are:

● Multi-threaded applications

● Multitasking with single-threaded applications

8.1.1 Multithreading

When an application uses multithreading to exploit the concurrency in a workload, the control flow of the multithreaded software can be divided into two parts: parallel tasks and sequential tasks.

Amdahl's law relates an application's performance gain to the degree of parallelism in its control flow. It is a useful guide for selecting the code modules, functions, or instruction sequences that are most likely to realize the greatest benefit from converting sequential tasks and control flows into parallel code, in order to exploit hardware support for multithreading.

Figure 8-1 illustrates how performance gains can be realized for any workload according to Amdahl's law. The bar in Figure 8-1 represents an individual task unit or the whole workload of an application.

In general, the speedup of running multiple threads on an MP system with N physical processors, relative to single-threaded execution, can be expressed as:

    Relative response = Tparallel / Tsequential = (1 - P) + P/N + O

where P is the fraction of the workload that can be executed in parallel and O represents the overhead of multithreading [Note: including thread creation, scheduling, synchronization, communication, and destruction], which varies between operating systems. The performance gain is the reciprocal of the relative response.

When optimizing application performance in a multithreaded environment, control-flow parallelism is likely to have the greatest impact on performance scaling with respect to the number of physical processors and the number of logical processors per physical processor.

If the control flow of a multithreaded application contains a workload in which only 50% can be executed in parallel, the maximum performance gain on two physical processors is only 33% over a single processor; with four processors the gain is at most 60%. It is therefore critical to maximize the portion of the control flow that can run in parallel. Improperly implemented thread synchronization can significantly increase the proportion of serial control flow and further reduce the application's performance scaling.

In addition to maximizing the parallelism of the control flow, interactions between threads, in the form of thread synchronization and imbalanced task scheduling, can also have a large impact on overall processor scaling.

Excessive cache misses are one contributor to unsatisfactory performance scaling. In a multithreaded execution environment, they can occur in the following situations:

● Aliased stack accesses by different threads in the same process [Note: each thread in a process has its own stack space, so when different threads execute in parallel, their stack accesses can interfere in a cache they share.]

● Cache line evictions due to thread contention

● False sharing of cache lines between different processors

Techniques that address each of these situations (and many other areas) are described in the sections of this chapter.

8.1.2 Multitasking Environment

Hardware multithreading in Intel 64 and IA-32 processors can exploit task-level parallelism when a workload consists of several single-threaded applications scheduled to run concurrently under an MP-capable operating system. In this environment, hardware multithreading can deliver higher throughput for the workload, although the relative performance of a single task (compared with the time to complete the same task in a single-threaded environment) will vary depending on how much shared execution resource and memory is available.

Several mainstream operating systems include OS kernel code that manages task scheduling and balances shared execution resources within each physical processor to maximize throughput.

Because applications run independently in a multitasking environment, thread synchronization is unlikely to limit the scaling of throughput. This is because the control flow of the workload can be 100% parallel [Note: a software tool that attempts to measure the throughput of a multitasking workload may itself introduce control flows that are not parallel; thread synchronization must then be considered an integral part of its performance measurement methodology.] (assuming no inter-processor communication and no system bus limitations).

With a multitasking workload, however, bus activity and cache access patterns are likely to affect the scaling of throughput. Running two copies of the same application, or running the same application in lock step, can expose an artifact of the performance measurement methodology: the access pattern to the first-level data cache can cause excessive cache misses and produce skewed (asymmetric) performance results. To solve this problem, the measurement should:

● Introduce a per-instance offset at the start of each copy of the application

● Introduce heterogeneity in the workload by using different datasets for each instance of the application

● Randomize the sequence in which multiple copies of the same application start

When two applications are employed as part of a multitasking workload, there is little synchronization overhead between the two processes. It is also important to ensure that each application has minimal synchronization overhead within itself.

An application that uses a lengthy spin-wait loop for intra-process synchronization is unlikely to benefit from HT Technology in a multitasking workload, because critical execution resources are consumed by the long spin-wait loop.
