Article Title: Surpassing a single CPU: hyper-threading accelerates Linux
Introduction

Intel's hyper-threading technology allows one physical processor to present two logical processors by replicating, partitioning, and sharing resources in the Intel NetBurst microarchitecture pipeline.
Replicated resources create a copy for each of the two threads:
The full architecture state of each logical CPU
Instruction pointers and rename logic
Some smaller resources (such as the return stack predictor and the ITLB)
Partitioned resources are divided between the executing threads:
Several buffers (the re-order buffer, load/store buffers, queues, etc.)
Shared resources are used by the two executing threads as needed:
The out-of-order execution engine
The caches
Normally, each physical processor has one architecture state on its core to serve threads. With HT, each physical processor has two architecture states on a single core, which makes the physical processor appear as two logical processors serving threads. The system BIOS enumerates each architecture state in the physical processor. Because a hyper-threading-aware operating system makes use of the logical processors, such an operating system sees twice the number of processors available to serve threads.
Hyper-threading support in Xeon processors

Among general-purpose processors, the Xeon was the first to implement simultaneous multithreading (SMT) (see references for more information on the Xeon processor family). To execute two threads on a single physical processor, the processor maintains the context of both threads at the same time, which allows the scheduler to dispatch two potentially unrelated threads concurrently.
The operating system (OS) schedules and dispatches threads to each logical processor, just as it does in an SMP system. When a thread is not dispatched, the associated logical processor remains idle.
When a thread is scheduled and dispatched to logical processor LP0, hyper-threading technology uses the necessary processor resources to execute the thread.
When a second thread is scheduled and dispatched to the second logical processor, LP1, resources are replicated, partitioned, or shared as needed to execute that thread. At each point in the pipeline, the processor arbitrates between the two threads and controls their progress. As each thread finishes, the operating system idles the unused logical processor, freeing resources for the running logical processor.
The OS schedules and dispatches threads to each logical processor, just as it does in a dual-processor or multiprocessor system. As the system schedules threads and introduces them into the pipeline, resources are used as needed to process both threads.
Hyper-threading support in Linux kernel 2.4

The Linux kernel treats a hyper-threaded processor with its two virtual processors as a pair of real physical processors. As a result, the scheduler that handles SMP can also handle hyper-threading. Hyper-threading support in the Linux 2.4.x kernel started with 2.4.17 and includes the following enhancements:
128-byte lock alignment
Spin-wait loop optimization
Non-execution-based delay loop
Detection of hyper-threading-enabled processors, with the logical processors started as though the machine were SMP
Serialization in the MTRR and microcode update drivers, because they affect shared state
A scheduler optimization so that, when the system is idle, scheduling on an idle physical processor is preferred over scheduling on a logical processor of an already busy physical processor
Offset user stack to avoid 64K aliasing
Kernel performance measurement
To assess the impact of hyper-threading on Linux kernel performance, we measured the performance of kernel benchmarks on a system with an Intel Xeon processor with the HT feature. Hardware: an SMT-capable single-CPU 1.6 GHz Xeon MP processor, 2.5 GB of RAM, and two 9.2 GB SCSI disk drives. The measured kernel was the stock v2.4.19 kernel configured and built with SMP support. Hyper-threading was enabled with the boot option acpismp=force and disabled with the boot option noht. To check whether hyper-threading is active, run cat /proc/cpuinfo and look for the presence of both processor 0 and processor 1. Note the ht flag in Listing 1 for CPUs 0 and 1. If hyper-threading is not enabled, only data for processor 0 is shown.
Listing 1. Output of cat /proc/cpuinfo, showing hyper-threading support

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 1
model name      : Intel(R) Genuine CPU 1.60GHz
stepping        : 1
cpu MHz         : 1600.382
cache size      : 256 KB
...
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 3191.60

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 1
model name      : Intel(R) Genuine CPU 1.60GHz
stepping        : 1
cpu MHz         : 1600.382
cache size      : 256 KB
...
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 3198.15
Linux kernel benchmarks

Five benchmarks were used to measure Linux kernel performance: LMbench, AIM Benchmark Suite IX (AIM9), chat, dbench, and tbench. LMbench times various Linux application programming interfaces (APIs), such as basic system calls, context-switching latency, and memory bandwidth. AIM9 provides a measurement of user application workload. The chat benchmark mimics the client/server workload of a chat room. dbench is a file-server workload, and tbench is a TCP workload. chat, dbench, and tbench are multithreaded benchmarks; the others are single-threaded.
Impact of hyper-threading on Linux APIs
The impact of hyper-threading on Linux APIs was measured with LMbench, a microbenchmark containing a suite of bandwidth and latency measurements. Among the items measured are cached file reads, memory copies (bcopy), memory reads and writes (and their latencies), pipes, context switching, networking, file system creation and deletion, process creation, signal handling, and processor clock latency. LMbench focuses on measuring the following kernel components: the scheduler, process management, communication, networking, memory mapping, and the file system. These low-level kernel primitives provide a good indicator of underlying hardware capability and performance.
To study the effects of hyper-threading, we focus on the latency measurements, which measure the time to exercise a control message (in other words, how fast the system can perform an operation). Latencies are recorded in microseconds per operation.
Table 1 lists some of the kernel functions tested with LMbench. Each data point is the average of three runs, and the data was tested for convergence to make sure it is reproducible in the same test environment. In general, functions that run as a single thread show no performance difference with and without hyper-threading. However, for tests that require two threads to run (for example, the pipe latency test and the three process latency tests), hyper-threading appears to lengthen their latencies. The stock SMP-configured kernel is labeled 2419s. When that kernel was booted without hyper-threading support, it is labeled 2419s-noht; with hyper-threading support, it is labeled 2419s-ht.
Table 1. Effects of hyper-threading on Linux APIs
The pipe latency test uses two processes communicating through a UNIX pipe to measure interprocess communication latency. The benchmark passes a token back and forth between the two processes. The 1% performance degradation is negligible.
The three process tests involve creating and running processes in Linux; their purpose is to measure the time needed to create a basic thread of control. In the process fork+exit test, the data denotes the latency of splitting a process into two (nearly) identical copies and having one of them exit. This is how new processes are created, but it is not very useful by itself, since both processes do the same thing. In this test, hyper-threading caused a 4% performance degradation.
In the process fork+execve test, the data denotes the time it takes to create a new process and have that new process run a new program. This is the inner loop of all shells (command interpreters). With hyper-threading, this test showed a 6% performance degradation.
In the process fork+/bin/sh -c test, the data denotes the time it takes to create a new process and have it run a new program by asking the system shell to find that program and run it. This is how the C library interface system() is implemented; it is the most general, and the most expensive, of these calls. With hyper-threading, this test ran 2% slower than without it.
Impact of hyper-threading on Linux single-user application workloads
AIM9 is a single-user workload designed to measure the performance of hardware and the operating system. The results are shown in Table 2. Most tests in this benchmark performed the same with and without hyper-threading, with the exception of the synchronous file operations and Integer Sieve. The three synchronous file operations (Sync Random Disk Writes, Sync Sequential Disk Writes, and Sync Disk Copies) were each about 35% slower with hyper-threading. Conversely, the Integer Sieve test was 60% faster with hyper-threading than without.
Table 2. Effects of hyper-threading on AIM9 workloads
Impact of hyper-threading on Linux multithreaded application workloads
To measure the impact of hyper-threading on Linux multithreaded applications, we use the chat benchmark, which mimics a chat room. The benchmark includes both a client and a server. The client side of the benchmark reports the number of messages sent per second; the number of chat rooms and messages controls the workload. The workload creates many threads and TCP/IP connections, and sends and receives many messages. It uses the following default parameters: