Summary of Power Management Technology in Linux 2.6 kernel

Source: Internet
Author: User

Preface

 

This series of articles looks at the energy-saving technologies that have been steadily added to hardware in recent years (CPUs, chipsets, and newer bus standards and peripherals such as PCI Express) and at how the Linux 2.6 kernel and the rest of the software stack (kernel, middleware, and user-space applications) have added support for them. Along the way it shows how much progress operating systems have made in power management in recent years, and where that work is heading.

As the first article in the series, this one introduces cpufreq, a kernel subsystem new in Linux 2.6 that was added to better support the frequency scaling technology that has appeared in mainstream CPUs in recent years.

Origin of cpufreq

As the ideas of energy-efficient computing and performance per watt have taken hold, and as the ACPI (Advanced Configuration and Power Interface) standard has developed, mainstream CPUs now provide frequency scaling support. Examples include Enhanced SpeedStep on Intel processors and PowerNow! on AMD processors; recent PowerPC, ARM, and other processors provide similar support. The Linux 2.6 kernel supports processors with frequency scaling technology. Note that frequency scaling as discussed here is different from the more widely known practice of overclocking. Overclocking means making the processor run at a non-standard frequency, for example by raising the core voltage, and it can have serious consequences such as shortening the life of the CPU and reducing system stability.

Frequency scaling, by contrast, means that the CPU hardware itself supports running at several different frequencies. While the system is running, it can switch dynamically between these frequencies as the load changes, striking a balance between performance and power consumption.


Although many processor manufacturers support frequency scaling, their hardware implementations and usage differ, sometimes slightly and sometimes greatly. Each manufacturer therefore had to add code to the kernel, tailored to its own hardware, so that the frequency scaling technology in its products would be supported on Linux. The consequence of this development model was that each manufacturer's code ended up scattered across the Linux kernel source tree, with nothing shared between implementations. This imposed a large cost on kernel maintenance and on supporting new products, and it led directly to the birth of the cpufreq kernel subsystem.

As mentioned above, the purpose of frequency scaling is to let the system adjust the CPU operating frequency at any time as the load changes. This splits into two questions: "what to do" and "how to do it". "What to do" means choosing an appropriate operating frequency for the CPU based on changes in system load; "how to do it" means programming the CPU so that it really runs at the chosen frequency. This is the familiar distinction between policy and mechanism in software design: well-designed software keeps the two clearly separated and lets them communicate only through well-defined interfaces.

Design and use of cpufreq

To solve the problems described above, cpufreq came into being as a new kernel subsystem. It provides a unified framework for supporting the frequency scaling technology of different CPUs in the Linux kernel. Its software structure is shown in Figure 1.

Figure 1. Software Structure of cpufreq
 

As shown in Figure 1, cpufreq consists of three main parts: the cpufreq module, the CPU-specific drivers, and the governors.

The cpufreq module encapsulates and abstracts two things: how the lower layer supports each CPU's frequency scaling technology, and how the upper layer dynamically selects an appropriate operating frequency based on system load. It defines clear interfaces between the two, and in this way achieves the separation of mechanism and policy described above.

Below the cpufreq module, each CPU manufacturer only needs to provide a CPU-specific frequency scaling driver. Intel, for example, provides a driver that supports Enhanced SpeedStep, while AMD provides one that supports PowerNow!.

Above the cpufreq module, a governor acts as the decision maker. Following its own criteria, it selects a suitable target frequency for the CPU at the appropriate time and then uses the interfaces defined by the cpufreq module to have the underlying CPU-specific driver set the CPU to that frequency.

The current Linux kernel provides five governors to choose from: performance, powersave, userspace, conservative, and ondemand. Each uses different criteria to select the operating frequency and suits different scenarios. Only one governor can be active at a time, but you can switch to another governor while the system is running as your requirements change.

The benefit of this design is that governors and CPU-specific drivers can be developed independently of each other, with as much code shared as possible. Kernel developers writing and experimenting with a new governor no longer get bogged down in the hardware details of a particular CPU's frequency scaling implementation, and a CPU manufacturer adding support for its frequency scaling technology only has to provide a relatively simple driver, without having to consider how to choose the right frequency for each scenario.
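The split between governor (policy) and CPU-specific driver (mechanism) can be illustrated with a small sketch. This is not kernel code: the names Driver, Governor, FakeDriver, and apply_policy are hypothetical and exist only for this illustration, and the two governors shown simply always pick the highest or the lowest frequency.

```python
from typing import Protocol, Sequence


class Driver(Protocol):
    """Mechanism: knows how to make one kind of CPU run at a given frequency."""

    def available_frequencies(self) -> Sequence[int]: ...
    def set_frequency(self, khz: int) -> None: ...


class Governor(Protocol):
    """Policy: decides which of the offered frequencies to run at."""

    def choose(self, frequencies: Sequence[int], load: float) -> int: ...


class FakeDriver:
    """Stand-in for a CPU-specific driver; it only records the request."""

    def __init__(self, frequencies: Sequence[int]) -> None:
        self._frequencies = list(frequencies)
        self.current = min(self._frequencies)

    def available_frequencies(self) -> Sequence[int]:
        return self._frequencies

    def set_frequency(self, khz: int) -> None:
        if khz not in self._frequencies:
            raise ValueError(f"unsupported frequency: {khz}")
        self.current = khz


class PerformanceGovernor:
    def choose(self, frequencies, load):
        return max(frequencies)   # static policy: always the highest frequency


class PowersaveGovernor:
    def choose(self, frequencies, load):
        return min(frequencies)   # static policy: always the lowest frequency


def apply_policy(governor: Governor, driver: Driver, load: float) -> int:
    """Glue layer: ask the policy for a target, hand it to the mechanism."""
    target = governor.choose(driver.available_frequencies(), load)
    driver.set_frequency(target)
    return target
```

Because apply_policy only talks to the two interfaces, a new governor can be tried against any driver, and a new driver works with every existing governor, which is the point of the design.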

The cpufreq subsystem exposes its user interface to applications through the sysfs file system. For each CPU in the system, the cpufreq sysfs interface is located in the directory /sys/devices/system/cpu/cpuX/cpufreq/, where X is the processor ID, matching the information in /proc/cpuinfo. Taking cpu0 as an example, a user will typically see the following files in this directory:

$ ls -F /sys/devices/system/cpu/cpu0/cpufreq/
affected_cpus
cpuinfo_cur_freq
cpuinfo_max_freq
cpuinfo_min_freq
ondemand/
scaling_available_frequencies
scaling_available_governors
scaling_cur_freq
scaling_driver
scaling_governor
scaling_max_freq
scaling_min_freq
stats/

All readable files can be read with the cat command, and all writable files can be written with the echo command. cpuinfo_max_freq and cpuinfo_min_freq give the maximum and minimum operating frequencies supported by the CPU hardware, and cpuinfo_cur_freq reads the current operating frequency from the CPU hardware registers. Although the hardware supports several operating frequencies, in some situations you may want to restrict the choice to a subset of them. This is done through scaling_max_freq and scaling_min_freq: when selecting an operating frequency, the governor only chooses within the range bounded by these two values, from among the frequencies listed in scaling_available_frequencies. Unlike cpuinfo_cur_freq, scaling_cur_freq returns the current operating frequency as cached by the cpufreq module, without querying the hardware registers. scaling_available_governors lists the governors currently available, scaling_governor shows the governor in use, and scaling_driver shows the frequency scaling driver in use for this CPU. The stats directory provides statistics on how the operating frequencies are used, such as how long the CPU has spent at each frequency and how many transitions have taken place between frequencies. The ondemand directory belongs to the ondemand governor and is described later.
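As a rough illustration of reading these files from a program instead of with cat, the following sketch collects every readable attribute of one CPU into a dictionary. The function names read_cpufreq and khz_to_mhz and the base parameter are made up for this example; the only things taken from the article are the directory layout and the fact that the values are plain text (frequencies are in kHz, as the 1667000 value for 1.67 GHz later in the article shows).

```python
from pathlib import Path

CPUFREQ_DIR = "/sys/devices/system/cpu/cpu{cpu}/cpufreq"


def read_cpufreq(cpu: int = 0, base: str = CPUFREQ_DIR) -> dict:
    """Return the cpufreq attributes of one CPU as a dict of strings.

    Files that are missing or unreadable (some may require root)
    are skipped rather than treated as errors.
    """
    directory = Path(base.format(cpu=cpu))
    info = {}
    if not directory.is_dir():
        return info
    for entry in sorted(directory.iterdir()):
        if not entry.is_file():
            continue              # skip subdirectories such as stats/
        try:
            info[entry.name] = entry.read_text().strip()
        except OSError:
            continue
    return info


def khz_to_mhz(khz: str) -> float:
    """cpufreq reports frequencies in kHz; convert for display."""
    return int(khz) / 1000.0


if __name__ == "__main__":
    for name, value in read_cpufreq(0).items():
        print(f"{name}: {value}")
```

On a machine without cpufreq support the directory does not exist and the function simply returns an empty dictionary.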

With the introduction above, readers should have a rough understanding of how to use the interface that cpufreq provides through sysfs. For most users, however, working with these files one at a time is laborious. Dominik Brodowski et al. therefore developed the cpufrequtils toolkit [2], which offers a simpler interface to the kernel cpufreq subsystem. With cpufreq-info, the reader can see at a glance the contents of the files in /sys/devices/system/cpu/cpuX/cpufreq/ just described.

$ cpufreq-info
cpufrequtils 002: cpufreq-info (C) Dominik Brodowski 2004-2006
Report errors and bugs to linux@brodo.de, please.
analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which need to switch frequency at the same time: 0 1
  hardware limits: 1000 MHz - 1.67 GHz
  available frequency steps: 1.67 GHz, 1.33 GHz, 1000 MHz
  available cpufreq governors: userspace, conservative, ondemand, powersave, performance
  current policy: frequency should be within 1000 MHz and 1.67 GHz.
                  The governor "ondemand" may decide which speed to use
                  within this range.
  current CPU frequency is 1000 MHz.
analyzing CPU 1:
  driver: acpi-cpufreq
  CPUs which need to switch frequency at the same time: 0 1
  hardware limits: 1000 MHz - 1.67 GHz
  available frequency steps: 1.67 GHz, 1.33 GHz, 1000 MHz
  available cpufreq governors: userspace, conservative, ondemand, powersave, performance
  current policy: frequency should be within 1000 MHz and 1.67 GHz.
                  The governor "ondemand" may decide which speed to use
                  within this range.
  current CPU frequency is 1000 MHz.

Origin and implementation of the ondemand governor

As the output of cpufreq-info just showed, the cpufreq subsystem provides five governors to choose from: userspace, conservative, ondemand, powersave, and performance. In the latest kernel, ondemand is used as the default governor unless configured otherwise. To understand why, let us review the history of governors in the cpufreq subsystem.

When cpufreq was first added to the Linux kernel as a subsystem, it came with only three governors: performance, powersave, and userspace. With the performance governor, the CPU runs at the highest operating frequency it supports; with the powersave governor, it runs at the lowest. Both are therefore static governors: while they are in use, the CPU operating frequency is not adjusted as the load changes. They correspond to two extremes, with performance pursuing maximum system performance and powersave pursuing minimum power consumption. These needs do exist, but most of the time most users want a more flexible scaling policy. The earliest cpufreq subsystem offered that flexibility through the userspace governor. As its name suggests, the userspace governor hands the scaling decision to a user-space application and provides an interface through which that application can adjust the CPU operating frequency. After using cpufreq-set from the cpufrequtils toolkit to make userspace the active governor, an additional file named scaling_setspeed appears in /sys/devices/system/cpu/cpuX/cpufreq/. This is the interface provided by the userspace governor: writing any frequency listed in scaling_available_frequencies to it sets the CPU to run at that frequency.

# cpufreq-set -g userspace
# cat cpuinfo_cur_freq
1000000
# cat scaling_available_frequencies
1667000 1333000 1000000
# echo 1333000 > scaling_setspeed
# cat cpuinfo_cur_freq
1333000
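The same steps can be sketched in a program. The helper below is hypothetical (set_speed is not part of cpufrequtils or the kernel); it assumes the userspace governor is already active, refuses any frequency not listed in scaling_available_frequencies, and then does the equivalent of the echo command above. Writing to the real sysfs files normally requires root.

```python
from pathlib import Path


def set_speed(cpufreq_dir: str, khz: int) -> None:
    """Ask the userspace governor to run the CPU at `khz`.

    Raises if userspace is not the active governor or if the frequency
    is not listed in scaling_available_frequencies.
    """
    directory = Path(cpufreq_dir)
    governor = (directory / "scaling_governor").read_text().strip()
    if governor != "userspace":
        raise RuntimeError(f"active governor is {governor!r}, not 'userspace'")
    available = [
        int(f)
        for f in (directory / "scaling_available_frequencies").read_text().split()
    ]
    if khz not in available:
        raise ValueError(f"{khz} kHz not in {available}")
    # Equivalent to: echo <khz> > scaling_setspeed
    (directory / "scaling_setspeed").write_text(f"{khz}\n")
```

A call such as set_speed("/sys/devices/system/cpu/cpu0/cpufreq", 1333000) would correspond to the session shown above.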

As just mentioned, the userspace governor hands the scaling decision to a user-space application. This application is usually a daemon that collects system information at a fixed interval and, based on the system load, adjusts the CPU operating frequency through the scaling_setspeed interface. Typical examples are powersaved and cpuspeed. These daemons usually compute the CPU load over the sampling period every few seconds and adjust the CPU operating frequency accordingly. Although combining the userspace governor with a user-space daemon gives users some flexibility, feedback from widespread use in the open-source community gradually exposed two serious defects.

The first is performance. Suppose, for example, that powersaved samples the system load every five seconds, and consider the following scenario. A sampling point has just passed, and because the load during that period was low, the CPU has been set to its lowest frequency. If the user now starts a CPU-hungry program such as Firefox, powersaved will not notice the increased demand until the next sampling point, about five seconds later. For the first five seconds of Firefox's startup, the CPU's computing capacity is not fully used, which inevitably hurts the user experience.

The second is the accuracy of load sampling and analysis. Leaving it to a user-space program to monitor system load and predict future CPU performance requirements is not really reasonable. It is difficult for a user-space program to collect all the information it needs, because most of that information is kept in kernel space. To collect it, data must be exchanged between user space and the kernel, and frequent exchanges of this kind have a negative impact on system performance.
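The response lag in the Firefox scenario is simple arithmetic, sketched here under the assumption that the daemon samples at fixed points t = 0, interval, 2 x interval, and so on, and can only react at a sampling point. The function name reaction_delay_ms is invented for this example.

```python
def reaction_delay_ms(event_ms: int, interval_ms: int) -> int:
    """Time from a load spike until the next sampling point of a polling daemon.

    Assumes samples are taken at t = 0, interval, 2 * interval, ...
    A spike that lands exactly on a sampling point is only seen at the
    following one, because each sample summarizes the period before it.
    """
    if interval_ms <= 0:
        raise ValueError("interval must be positive")
    next_sample = -(-event_ms // interval_ms) * interval_ms   # round up
    if next_sample == event_ms:
        next_sample += interval_ms
    return next_sample - event_ms


# A spike 100 ms after a sample, with a 5 s sampling interval:
print(reaction_delay_ms(100, 5000))   # 4900 ms before the daemon can react
```

With a much shorter interval the same spike is noticed far sooner, which is why reducing the sampling interval is the obvious first idea.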


Are there solutions to these two problems? Developers in the community quickly agreed on the second one: since the problem arises from sampling and analyzing the load in user space, the kernel should take over this work. But what about the first? The most intuitive fix is to shorten the interval between load samples, for example in powersaved, so that changes in load are noticed sooner. This simple fix has two problems of its own. First, it means even more frequent data exchange between user space and the kernel, and therefore an even greater negative impact on system performance. Second, the frequency scaling hardware of the time was still immature: changing the CPU frequency took too long. Early Intel SpeedStep, for example, needed 250 microseconds to change the CPU frequency, during which the CPU could not execute instructions normally. A simple calculation shows that even for a 1 GHz CPU, 250 microseconds is 250,000 clock cycles, time in which the CPU could have executed tens of thousands of instructions. From this point of view, simply shortening the sampling interval would hurt system performance even more.

Fortunately, hardware has kept improving. Intel's latest Enhanced SpeedStep technology, for example, has cut the time needed for a frequency change to 10 microseconds, an improvement of more than an order of magnitude. This progress gave kernel developers the opportunity to solve these long-standing problems. Venkatesh et al. proposed and designed a new governor, ondemand: the long-awaited governor that works entirely in the kernel and can sample and analyze the system load at much finer intervals. Before describing how the ondemand governor is implemented, let us look at how it is used and what interfaces it offers. After ondemand is made the active governor with cpufreq-set, a subdirectory named ondemand appears in /sys/devices/system/cpu/cpuX/cpufreq/.

$ sudo cpufreq-set -g ondemand
$ ls /sys/devices/system/cpu/cpu0/cpufreq/ondemand/
ignore_nice_load
powersave_bias
sampling_rate
sampling_rate_max
sampling_rate_min
up_threshold
$ sudo cat sampling_rate_min sampling_rate sampling_rate_max
40000
80000
40000000
$ sudo cat up_threshold
30

The three files whose names begin with sampling_rate give, respectively, the minimum sampling interval allowed by the ondemand governor, the sampling interval currently in use, and the maximum sampling interval allowed. All three are in microseconds.
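To see why the time needed for a frequency change limits how short the sampling interval can usefully be, the figures quoted in this article (250 microseconds for early SpeedStep, 10 microseconds for Enhanced SpeedStep, a 1 GHz clock) can be put into a small calculation. The function names are invented for this sketch, and the overhead estimate assumes at most one frequency change per sampling period.

```python
def cycles_lost(freq_hz: int, latency_us: int) -> int:
    """Clock cycles that elapse during one frequency change."""
    return freq_hz * latency_us // 1_000_000


def transition_overhead(latency_us: int, sampling_rate_us: int) -> float:
    """Fraction of one sampling period spent changing frequency,
    assuming at most one change per period."""
    if sampling_rate_us <= 0:
        raise ValueError("sampling interval must be positive")
    return latency_us / sampling_rate_us


print(cycles_lost(1_000_000_000, 250))   # 250000 cycles at 1 GHz
print(cycles_lost(1_000_000_000, 10))    # 10000 cycles at 1 GHz
```

For a sampling_rate of 80000 microseconds, transition_overhead(250, 80000) is about 0.3% of each period and transition_overhead(10, 80000) about 0.0125%; at the 40000 microsecond minimum the fractions double.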

On the author's computer, for example, the ondemand governor samples every 80 milliseconds. Another important file is up_threshold, which gives the percentage of system load above which the ondemand governor automatically raises the CPU operating frequency. On the author's computer this value is 30%. How is this load percentage obtained? CPUs that support Intel's latest Enhanced SpeedStep technology provide two MSRs (model-specific registers) directly in hardware to help the ondemand governor sample and analyze the system load. They are named IA32_MPERF and IA32_APERF [5], where MPERF stands for maximum performance and APERF for actual performance. As the names suggest, IA32_MPERF is a counter that, while the CPU is in the ACPI C0 state, increments at a rate corresponding to the highest operating frequency supported by the hardware, and IA32_APERF is a counter that, while the CPU is in the ACPI C0 state, increments at a rate corresponding to the actual operating frequency. With these two registers, together with the ratio of time the CPU spends in the ACPI C0 state to the time it spends in C1, C2, and C3 (that is, working versus sleeping), the ondemand governor can calculate the CPU load accurately.
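One plausible way to combine these quantities is sketched below. It is an assumption made for illustration, not the kernel's actual code: the Sample fields and the function load_vs_max are hypothetical, counter wraparound is ignored, and the result expresses the work done in an interval as a fraction of what the CPU could have done running continuously at its maximum frequency.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    aperf: int      # IA32_APERF reading
    mperf: int      # IA32_MPERF reading
    busy_us: int    # cumulative time spent in ACPI C0
    total_us: int   # cumulative wall-clock time


def load_vs_max(prev: Sample, cur: Sample) -> float:
    """Load over the interval, relative to running flat out at max frequency."""
    d_total = cur.total_us - prev.total_us
    d_mperf = cur.mperf - prev.mperf
    if d_total <= 0 or d_mperf <= 0:
        return 0.0
    # Share of the interval spent working (C0) rather than sleeping.
    busy_fraction = (cur.busy_us - prev.busy_us) / d_total
    # Both counters advance only in C0, so the ratio of their deltas is the
    # average actual frequency relative to the maximum frequency.
    freq_ratio = (cur.aperf - prev.aperf) / d_mperf
    return busy_fraction * freq_ratio
```

For example, a CPU that was in C0 for half of an 80 ms interval while running at 60% of its maximum frequency would come out at a load of 0.5 x 0.6 = 0.3 by this measure.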

Once the CPU load is known, the next question is how to choose a suitable operating frequency. As mentioned, when the system load exceeds the percentage set in up_threshold, the ondemand governor automatically raises the CPU operating frequency. But to which frequency? The ondemand governor takes a load above up_threshold to mean that the user needs more processing power, so it sets the CPU to run at the highest frequency. Developers and users in the community have no objection to this. But when the ondemand governor detects that the load has dropped and the frequency can be lowered, which frequency should it choose? The initial implementation stepped down to the next lower frequency among those available. The author's CPU, for example, supports three frequencies: 1.67 GHz, 1.33 GHz, and 1 GHz. If the ondemand governor found it could step down while the CPU was running at 1.67 GHz, 1.33 GHz became the target. The idea behind this strategy was to minimize the impact on system performance, so that performance would not drop sharply in a short time and hurt the user experience. After the first version of the ondemand governor was released, however, feedback from a large number of users showed that this concern was unnecessary: the governor could afford to choose a more aggressive target when stepping down. The latest ondemand governor therefore picks, in a single step and from among all available frequencies, a frequency at which the CPU would be working at more than 80% load. If none of the available frequencies meets this requirement, it selects the lowest supported frequency. Tests by a large number of users show that the new algorithm saves energy more effectively without affecting system performance. After this improvement the ondemand governor kept its name, and its initial implementation was also kept, under the name conservative because of its more conservative algorithm.
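The difference between the two step-down strategies can be made concrete with a short sketch. It follows the description above rather than the kernel source, and it makes one assumption the text leaves open: when several frequencies would keep the CPU at least 80% busy, it picks the highest of them. The function names are invented for the example.

```python
def conservative_step_down(freqs, cur):
    """Original behaviour: drop to the next lower available frequency."""
    lower = [f for f in freqs if f < cur]
    return max(lower) if lower else cur


def ondemand_step_down(freqs, cur, load, target_load=0.80):
    """Later behaviour, under one reading of the text.

    Pick, in one step, the highest frequency (not above the current one)
    at which the same work would keep the CPU at least `target_load` busy.
    If no frequency qualifies, fall back to the lowest one.
    """
    demand = cur * load   # work being done, expressed in kHz of CPU
    ok = [f for f in freqs if f <= cur and demand / f >= target_load]
    return max(ok) if ok else min(freqs)


FREQS = [1667000, 1333000, 1000000]   # kHz, the author's CPU

print(conservative_step_down(FREQS, 1667000))    # 1333000
print(ondemand_step_down(FREQS, 1667000, 0.5))   # 1000000
```

At 50% load on the 1.67 GHz setting, the conservative strategy moves only to 1.33 GHz, while this reading of the ondemand strategy goes straight to 1 GHz, where the same work would keep the CPU roughly 83% busy.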

Implementation of the CPU driver supporting Intel Enhanced SpeedStep technology

As the description of the cpufreq software structure pointed out, policy and mechanism are separated: the governor decides the appropriate CPU operating frequency, and once it decides to change the frequency in response to the load, the underlying CPU-specific driver carries out the change. Here we describe how the CPU driver for Intel's latest Enhanced SpeedStep technology works, focusing on how it sets the CPU frequency. Processors that support Intel Enhanced SpeedStep offer a very simple programming interface. The operating frequency is set through an MSR named IA32_PERF_CTL, and another MSR named IA32_PERF_STATUS can be used to check the current operating frequency. To change the operating frequency, you only need to write the appropriate value to IA32_PERF_CTL as described in the Intel developer's manual.
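The kernel driver accesses these registers with privileged instructions, but for experimentation an MSR can also be read from user space through the Linux msr character device (/dev/cpu/N/msr), which requires the msr kernel module and root privileges. The sketch below only reads; the helper name read_msr and its path parameter are made up for this example, the register numbers are those given in Intel's manuals, and decoding the returned value is model-specific and not attempted here. Writing IA32_PERF_CTL is deliberately left out, since the value to write depends on the processor model.

```python
import os
import struct

IA32_PERF_STATUS = 0x198   # current performance state (read)
IA32_PERF_CTL = 0x199      # requested performance state (not written here)


def read_msr(msr: int, cpu: int = 0, path: str = "/dev/cpu/{cpu}/msr") -> int:
    """Read one 64-bit MSR through the Linux msr character device.

    The device is addressed by reading eight bytes at the offset equal
    to the MSR number.
    """
    fd = os.open(path.format(cpu=cpu), os.O_RDONLY)
    try:
        data = os.pread(fd, 8, msr)
    finally:
        os.close(fd)
    if len(data) != 8:
        raise OSError(f"short read from MSR {msr:#x}")
    return struct.unpack("<Q", data)[0]
```

A call such as read_msr(IA32_PERF_STATUS) returns the raw 64-bit register value; the Intel manual for the specific processor describes how to interpret it.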

Summary and future development direction

This article has described how frequency scaling appeared in CPU hardware and how the problems with its initial support in the Linux kernel led to the birth of cpufreq, a new kernel subsystem. The three governors that cpufreq provided early on met user needs to some extent and offered some flexibility, but because of the limitations of CPU hardware at the time, much remained unsatisfactory. As frequency scaling hardware developed, and especially with the arrival of Intel Enhanced SpeedStep technology, the original technical barriers fell, and the cpufreq subsystem gained a brand-new, more complete, and more efficient governor: ondemand.

It is easy to see that the cpufreq subsystem was born out of the emergence of hardware frequency scaling and has grown along with it. This also indicates its future direction: to keep pace with developments in CPU frequency scaling hardware. To close, here is a brief look at a new hardware feature in Intel's latest CPUs. As described above, Intel's latest Enhanced SpeedStep technology provides two MSRs, IA32_MPERF and IA32_APERF, to give direct hardware support to the governors that cpufreq uses to sample system load dynamically. Of the two, IA32_APERF increments according to the actual operating frequency of the CPU while the CPU is in the ACPI C0 state. Intel's latest processors go a step further. Because the pipeline may stall during operation while waiting for memory or I/O, simply incrementing IA32_APERF on every clock cycle at the actual operating frequency does not fully reflect the CPU load. In Intel's latest processors, IA32_APERF stops accumulating while the pipeline is stalled and increments only during periods that are genuinely useful to the user, so the hardware can provide the governors with more accurate load statistics.
