Linux Driver Power Management: Linux Power Management Architecture (3)

Device Power Management

Copyright (c) Rafael J. Wysocki, Novell Inc.

Copyright (c) Alan Stern


This article was translated by Droidphone on 2011-08-05.


Most of the Linux source code consists of device drivers, so most of the power management (PM) code also lives in drivers. Many drivers do only a small amount of PM work; others, especially on battery-powered platforms such as mobile phones, do a great deal.

This document gives an overview of how drivers interact with the system-wide power management goals, with emphasis on the models and interfaces shared by everything that hooks into the driver model core. Anyone working on driver-related code is encouraged to read it as background.

Two models of device power management


A driver can use one or both of the following models to put a device into a low-power state:

1. System Sleep Model:

Drivers can enter low-power states as part of a system-wide low-power state such as "suspend" (also known as "suspend-to-RAM") or, mostly on systems with disks, "hibernation" (also known as "suspend-to-disk").

In this case the device, bus, and class drivers cooperate through a variety of role-specific suspend and resume methods to cleanly power down the hardware and software subsystems, and then reactivate them without losing data.

Some drivers can manage hardware wake-up events, which make the system leave the low-power state. This feature can be turned on and off through the corresponding /sys/devices/.../power/wakeup file (for Ethernet drivers, the ioctl interface used by ethtool serves the same purpose). Enabling it may cost some extra power, but it gives the whole system more opportunities to enter a low-power state.

2. Runtime Power Management Model:

This model allows devices to be put into low-power states while the system is running, in principle independently of other power management activity. However, devices are usually not independent of one another (for example, a parent device cannot be suspended unless all of its child devices have been suspended). In addition, depending on the bus type, some bus-specific operations may be needed to achieve this. A device that has entered a low-power state at run time may also need special handling during system-wide power transitions (suspend or hibernation).

For these reasons, runtime power management involves not only the device driver itself but also the corresponding subsystem (bus type, device type, or device class) driver and the PM core. As with system sleep, they must cooperate by implementing a variety of role-specific suspend and resume methods so that the hardware is cleanly powered down and later reactivated without losing data or service.

There is not much to say about the low-power states themselves, because they are usually system-specific and often device-specific. If enough devices have been put into low-power states while the system is running, the effect can be very similar to entering a system-wide low-power state. Drivers can exploit this: several of them using runtime power management together may bring the system into a state where even deeper power saving is available.

Most suspended devices have quiesced all I/O: no more DMA or IRQs (except for wake-up events), no more data reads or writes, and no more requests accepted from upstream drivers. A given bus or platform may have different requirements, though.

Examples of hardware wake-up events include alarms from an RTC, the arrival of network packets, keyboard or mouse activity, and the insertion or removal of media (PCMCIA, MMC/SD, USB, and so on).

Interfaces for Entering System Sleep States


The kernel provides programming interfaces for subsystems (bus type, device type, device class) and drivers so that they can take part in the power management of the devices they care about. These interfaces cover both system sleep and runtime power management.

Device Power Management operations


Device power management operations for subsystems and drivers are defined in the dev_pm_ops structure:

    struct dev_pm_ops {
        int (*prepare)(struct device *dev);
        void (*complete)(struct device *dev);
        int (*suspend)(struct device *dev);
        int (*resume)(struct device *dev);
        int (*freeze)(struct device *dev);
        int (*thaw)(struct device *dev);
        int (*poweroff)(struct device *dev);
        int (*restore)(struct device *dev);
        int (*suspend_noirq)(struct device *dev);
        int (*resume_noirq)(struct device *dev);
        int (*freeze_noirq)(struct device *dev);
        int (*thaw_noirq)(struct device *dev);
        int (*poweroff_noirq)(struct device *dev);
        int (*restore_noirq)(struct device *dev);
        int (*runtime_suspend)(struct device *dev);
        int (*runtime_resume)(struct device *dev);
        int (*runtime_idle)(struct device *dev);
    };


This structure is defined in include/linux/pm.h, and the role of each method is described in the following sections. For now, just remember that the last three methods are dedicated to runtime PM, while the others are used for system-wide power transitions.

Some subsystems still have a so-called "legacy" power management interface that does not use the dev_pm_ops structure and applies only to system sleep. This article does not cover it; refer to the kernel source code if you need the details.
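As a rough illustration of how a driver might fill in only part of such a callback table, here is a minimal user-space sketch in C. It is not kernel code: struct device and struct dev_pm_ops_model are simplified stand-ins for the real definitions, and demo_suspend(), demo_resume(), and pm_call() are hypothetical names introduced only for this example. Treating a missing callback as success is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-ins; NOT the kernel definitions from include/linux/pm.h. */
struct device {
    const char *name;
    int powered;            /* 1 = full power, 0 = low power */
};

struct dev_pm_ops_model {
    int (*suspend)(struct device *dev);
    int (*resume)(struct device *dev);
    int (*runtime_suspend)(struct device *dev);
    int (*runtime_resume)(struct device *dev);
};

/* Hypothetical driver callbacks for the system sleep path. */
static int demo_suspend(struct device *dev) { dev->powered = 0; return 0; }
static int demo_resume(struct device *dev)  { dev->powered = 1; return 0; }

/* The driver fills in only the callbacks it needs; the rest stay NULL. */
static const struct dev_pm_ops_model demo_pm_ops = {
    .suspend = demo_suspend,
    .resume  = demo_resume,
};

/* Hypothetical helper: run a callback if present, treat a missing one as success. */
static int pm_call(int (*cb)(struct device *), struct device *dev)
{
    assert(dev != NULL);
    return cb ? cb(dev) : 0;
}
```

Calling pm_call(demo_pm_ops.suspend, &dev) powers the model device down, while pm_call(demo_pm_ops.runtime_suspend, &dev) does nothing because this driver left the runtime callbacks unset.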

Subsystem-Level Methods


The key methods for suspending and resuming devices live in the dev_pm_ops structure pointed to by the pm member of struct bus_type, struct device_type, and struct class. They are mostly of interest to the maintainers of specific bus architectures (such as PCI or USB) or of device type and device class drivers.

Bus drivers implement these methods as appropriate for the hardware and for the drivers that use it, since PCI and USB, for example, work differently. Only a handful of people write subsystem-level drivers; most device drivers are built on top of bus-specific framework code.

These calls are described in more detail later. They are invoked in phases for every device, one device at a time, respecting the parent-child ordering of the device model tree.

/sys/devices/.../power/wakeup files


All devices in the device model have two flags that control wake-up events (events that can make a device or the system leave a low-power state). The two flags are initialized by the bus or device driver with device_set_wakeup_capable() and device_set_wakeup_enable(), which are defined in include/linux/pm_wakeup.h.

The "can_wakeup" flag records whether the device (or driver) physically supports wake-up events; device_set_wakeup_capable() sets it. The "should_wakeup" flag controls whether the device should try to use its wake-up mechanism; device_set_wakeup_enable() sets it. Most drivers do not modify these values on their own. For most devices should_wakeup is initially false; exceptions include power buttons, keyboards, and Ethernet adapters whose Wake-on-LAN feature has been configured with ethtool.

Whether a device is capable of issuing wake-up events is a hardware matter, and the kernel is only responsible for keeping track of it. Whether a wake-up capable device should issue wake-up events, on the other hand, is a policy decision, and user space manages it through the sysfs attribute file power/wakeup. User space can write "enabled" or "disabled" to set or clear the should_wakeup flag. When the file is read, the corresponding string is returned if can_wakeup is true; if can_wakeup is false, an empty string is returned to indicate that the device does not support wake-up events. (Note that a write to the file still affects the should_wakeup flag even when an empty string is returned.)

device_may_wakeup() returns true only if both flags are set. When the system is going to sleep, a driver should call this function before putting its device into a low-power state to decide whether to enable the wake-up mechanism. For runtime power management, however, wake-up events should be enabled whenever the device and driver both support them, regardless of the should_wakeup flag.
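The interplay of the two flags and the power/wakeup file can be sketched as a small user-space model in C. This is only an illustration of the rules described above, not the kernel implementation: struct wakeup_model and the model_*() functions are hypothetical names, and the store function assumes the input arrives without a trailing newline.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* User-space model of the two wakeup flags; not the kernel data structures. */
struct wakeup_model {
    bool can_wakeup;     /* hardware is physically able to signal wake-up */
    bool should_wakeup;  /* policy: user space wants wake-up enabled */
};

/* Mirrors the rule given for device_may_wakeup(): both flags must be set. */
static bool model_may_wakeup(const struct wakeup_model *w)
{
    assert(w != NULL);
    return w->can_wakeup && w->should_wakeup;
}

/* Models a read of power/wakeup: empty string when wake-up is unsupported. */
static const char *model_wakeup_show(const struct wakeup_model *w)
{
    if (!w->can_wakeup)
        return "";
    return w->should_wakeup ? "enabled" : "disabled";
}

/* Models a write to power/wakeup: the flag changes even if can_wakeup is false. */
static int model_wakeup_store(struct wakeup_model *w, const char *buf)
{
    if (strcmp(buf, "enabled") == 0)
        w->should_wakeup = true;
    else if (strcmp(buf, "disabled") == 0)
        w->should_wakeup = false;
    else
        return -1;  /* unrecognised input */
    return 0;
}
```

In this model, writing "enabled" to a device that cannot wake the system still sets should_wakeup, but model_may_wakeup() keeps returning false until can_wakeup is also true.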

/sys/devices/.../power/control files


Each device in the device model has a flag that controls whether it is subject to runtime power management. This flag, called runtime_auto, is initialized by the bus type (or other subsystem) code with pm_runtime_allow() or pm_runtime_forbid(). By default, runtime power management is allowed.

User space can change the flag by writing "on" or "auto" to the device's sysfs file power/control. Writing "auto" is equivalent to calling pm_runtime_allow(): the flag is set and the driver is allowed to runtime power-manage the device. Writing "on" is equivalent to calling pm_runtime_forbid(): the flag is cleared, the device is brought back to the full-power state if it was in a low-power state, and runtime power management of the device is prevented. User space can also read the file to check the current value of runtime_auto.

The runtime_auto flag has no effect on system-wide power transitions. In particular, even if the runtime_auto flag is cleared, the device is still put into a low-power state when the system goes to sleep.
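The behaviour of the power/control file can be modelled in a few lines of user-space C. This is a sketch under stated assumptions rather than the kernel code: struct rpm_model and every model_*() function are hypothetical, the low_power field stands in for the real device state, and input strings are assumed to carry no trailing newline.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* User-space model only; the field names follow the text, not kernel structs. */
struct rpm_model {
    bool runtime_auto;  /* true: runtime PM allowed (the default) */
    bool low_power;     /* true: device is currently in a low-power state */
};

static void model_init(struct rpm_model *m)
{
    m->runtime_auto = true;
    m->low_power = false;
}

/* Models a write to power/control. Returns 0 on success, -1 on bad input. */
static int model_control_store(struct rpm_model *m, const char *buf)
{
    assert(m != NULL && buf != NULL);
    if (strcmp(buf, "auto") == 0) {
        m->runtime_auto = true;     /* plays the role of pm_runtime_allow() */
    } else if (strcmp(buf, "on") == 0) {
        m->runtime_auto = false;    /* plays the role of pm_runtime_forbid() */
        m->low_power = false;       /* device returns to full power */
    } else {
        return -1;
    }
    return 0;
}

/* Models a read of power/control. */
static const char *model_control_show(const struct rpm_model *m)
{
    return m->runtime_auto ? "auto" : "on";
}

/* A runtime suspend request only succeeds while runtime PM is allowed. */
static bool model_runtime_suspend(struct rpm_model *m)
{
    if (!m->runtime_auto)
        return false;
    m->low_power = true;
    return true;
}

/* System sleep ignores runtime_auto and always powers the device down. */
static void model_system_suspend(struct rpm_model *m)
{
    m->low_power = true;
}
```

The last function captures the point of the preceding paragraph: forbidding runtime PM with "on" blocks model_runtime_suspend(), but model_system_suspend() still puts the device into a low-power state.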

For more information about the runtime power management framework, see Documentation/power/runtime_pm.txt.

Calling the driver to enter or exit the system sleep state


When the system goes to sleep, each device driver is asked to suspend its device by putting it into a state compatible with the target system state. This is usually some version of "off", but the details are system-specific. In addition, wake-up capable devices typically stay partly functional so that they can wake the system when appropriate.

When the system leaves the low-power state, the device driver is asked to resume the device by returning it to the full-power state. Suspend and resume operations always come in pairs, and both are divided into multiple phases.

For a relatively simple driver, suspend may quiesce the device using class code, and the suspend_noirq phase may then turn the hardware as "off" as possible. On wake-up, the matching resume calls reinitialize the hardware and then reactivate its I/O.

More power-aware drivers may additionally prepare their devices so that they can generate system wake-up events later.

Guaranteed Callback Ordering


Because devices are connected through bridges and other parent devices, suspend and resume must run in a well-defined order so that each device remains accessible while it is handled: suspend walks the device tree bottom-up, and resume walks it top-down.

The ordering of the device tree is determined by the order in which devices are registered: a child device can never be registered, probed, or resumed before its parent, and it cannot be removed or suspended after its parent.

The policy is that the device tree should match the bus topology of the hardware. In particular, this means that registering a child device may fail when its parent is being suspended (for example, when the PM core has already selected it as the next device to suspend) or has already been suspended. Device drivers must handle this situation correctly.
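The ordering guarantee can be pictured with a small user-space model in C. It assumes, for illustration only, that devices are kept in a flat list in registration order (so every parent precedes its children); struct dev_list_model and the model_*() functions are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_DEVS 8

/* Devices in registration order: a parent always precedes its children. */
struct dev_list_model {
    const char *names[MODEL_MAX_DEVS];
    int count;
};

/* Append a device; returns -1 when the list is full. */
static int model_register(struct dev_list_model *l, const char *name)
{
    assert(l != NULL && name != NULL);
    if (l->count >= MODEL_MAX_DEVS)
        return -1;
    l->names[l->count++] = name;
    return 0;
}

/* Suspend order: reverse registration order, so children go before parents. */
static int model_suspend_order(const struct dev_list_model *l, const char **out)
{
    int n = 0;
    for (int i = l->count - 1; i >= 0; i--)
        out[n++] = l->names[i];
    return n;
}

/* Resume order: registration order, so parents come back before children. */
static int model_resume_order(const struct dev_list_model *l, const char **out)
{
    int n = 0;
    for (int i = 0; i < l->count; i++)
        out[n++] = l->names[i];
    return n;
}
```

Registering a bridge, then a host controller, then a disk yields the suspend order disk, host controller, bridge, and the opposite order on resume.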

System Power Management Phases


Suspending and resuming the system is done in several phases. Standby and sleep (suspend-to-RAM) use a different set of phases than hibernation (suspend-to-disk). In each phase, the callbacks for every device are invoked before the next phase begins. Not all buses and device classes support all of these callbacks, and not all drivers use them. All phases run after tasks have been frozen and before they are thawed. In addition, the *_noirq phases run while IRQ handlers are disabled (except for those marked with the IRQ_WAKEUP flag).

Most phases use the bus, type, and class callbacks (that is, the methods defined in dev->bus->pm, dev->type->pm, and dev->class->pm). The prepare and complete phases are exceptions: they use only the bus callbacks. When more than one callback is executed in a phase, the order is <class, type, bus> during suspend and <bus, type, class> during resume. For example, the following calls are made during the suspend phase:

    dev->class->pm->suspend(dev);
    dev->type->pm->suspend(dev);
    dev->bus->pm->suspend(dev);

Conversely, during the resume phase the PM core invokes the following callbacks on the current device before moving on to the next one:

    dev->bus->pm->resume(dev);
    dev->type->pm->resume(dev);
    dev->class->pm->resume(dev);

These callbacks may in turn invoke device-specific or driver-specific methods through dev->driver->pm, but they are not required to.
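The class, type, bus ordering for a single device can be sketched as a user-space model in C. Everything here is hypothetical scaffolding (struct pm_ops_model, the one-letter log, the model_*_one() helpers); skipping missing tables and stopping at the first error are assumptions of this sketch, not statements about the PM core.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct device;

struct pm_ops_model {
    int (*suspend)(struct device *dev);
    int (*resume)(struct device *dev);
};

/* Simplified stand-in for struct device with one PM table per layer. */
struct device {
    const struct pm_ops_model *class_pm;
    const struct pm_ops_model *type_pm;
    const struct pm_ops_model *bus_pm;
    char log[16];   /* one letter per callback, for demonstration */
};

static void log_step(struct device *dev, char c)
{
    size_t len = strlen(dev->log);
    assert(len + 1 < sizeof(dev->log));
    dev->log[len] = c;
    dev->log[len + 1] = '\0';
}

static int class_suspend(struct device *d) { log_step(d, 'C'); return 0; }
static int type_suspend(struct device *d)  { log_step(d, 'T'); return 0; }
static int bus_suspend(struct device *d)   { log_step(d, 'B'); return 0; }
static int class_resume(struct device *d)  { log_step(d, 'c'); return 0; }
static int type_resume(struct device *d)   { log_step(d, 't'); return 0; }
static int bus_resume(struct device *d)    { log_step(d, 'b'); return 0; }

static const struct pm_ops_model class_ops = { class_suspend, class_resume };
static const struct pm_ops_model type_ops  = { type_suspend, type_resume };
static const struct pm_ops_model bus_ops   = { bus_suspend, bus_resume };

/* Suspend one device: class first, then type, then bus. */
static int model_suspend_one(struct device *dev)
{
    const struct pm_ops_model *order[] = { dev->class_pm, dev->type_pm, dev->bus_pm };
    for (size_t i = 0; i < 3; i++) {
        if (order[i] && order[i]->suspend) {
            int err = order[i]->suspend(dev);
            if (err)
                return err;
        }
    }
    return 0;
}

/* Resume one device: the opposite order, bus first, then type, then class. */
static int model_resume_one(struct device *dev)
{
    const struct pm_ops_model *order[] = { dev->bus_pm, dev->type_pm, dev->class_pm };
    for (size_t i = 0; i < 3; i++) {
        if (order[i] && order[i]->resume) {
            int err = order[i]->resume(dev);
            if (err)
                return err;
        }
    }
    return 0;
}
```

For a device with all three tables set, a suspend followed by a resume leaves the letters C, T, B, b, t, c in the log, which matches the two call sequences listed above.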

Entering System Suspend


When the system enters the standby or sleep state, it goes through the following phases:


1. The prepare phase is meant to prevent races by blocking the registration of new devices: if child devices could still be registered at this point, the PM core could never be sure that all children of a device had been suspended. (Devices may, by contrast, be unregistered at any time.) Unlike the other suspend phases, the prepare phase walks the device tree top-down.

The prepare phase uses only the bus callback. After the callback returns, no new children may be registered below the device. The method may also prepare the device or driver for the upcoming system power transition, but it should not put the device into a low-power state.

2. The suspend phase is implemented by the suspend callbacks, which quiesce the device so that it stops performing I/O. Depending on the bus type the device is on, they may also save the device registers, put the device into the appropriate low-power state, and enable wake-up events.

3. The suspend_noirq phase takes place after IRQ handlers have been disabled, which means that the driver's interrupt handler will not be called while the callback runs. The method should save any device registers that were not saved in the previous phase and finally put the device into the appropriate low-power state.

Most subsystems and drivers do not need to implement this callback. However, bus types that allow devices to share interrupt vectors, such as PCI, generally need it; otherwise a driver could hit an error during the suspend phase when it fields a shared interrupt raised by another device after its own device has already been put into a low-power state.

By the end of these phases, a driver must have stopped all I/O transactions (DMA, IRQs), saved enough state to reinitialize the device or restore its previous state (as the hardware requires), and put the device into a low-power state. On many platforms this gates off one or more clocks, and sometimes also turns off power supplies or lowers the voltage. (Drivers that support runtime PM may already have performed some or all of these steps.)

If device_may_wakeup(dev) returns true, the device should be prepared to generate hardware wake-up signals that trigger a system wake-up event while the system is asleep. For example, enable_irq_wake() can arm a GPIO connected to a switch or other external hardware, and pci_enable_wake() does something similar for the PCI PME signal.

If any of these callbacks returns an error, the system does not enter the low-power state. Instead, the PM core unwinds its actions by resuming all the devices that have already been suspended.
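The error handling described in the last paragraph can be illustrated with a short user-space model in C. It is a sketch only: struct sleep_dev_model, model_resume_range(), and model_suspend_all() are hypothetical, and the devices are assumed to sit in an array in registration order (parents first).

```c
#include <assert.h>
#include <stddef.h>

struct sleep_dev_model {
    int fail_suspend;   /* non-zero: the suspend callback reports an error */
    int suspended;      /* 1 while the device is in its low-power state */
    int resume_calls;   /* how many times the device has been resumed */
};

/* Resume devs[from..to-1] in registration order (parents first). */
static void model_resume_range(struct sleep_dev_model *devs, int from, int to)
{
    for (int i = from; i < to; i++) {
        devs[i].suspended = 0;
        devs[i].resume_calls++;
    }
}

/*
 * Suspend walks the array backwards (children first). On the first error,
 * every device suspended so far is resumed and the error is returned.
 */
static int model_suspend_all(struct sleep_dev_model *devs, int n)
{
    assert(devs != NULL && n >= 0);
    for (int i = n - 1; i >= 0; i--) {
        if (devs[i].fail_suspend) {
            model_resume_range(devs, i + 1, n);
            return -1;
        }
        devs[i].suspended = 1;
    }
    return 0;
}
```

If the middle device of three refuses to suspend, the last device (already suspended) is resumed once, the first device is never touched, and the call returns an error so that the system stays awake.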

Leaving System Suspend


When the system leaves the standby or sleep state, it goes through the following phases:


1. The resume_noirq callback methods should perform any actions needed before the driver's interrupt handlers are invoked. This generally means undoing the actions of the suspend_noirq phase. If the bus type allows devices to share interrupt vectors, as PCI does, the method should bring the device and its driver into a state in which the driver can recognize whether the device is the source of an incoming interrupt and, if so, handle it correctly.

For example, the PCI bus type's bus->pm->resume_noirq() puts the device into the full-power state (D0 in PCI terminology) and restores the device's standard configuration registers. It then calls the device driver's driver->pm->resume_noirq() method to perform device-specific actions.

2. The resume callback methods should bring the device back to its working state so that it can perform normal I/O. This generally means undoing the actions of the suspend phase.

3. The complete phase uses only bus callbacks. The method should undo the actions taken during the prepare phase. Note, however, that new children may be registered below the device as soon as the resume callbacks return; this does not have to wait until the complete phase finishes.

By the end of these phases, drivers should be as functional as they were before suspend: I/O can be performed using DMA and IRQs, and the relevant clocks are running. Even if the device was already in a low-power state because of runtime PM before the system went to sleep, it should be back in the full-power state afterwards. There are several reasons why this is the best approach; they are discussed in detail in Documentation/power/runtime_pm.txt.

Beyond this, the details are again platform-specific. For example, some systems support multiple "run" states, and the mode in effect at the end of resume may differ from the one that preceded suspend. This can mean that certain clocks or power supplies have changed, which can easily affect how a driver works.

Drivers need to be able to handle hardware that has been reset since the suspend callbacks were called, for example by reinitializing it completely. This may be the hardest part, and the relevant details are often covered only by NDA-protected documents and chip errata. The simplest case is that the hardware state has not changed since suspend, but that cannot be guaranteed (in fact, it is usually not the case).

Wherever it is physically possible, drivers must also be prepared to notice that the device was removed while the system was powered down. PCMCIA, MMC, USB, FireWire, SCSI, and even IDE are examples of buses on which Linux systems see such removals. How drivers notice and handle these removals is bus-specific and often involves a separate thread.

Enter hibernation



Exit hibernation



System Devices


System Devices (Sysdevs) follow a slightly different API, which can be found in the following files:



System devices are suspended with interrupts disabled and after all other devices have been suspended. On resume, they are resumed before any other devices, again with interrupts disabled. These actions take place in a special "sysdev_driver" phase, which applies only to system devices.

Thus, after the suspend_noirq (or freeze_noirq or poweroff_noirq) phase, once the non-boot CPUs have been taken offline and IRQs have been disabled on the remaining CPU, the sysdev_driver.suspend phase is carried out and the system enters the sleep state (or, for hibernation, the system image is created). The order during resume is: the sysdev_driver.resume phase is carried out, IRQs are enabled on the boot CPU, the non-boot CPUs are brought back online, and then the resume_noirq phase begins.

Code that actually enters and exits the system-wide low-power state sometimes involves hardware details known only to the boot firmware (BIOS or bootloader), and it may leave a CPU running software (from RAM or flash) that monitors the system and manages the wake-up sequence.

Device Low-Power (Suspend) States


Device low-power states are not standardized. One device may only handle "on" and "off", while another may support a dozen different versions of "on" (how many engines are active?), plus a state from which it gets back to "on" faster than from a full "off".

Some buses define rules about what the different suspend states mean. PCI gives one example: after the suspend sequence completes, a non-legacy PCI device may not perform DMA or issue IRQs, and any wake-up events it generates are signalled through the PME# bus signal. PCI also defines several standard device states, some of which are optional.

In contrast, highly integrated SoC processors often use IRQs as the wake-up event source (so drivers call enable_irq_wake()) and may be able to treat DMA completion as a wake-up event (sometimes DMA can stay active too, with only the CPU and some peripherals going to sleep).

Some details here may be platform-specific. In certain sleep states a system may keep some devices fully active. For example, while the system sleeps lightly, an LCD display may still be refreshed using DMA, and its frame buffer may even be updated by a DSP or another non-Linux CPU while the CPU running Linux stays idle.

Moreover, the specific actions taken may depend on the target system state. One target state may allow a given device to stay largely operational, while another may require a hard shutdown followed by reinitialization on resume. Two different target systems may also use the same device in different ways: the LCD mentioned above may stay active in one product's "standby" state, while another product using the same SoC works differently.

Power Management Notifiers


Some operations cannot be carried out in the power management callbacks discussed above, because the callbacks occur too late or too early. To handle these cases, subsystems and drivers can register power management notifiers, which are called before tasks are frozen and after they have been thawed. In general, PM notifiers are suitable for actions that either need user space to be available or at least must not interfere with user space.

Refer to Documentation/power/notifiers.txt for details.

Runtime Power Management


Many devices can be powered down dynamically while the system is running. This is especially useful for devices that are not in use, and it can save a significant amount of power on a running system. Such devices typically support a range of runtime power states, with names such as "off", "sleep", "idle", and "active". These states are sometimes constrained by the bus the device uses, and they usually include the hardware states that are also used for system sleep.

A system-wide power transition can start while some devices are in low-power states because of runtime PM. The system sleep PM callbacks should recognize this situation and react to it appropriately, but the necessary actions are subsystem-specific.

In some cases the decision is made at the subsystem level, and in others it is left to the device driver. During a system-wide power transition it may be desirable to leave an already suspended device in that state, but in other cases the device must be brought back to the full-power state temporarily, for example so that its ability to wake the system can be disabled. All of this depends on the hardware and on the design of the subsystem and driver in question.

When the system resumes from sleep, it is best to bring devices back to the full-power state, as explained in Documentation/power/runtime_pm.txt. That document discusses these issues in more detail and also describes the general framework of runtime power management.
