Changes in the hrtimer and kernel clock/timer subsystems

Tags: posix

The ARM architecture in kernel 2.6.22 added support for dynticks and the clocksource/clockevent framework, which brought some notable changes to the kernel clock and timer subsystems.
In general, the soft-timer layer (timer wheel/hrtimer) is built on top of a hardware timer (the source of the clock interrupt) and its clock-source driver (e.g. the GPT on an SoC). So this article starts from the clock layer, then covers the soft timers and kernel timekeeping, and finally looks at some applications.

Clock Source
A clock source defines the basic attributes and behaviors of a clock device. Such devices generally provide counting, timing, and interrupt capabilities (e.g. a GPT). The structure is defined as follows:


struct clocksource {
	char *name;
	struct list_head list;
	int rating;
	cycle_t (*read)(void);
	cycle_t mask;
	u32 mult;	/* cycle -> xtime interval; e.g. it may take two
			 * clock cycles to make up one xtime interval */
	u32 shift;
	unsigned long flags;
	cycle_t (*vread)(void);
	void (*resume)(void);

	/* timekeeping-specific data */
	cycle_t cycle_interval;	/* cycles counted per OS tick (HZ) */
	u64 xtime_interval;	/* xtime_interval = cycle_interval * mult */
	cycle_t cycle_last ____cacheline_aligned_in_smp; /* cycle count at the last tick */
	u64 xtime_nsec;		/* shifted nsec remainder not yet folded into
				 * xtime.tv_nsec; current shifted nsec offset =
				 * xtime_nsec + (xtime.tv_nsec << shift) */
	s64 error;
};

The most important members are read(), cycle_last, and cycle_interval: respectively, the interface for reading the current count value from the clock device's counter register, the count value saved at the previous tick, and the number of counts per tick period. All values in this structure, whether cycle_t or u64 (cycle_t is actually u64), are raw counter values (cycles), not nsec, sec, or jiffies. read() is the interface through which the entire kernel reads a precise, monotonic time count; the kernel uses it to derive other times, such as jiffies and xtime.
Before clocksource was introduced, each arch in the kernel had its own clock-device management, basically hidden in the machine-specific layer and hard for the kernel core and drivers to access. clocksource solves this problem and exports the following interfaces:
1) clocksource_register(): registers a clocksource
2) clocksource_get_next(): gets the current clocksource device
3) clocksource_read(): reads the clock; it ends up calling clocksource->read()
When a driver needs relatively high time precision, it can use these interfaces to read the clock device directly.
Of course, the ticker clock-interrupt source also exists in clocksource form.

Clock event
Clock events are mainly used to dispatch clock events and to set the trigger condition for the next one. Before clock events existed, clock interrupts were simply generated periodically: the well-known jiffies and HZ.
The main structure is clock_event_device:


struct clock_event_device {
	const char *name;
	unsigned int features;
	unsigned long max_delta_ns;
	unsigned long min_delta_ns;
	unsigned long mult;
	int shift;
	int rating;
	int irq;
	cpumask_t cpumask;
	int (*set_next_event)(unsigned long evt,
			      struct clock_event_device *);
	void (*set_mode)(enum clock_event_mode mode,
			 struct clock_event_device *);
	void (*event_handler)(struct clock_event_device *);
	void (*broadcast)(cpumask_t mask);
	struct list_head list;
	enum clock_event_mode mode;
	ktime_t next_event;
};

The most important members are set_next_event() and event_handler(). The former sets the trigger condition for the next clock event, generally by reprogramming the timer in the clock device. The latter is the event-handling function, called from the clock-interrupt ISR. If this clock serves as the ticker clock, the handler does essentially what the clock ISR of older kernels did, similar to timer_tick(). The event handler can be replaced dynamically at runtime, which gives the kernel an opportunity to change how clock interrupts are processed as a whole, and gives the highres tick and dynamic tick a chance to hook in dynamically. Currently there are three clock-interrupt processing modes in the kernel: periodic, highres, and dynamic tick; they are introduced later.

Hrtimer & timer Wheel
First, the timer wheel: it is the jiffies-based timer mechanism long used by the kernel, with interfaces such as init_timer(), mod_timer(), del_timer(), etc.
The arrival of hrtimer does not discard the old timer wheel mechanism (nor is it likely to :)). hrtimer serves as the precise timer in the kernel, while the timer wheel is mainly used for timeouts; the division of labor is clear. hrtimers are organized in a red-black tree, while the timer wheel uses linked lists and buckets.
hrtimer raises the precision from the jiffies of the timer wheel to nanoseconds. It is mainly used to provide the nanosleep, POSIX timer, and itimer interfaces to the application layer; of course, drivers and other subsystems also need high-resolution timers.
In the old kernel, the HZ ticker (interrupt) was generated periodically; it is now replaced by an interrupt at the time point of the next expiring hrtimer. That is, the clock interrupt is no longer periodic but timer-driven (clockevent's set_next_event interface is used to program the next event interrupt). As long as no hrtimer is queued, there is no interrupt. However, to ensure that system time is updated (process time accounting, jiffies maintenance), an hrtimer called tick_sched_timer is queued every tick_period (NSEC_PER_SEC/HZ; emphasizing once again that hrtimer precision is nsec).
Next, let's compare clock-interrupt processing in the kernel before and after hrtimer was introduced. (The analysis is based on the ARM arch sources.)
1) No hrtimer
In the boot path, time_init(), called after setup_arch(), initializes the timer of the corresponding machine; the timer initialization function lives in each machine's architecture code. After the hardware clock is initialized, the interrupt service routine is registered and the clock interrupt is enabled. The ISR clears the interrupt and calls timer_tick(), which executes:
1. profile_tick(); /* kernel profiling, not covered here */
2. do_timer(1); /* update jiffies */
3. update_process_times(); /* account the time consumed by processes, raise the timer softirq (timer wheel), and recalculate the scheduling time slice */
Finally, the ISR programs the timer to interrupt at the next tick.

Such a framework makes it difficult for a high-res timer to join: all interrupt-handling code is written in architecture code, and code reuse is very low, since most arches end up writing essentially the same interrupt handler.
2) hrtimer
With clockevent/clocksource introduced into the kernel, the clock interrupt is abstracted as an event and handed to an event handler; the handler can be replaced at runtime to change how the clock interrupt is processed. The clock-interrupt ISR then looks like this:


static irqreturn_t timer_interrupt(int irq, void *dev_id)
{
	/* clear timer interrupt flag */
	...
	/* call clock event handler */
	arch_clockevent.event_handler(&arch_clockevent);
	...
	return IRQ_HANDLED;
}

When a clockevent device is registered, event_handler is set to tick_handle_periodic() by default. So when the kernel comes up, clock handling is still periodic and the ticker interrupts periodically. tick_handle_periodic() does roughly what timer_tick() did, then calls clockevents_program_event() => arch_clockevent.set_next_event() to program the timer for the next period. tick-common.c reimplements the original kernel clock handling under the clockevent framework: this is the periodic tick mechanism.

The hres tick mechanism replaces the periodic tick in the first timer softirq, provided certain conditions are met: for example, hres is not disabled on the command line (highres=off), and the clocksource/clockevent supports hres and oneshot mode. The switchover here is rather ugly, as the author's comments also mention: every time the timer softirq runs, hrtimer_run_queues() is called to check whether hres can be activated. If the clocksource/clockevent conditions could instead be checked once in timer_init() and the switch to hres made directly, that would be cleaner; I do not know whether there are restrictions preventing it. The timer softirq code is as follows:


static void run_timer_softirq(struct softirq_action *h)
{
	tvec_base_t *base = __get_cpu_var(tvec_bases);

	hrtimer_run_queues();	/* switch to hres or nohz */

	if (time_after_eq(jiffies, base->timer_jiffies))
		__run_timers(base);	/* timer wheel */
}

The switching process is relatively simple: replace the current clockevent handler with hrtimer_interrupt(), queue an hrtimer (tick_sched_timer) that expires at the next tick_period, and retrigger the next event.
hrtimer_interrupt() removes expired hrtimers from the red-black tree and places them on the corresponding clock_base->cpu_base->cb_pending list; these expired timers are executed in the hrtimer softirq. It retriggers the next event based on the earliest timer still pending, then raises the hrtimer softirq, which runs the expired timer functions on cb_pending. The tick_sched_timer hrtimer expires every tick_period; its handler does roughly what timer_tick() did, except that it calls hrtimer_forward() to requeue itself for the next period, ensuring that the kernel's internal time statistics are updated correctly every tick_period.

Timekeeping

The timekeeping subsystem updates xtime, adjusts for accumulated error, and provides the get/settimeofday interfaces. To facilitate understanding, first some concepts:
Times in the Kernel
The basic time types in the kernel:
1) system time
A monotonically increasing value representing the amount of time the system has been running. It can be calculated from the time source, xtime, and wall_to_monotonic.
2) wall time
A value representing the human time of day, as seen on a wristwatch; the realtime time: xtime.
3) time source
A free-running counter at a known frequency, usually in hardware, e.g. a GPT. The counter value is obtained through clocksource->read().
4) tick
A periodic interrupt generated by a hardware timer, typically with a fixed interval defined by HZ: jiffies.

These times are correlated and mutually convertible:
system_time = xtime + cyc2ns(clock->read() - clock->cycle_last) + wall_to_monotonic
real_time = xtime + cyc2ns(clock->read() - clock->cycle_last)
That is, real time is the nanoseconds from January 1, 1970 to the present, and system time is the nanoseconds from system startup to the present.
These two are the most important times; an hrtimer can set its expiration against either of them, so two clock bases are introduced.

Clock Base
CLOCK_REALTIME: based on the actual wall time
CLOCK_MONOTONIC: based on the running system time
An hrtimer selects one of them when setting its expire time, i.e. either an absolute wall time or a time relative to system startup.
Each provides a get_time() interface:
CLOCK_REALTIME calls ktime_get_real(), which computes realtime using the real_time equation above.
CLOCK_MONOTONIC calls ktime_get(), which computes monotonic time using the system_time equation.

Timekeeping provides two interfaces, do_gettimeofday()/do_settimeofday(), both of which operate on realtime. User space's gettimeofday syscall eventually lands here.
do_gettimeofday() calls __get_realtime_clock_ts() to obtain the time and converts it to a timeval.
do_settimeofday() writes the user-supplied time into xtime, recalculates the xtime-to-monotonic conversion value, and finally notifies the hrtimers subsystem of the time change:


int do_settimeofday(struct timespec *tv)
{
	unsigned long flags;
	time_t wtm_sec, sec = tv->tv_sec;
	long wtm_nsec, nsec = tv->tv_nsec;

	if ((unsigned long)tv->tv_nsec >= NSEC_PER_SEC)
		return -EINVAL;

	write_seqlock_irqsave(&xtime_lock, flags);

	nsec -= __get_nsec_offset();

	wtm_sec = wall_to_monotonic.tv_sec + (xtime.tv_sec - sec);
	wtm_nsec = wall_to_monotonic.tv_nsec + (xtime.tv_nsec - nsec);

	set_normalized_timespec(&xtime, sec, nsec);	/* recalculate xtime: the user-set time minus the nsec accumulated since the last tick */
	set_normalized_timespec(&wall_to_monotonic, wtm_sec, wtm_nsec);	/* reset wall_to_monotonic */

	clock->error = 0;
	ntp_clear();

	update_vsyscall(&xtime, clock);
	write_sequnlock_irqrestore(&xtime_lock, flags);
	/* signal hrtimers about time change */
	clock_was_set();

	return 0;
}

Userspace Application
The introduction of hrtimer backs the following interfaces:

Clock API
clock_gettime(clockid_t, struct timespec *)
Obtain the corresponding clock's time
clock_settime(clockid_t, const struct timespec *)
Set the corresponding clock's time
clock_nanosleep(clockid_t, int, const struct timespec *, struct timespec *)
Put the process into a nanosecond-precision sleep
clock_getres(clockid_t, struct timespec *)
Obtain the clock's resolution, generally a nanosecond

clockid_t defines four types of clock:

CLOCK_REALTIME
System-wide realtime clock. Setting this clock requires appropriate privileges.
CLOCK_MONOTONIC
Clock that cannot be set and represents monotonic time since some unspecified starting point.
CLOCK_PROCESS_CPUTIME_ID
High-resolution per-process timer from the CPU.
CLOCK_THREAD_CPUTIME_ID
Thread-specific CPU-time clock.

The first two were discussed above; the last two relate to per-process/per-thread statistical time (e.g. utime/stime) and have not been studied carefully here. The application layer can use these four clock types to improve flexibility and accuracy.

Timer API

These calls can be used to create per-process timers.

int timer_create(clockid_t clockid, struct sigevent *restrict evp, timer_t *restrict timerid);
Create a timer.
clockid specifies the clock base under which the timer is created.
evp (sigevent) can specify the signal the kernel sends to the process when the timer expires, and the parameter carried in the signal; the default is SIGALRM.
timerid returns the ID of the created timer.
In the signal handler, siginfo_t.si_timerid identifies the timer that triggered the current signal. Experimentally, the maximum number of timers that can be created is related to the pending-signals limit in ulimit and cannot exceed it.

int timer_gettime(timer_t timerid, struct itimerspec *value);
Obtain the time remaining until the timer next expires.

int timer_settime(timer_t timerid, int flags, const struct itimerspec *restrict value, struct itimerspec *restrict ovalue);
Set the timer's expiration time and interval.

int timer_delete(timer_t timerid);
Delete the timer.

All these system calls create a posix_timer hrtimer in the kernel, which sends a signal to the process on expiry.

Summary
The introduction of hrtimer and clockevent/clocksource greatly improves the kernel's real-time behavior. It abstracts clock handling out of the architecture code, enhancing code reuse, and provides strong support for the POSIX time/timer standards, improving the precision and flexibility of time handling for user-space applications. If anything is unclear when using these syscalls from the application layer, reading the hrtimer code directly is a great help for solving the problem and understanding OS behavior.


