The high-resolution timer API

[Posted January 16, 2006 by corbet]

Last September, this page featured an article on the ktimers patch by Thomas Gleixner. The new timer abstraction was designed to enable the provision of high-resolution timers in the kernel and to address some of the inefficiencies encountered when the current timer code is used in this mode. Since then, there has been a large amount of discussion, and the code has seen significant work. The end product of that work, now called "hrtimers," was merged for the 2.6.16 release.

At its core, the hrtimer mechanism remains the same. Rather than using the "timer wheel" data structure, hrtimers live on a time-sorted linked list, with the next timer to expire being at the head of the list. A separate red/black tree is also used to enable the insertion and removal of timer events without scanning through the list. But while the core remains the same, just about everything else has changed, at least superficially.
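The ordering rule described above can be sketched in user-space C. This is a toy model, not the kernel's code: struct toy_timer and toy_enqueue() are hypothetical names, and the insertion point is found with a plain linear scan, where the kernel (as noted) uses a red/black tree to avoid one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a queued timer; not the kernel's struct hrtimer. */
struct toy_timer {
    int64_t expires_ns;         /* absolute expiration time */
    struct toy_timer *next;     /* next timer to expire after this one */
};

/*
 * Insert t so that the list stays sorted by expiration time and return
 * the (possibly new) head. Timers with equal times keep insertion order.
 */
static struct toy_timer *toy_enqueue(struct toy_timer *head, struct toy_timer *t)
{
    struct toy_timer **link = &head;

    assert(t != NULL);
    while (*link != NULL && (*link)->expires_ns <= t->expires_ns)
        link = &(*link)->next;
    t->next = *link;
    *link = t;
    return head;
}
```

After any sequence of insertions, the head of the list is the next timer to expire, so the expiry path only ever needs to look at the first entry.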

There is a new type, ktime_t, which is used to store a time value in nanoseconds. This type, found in <linux/ktime.h>, is meant to be used as an opaque structure. And, interestingly, its definition changes depending on the underlying architecture. On 64-bit systems, a ktime_t is really just a 64-bit integer value in nanoseconds. On 32-bit machines, however, it is a two-field structure: one 32-bit value holds the number of seconds, and the other holds nanoseconds. The order of the two fields depends on whether the host architecture is big-endian or not; they are always arranged so that the two values can, when needed, be treated as a single, 64-bit value. Doing things this way complicates the header files, but it provides for efficient time value manipulation on all architectures.
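As a rough illustration of the 32-bit layout, the sketch below builds a similar union in user space. The names toy_ktime, toy_ktime_set() and toy_ktime_before() are made up for this example, the field order is chosen with the __BYTE_ORDER__ macro that GCC and Clang predefine (an assumption about the compiler), and the real definition in <linux/ktime.h> may differ in its details.

```c
#include <assert.h>
#include <stdint.h>

/* Toy version of the two-field representation; not the kernel's ktime_t. */
union toy_ktime {
    int64_t tv64;               /* both fields viewed as one 64-bit value */
    struct {
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
        int32_t sec, nsec;      /* big-endian: the high half comes first */
#else
        int32_t nsec, sec;      /* little-endian: the low half comes first */
#endif
    } tv;
};

static union toy_ktime toy_ktime_set(int32_t sec, int32_t nsec)
{
    union toy_ktime kt;

    assert(nsec >= 0 && nsec < 1000000000);
    kt.tv.sec = sec;
    kt.tv.nsec = nsec;
    return kt;
}

/* Seconds sit in the upper half, so one 64-bit compare orders two times. */
static int toy_ktime_before(union toy_ktime a, union toy_ktime b)
{
    return a.tv64 < b.tv64;
}
```

Placing the seconds in the most significant half is what lets a comparison be done as a single 64-bit operation instead of two 32-bit ones.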

A whole set of functions and macros has been provided for working with ktime_t values, starting with the traditional two ways to declare and initialize them:

    DEFINE_KTIME(name);   /* Initialize to zero */

    ktime_t kt;
    kt = ktime_set(secs, nanosecs);   /* long secs, long nanosecs */

Various other functions exist for changing ktime_t values; all of these treat their arguments as read-only and return a ktime_t value as their result:

    ktime_t ktime_add(ktime_t kt1, ktime_t kt2);
    ktime_t ktime_sub(ktime_t kt1, ktime_t kt2);  /* kt1 - kt2 */
    ktime_t ktime_add_ns(ktime_t kt, u64 nanoseconds);
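Because these helpers return new values rather than modifying their arguments, they compose like ordinary expressions. The sketch below models that style in user space with a scalar nanosecond count (the 64-bit case); every toy_ name is hypothetical and merely mirrors the kernel function it is named after.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NSEC_PER_SEC 1000000000LL

/* Scalar model: a time is simply a signed count of nanoseconds. */
typedef struct { int64_t tv64; } toy_ktime_t;

static toy_ktime_t toy_ktime_set(long secs, long nsecs)
{
    toy_ktime_t kt;

    assert(nsecs >= 0 && nsecs < TOY_NSEC_PER_SEC);
    kt.tv64 = (int64_t)secs * TOY_NSEC_PER_SEC + nsecs;
    return kt;
}

/* Each helper takes its arguments by value and returns a fresh result. */
static toy_ktime_t toy_ktime_add(toy_ktime_t kt1, toy_ktime_t kt2)
{
    kt1.tv64 += kt2.tv64;
    return kt1;
}

static toy_ktime_t toy_ktime_sub(toy_ktime_t kt1, toy_ktime_t kt2)
{
    kt1.tv64 -= kt2.tv64;       /* kt1 - kt2 */
    return kt1;
}

static toy_ktime_t toy_ktime_add_ns(toy_ktime_t kt, uint64_t nanoseconds)
{
    kt.tv64 += (int64_t)nanoseconds;
    return kt;
}
```

A deadline can then be written as toy_ktime_add(start, timeout), leaving both start and timeout unchanged for later use.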

Finally, there are some type conversion functions:

    ktime_t timespec_to_ktime(struct timespec tspec);
    ktime_t timeval_to_ktime(struct timeval tval);
    struct timespec ktime_to_timespec(ktime_t kt);
    struct timeval ktime_to_timeval(ktime_t kt);
    clock_t ktime_to_clock_t(ktime_t kt);
    u64 ktime_to_ns(ktime_t kt);
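To make the conversions concrete, here is a user-space sketch that models a few of them on top of a scalar nanosecond count. The toy_ names are hypothetical, the sketch handles only non-negative times, and it makes no claim about how the kernel versions are implemented.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>   /* struct timeval */
#include <time.h>       /* struct timespec */

#define TOY_NSEC_PER_SEC 1000000000LL

typedef struct { int64_t tv64; } toy_ktime_t;   /* nanoseconds */

static toy_ktime_t toy_timespec_to_ktime(struct timespec ts)
{
    toy_ktime_t kt;

    kt.tv64 = (int64_t)ts.tv_sec * TOY_NSEC_PER_SEC + ts.tv_nsec;
    return kt;
}

static struct timespec toy_ktime_to_timespec(toy_ktime_t kt)
{
    struct timespec ts;

    assert(kt.tv64 >= 0);       /* sketch limitation, see above */
    ts.tv_sec = (time_t)(kt.tv64 / TOY_NSEC_PER_SEC);
    ts.tv_nsec = (long)(kt.tv64 % TOY_NSEC_PER_SEC);
    return ts;
}

/* A timeval only holds microseconds, so the last three digits are dropped. */
static struct timeval toy_ktime_to_timeval(toy_ktime_t kt)
{
    struct timeval tv;

    assert(kt.tv64 >= 0);
    tv.tv_sec = (time_t)(kt.tv64 / TOY_NSEC_PER_SEC);
    tv.tv_usec = (kt.tv64 % TOY_NSEC_PER_SEC) / 1000;
    return tv;
}
```

A round trip through struct timespec preserves the value exactly, while a trip through struct timeval truncates to microsecond precision.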

The interface for hrtimers can be found in <linux/hrtimer.h>. A timer is represented by struct hrtimer, which must be initialized with:

    void hrtimer_init(struct hrtimer *timer, clockid_t which_clock);

Every hrtimer is bound to a specific clock. The system currently supports two clocks:

CLOCK_MONOTONIC: a clock which is guaranteed always to move forward in time, but which does not reflect "wall clock time" in any specific way. In the current implementation, CLOCK_MONOTONIC resembles the jiffies tick count in that it starts at zero when the system boots and increases monotonically from there.
CLOCK_REALTIME: a clock which matches the current real-world time.

The difference between the two clocks can be seen when the system time is adjusted, perhaps as a result of administrator action, tweaking by the network time protocol code, or suspending and resuming the system. In any of these situations, CLOCK_MONOTONIC will tick forward as if nothing had happened, while CLOCK_REALTIME may see discontinuous changes. Which clock should be used will depend mainly on whether the timer needs to be tied to time as the rest of the world sees it or not. The call to hrtimer_init() will tie an hrtimer to a specific clock, but that clock can be changed with:

    void hrtimer_rebase(struct hrtimer *timer, clockid_t new_clock);
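User space sees the same two clock identifiers through the POSIX clock_gettime() call, which makes the difference between the clocks easy to observe without writing kernel code. The sketch below is a user-space illustration only; clock_now_ns() and realtime_offset_ns() are hypothetical helpers, not part of the hrtimer API.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read the given clock and return its value in nanoseconds. */
static int64_t clock_now_ns(clockid_t which_clock)
{
    struct timespec ts;
    int rc = clock_gettime(which_clock, &ts);

    assert(rc == 0);
    (void)rc;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Difference between wall-clock and monotonic time. The two clocks are
 * read one after the other, so the result carries a small sampling error.
 */
static int64_t realtime_offset_ns(void)
{
    return clock_now_ns(CLOCK_REALTIME) - clock_now_ns(CLOCK_MONOTONIC);
}
```

Sampling realtime_offset_ns() before and after the system time is changed (with date, say) should show the offset jump, while successive CLOCK_MONOTONIC readings keep increasing.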

Most of the hrtimer fields should not be touched. Two of them, however, must be set by the user:

    int  (*function)(void *);
    void *data;

As one might expect, function() will be called when the timer expires, with data as its parameter.

Actually setting a timer is accomplished with:

    int hrtimer_start(struct hrtimer *timer, ktime_t time,
                      enum hrtimer_mode mode);

The mode parameter describes how the time parameter should be interpreted. A mode of HRTIMER_ABS indicates that time is an absolute value, while HRTIMER_REL indicates that time should be interpreted relative to the current time.

Under normal operation, function() will be called after (at least) the requested expiration time. The hrtimer code implements a shortcut for situations where the sole purpose of a timer is to wake up a process on expiration: if function() is NULL, the process whose task structure is pointed to by data will be awakened. In most cases, however, code which uses hrtimers will provide a callback function(). That function has an integer return value, which should be either HRTIMER_NORESTART (for a one-shot timer which should not be started again) or HRTIMER_RESTART for a recurring timer.

In the restart case, the callback must set a new expiration time before returning. Usually, restarting timers are used by kernel subsystems which need a callback at a regular interval. The hrtimer code provides a function for advancing the expiration time to the next such interval:

    unsigned long hrtimer_forward(struct hrtimer *timer, ktime_t interval);

This function will advance the timer's expiration time by the given interval. If necessary, the interval will be added more than once to yield an expiration time in the future. Generally, the need to add the interval more than once means that the system has overrun its timer period, perhaps as a result of high system load. The return value from hrtimer_forward() is the number of missed intervals, allowing code which cares to detect and respond to the situation.
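The catch-up arithmetic described above can be modeled in a few lines of user-space C. The sketch is a stand-in built on assumptions, not hrtimer_forward() itself: toy_forward() is a hypothetical name, the current time is passed in explicitly rather than read from the timer's clock, and the return value here counts the intervals that were added, which may not match the kernel's convention exactly.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Push *expires_ns forward in whole multiples of interval_ns until it
 * lies after now_ns. Returns how many intervals were added; anything
 * above one means that periods were overrun.
 */
static unsigned long toy_forward(int64_t *expires_ns, int64_t now_ns,
                                 int64_t interval_ns)
{
    unsigned long added;

    assert(interval_ns > 0);
    if (*expires_ns > now_ns)
        return 0;       /* assumption: a timer still in the future is left alone */

    added = (unsigned long)((now_ns - *expires_ns) / interval_ns) + 1;
    *expires_ns += (int64_t)added * interval_ns;
    return added;
}
```

With a 100ns period, a timer that was due at 1000 and is serviced at 1250 ends up at 1300 with three intervals added; code which cares about overruns can check for a result greater than one.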

Outstanding timers can be canceled with either of:

    int hrtimer_cancel(struct hrtimer *timer);
    int hrtimer_try_to_cancel(struct hrtimer *timer);

When hrtimer_cancel() returns, the caller can be sure that the timer is no longer active, and that its expiration function is not running anywhere in the system. The return value will be zero if the timer was not active (meaning it had already expired, normally), or one if the timer was successfully canceled. hrtimer_try_to_cancel() does the same, but will not wait if the timer function is running; it will, instead, return -1 in that situation.

A canceled timer can be restarted by passing it to hrtimer_restart().

Finally, there is a small set of query functions. hrtimer_get_remaining() returns the amount of time left before a timer expires. A call to hrtimer_active() returns nonzero if the timer is currently on the queue. And a call to:

    int hrtimer_get_res(clockid_t which_clock, struct timespec *tp);

will return, in the timespec structure pointed to by tp, the true resolution of the given clock.
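hrtimer_get_res() is an in-kernel interface, but the POSIX clock_getres() call takes the same kind of arguments (a clock ID and a struct timespec pointer) and can be tried from user space. Whether the two report the same figure on a given kernel is an assumption; the helper name clock_resolution_ns() is invented for this sketch.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Ask the system for a clock's resolution and flatten it to nanoseconds. */
static int64_t clock_resolution_ns(clockid_t which_clock)
{
    struct timespec res;
    int rc = clock_getres(which_clock, &res);

    assert(rc == 0);
    (void)rc;
    return (int64_t)res.tv_sec * 1000000000LL + res.tv_nsec;
}
```

Calling clock_resolution_ns(CLOCK_MONOTONIC) gives a quick indication of whether a system is ticking at HZ resolution or something finer.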

The high-resolution timer API

Posted Jan 19, 2006 11:26 UTC (Thu) by NAR (subscriber, #1313)

I might be misunderstanding something, but will this code work on 32-bit architectures after 2038? There, the seconds value of ktime_t is only 32 bits long; if the timer is bound to CLOCK_REALTIME, is it possible to install a timer that will expire sometime in 2038 with the HRTIMER_ABS mode?

Bye, NAR

The high-resolution timer API

Posted Jan 19, 2006 16:49 UTC (Thu) by vmole (guest, #111)

There's no requirement that the epoch for ktime_t be the same as for time_t. (Or, if there is, I missed it...)

The high-resolution timer API

Posted Jan 19, 2006 13:14 UTC (Thu) by hein.zelle (guest, #33324)

A question that may not be sensible at all, as this concerns kernel functions: will these functions be usable from userspace somehow, to achieve better precision than, e.g., 0.01 seconds when using functions like usleep() or gettimeofday()? I am not familiar with the technical details, but my earlier attempts at achieving greater time resolution all failed.

Is there already a common way to achieve higher resolution from a C program?

High-resolution timers in user space

Posted Jan 19, 2006 14:51 UTC (Thu) by corbet (editor, #1)

I suppose I could have said something about that... hrtimers are used now for the implementation of POSIX timers and for the nanosleep() call, so, in that sense, yes they are available to user space.

The other thing which I really should have mentioned (I did in an earlier article) was that, in order to provide truly high resolution, you also need a high-resolution clock within the kernel. Current kernels still do not have that, so the hrtimer interface still works with HZ resolution - 4ms on i386 with the default configuration. There are a few high-resolution clock patches around, mainly tied to John Stultz's low-level clock rework; something should get merged before too long, I would think (but not for 2.6.16).
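A quick way to see what a given system delivers is to time a short nanosleep() from user space. The sketch below is only an illustration: monotonic_ns() and measured_sleep_ns() are hypothetical helpers, and the numbers obtained depend entirely on the kernel and hardware in use.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

static int64_t monotonic_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Sleep for at least request_ns and return the time that actually passed. */
static int64_t measured_sleep_ns(int64_t request_ns)
{
    struct timespec req, rem;
    int64_t start;

    assert(request_ns >= 0);
    req.tv_sec = (time_t)(request_ns / 1000000000LL);
    req.tv_nsec = (long)(request_ns % 1000000000LL);

    start = monotonic_ns();
    /* nanosleep() returns early if a signal arrives; resume with the remainder. */
    while (nanosleep(&req, &rem) == -1 && errno == EINTR)
        req = rem;
    return monotonic_ns() - start;
}
```

On a kernel whose timers run at HZ resolution, a request such as measured_sleep_ns(100000) would be expected to come back with a figure in the millisecond range rather than anything close to the 100 microseconds asked for.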

64-bit/32-bit compatibility

Posted Jan 21, 2006 1:00 UTC (Sat) by roelofs (guest, #2599)

On 64-bit systems, a ktime_t is really just a 64-bit integer value in nanoseconds. On 32-bit machines, however, it is a two-field structure: one 32-bit value holds the number of seconds, and the other holds nanoseconds. The order of the two fields depends on whether the host architecture is big-endian or not; they are always arranged so that the two values can, when needed, be treated as a single, 64-bit value.

Perhaps I'm missing something, but I was always under the impression that the number of nanoseconds in a second (10⁹) was not actually a power of two. Even if it were, it's not 2³²...

So how do we reconcile the two-32-bit-values-as-one-64-bit-value problem? Does the nanoseconds half count up monotonically for 4.3 seconds, at which point some weird adjustment is made? (No, that would seem to be incompatible with the 64-bit view of things.) Is the "seconds" half not really seconds but actually a count of 4.3[etc.]-second intervals? Or is the "nanoseconds" half not really nanoseconds but actually a count of 1/4.3[etc.]-nanosecond super-micro-jiffies?

Am I missing something obvious? (I didn't get much sleep last night, so I might be...)

Greg

64-bit/32-bit compatibility

Posted Jan 21, 2006 10:02 UTC (Sat) by mingo (guest, #31122)

There are two "models". Every architecture picks a model at build-time, and sticks to it - the model's format is compiled into the kernel, and ktime inline functions behave differently in both models.

the first one is the 'scalar' model, which is used on 64-bit platforms, where the 64-bit value contains nanosec values, and if we ever get seconds values from userspace, it's converted into nanosecs.

the second is the 'union' model, where the 64-bit word is split into two 32-bit fields, the upper one holds seconds, the lower one nanoseconds. ktime_t values are always in 'normalized' form: the lower 32 bits must only contain values below 10^9. (I.e. the range 0x3b9aca00-0xffffffff is excluded from the lower 32 bits, only range 0x00000000-0x3b9ac9ff is allowed.) The 64-bit word can also be accessed as a whole, via a union field in the ktime_t structure.

you might ask: what is the win as opposed to having two separate 32-bit fields? Even though values always have to be normalized after operations on them (addition, subtraction), it's still beneficial to do arithmetic on the 64-bit field:

    /* add 'delta' to 'ktime' */
    ktime.tv64 += delta.tv64;
    if (ktime.nsec >= 1000000000) {
        ktime.nsec -= 1000000000;
        ktime.sec++;
    }

note that in the above op there is no division nor multiplication, and since both ktime values were normalized, only a single-step normalization is needed afterwards.
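For readers who want to try this, here is a self-contained user-space rendering of the addition above. It is a sketch, not the kernel's code: union toy_ktime and the toy_ functions are hypothetical, and the field order relies on the __BYTE_ORDER__ macro predefined by GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Toy 'union' model: two 32-bit fields overlaid on one 64-bit word. */
union toy_ktime {
    int64_t tv64;
    struct {
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
        int32_t sec, nsec;
#else
        int32_t nsec, sec;
#endif
    } tv;
};

static union toy_ktime toy_ktime_set(int32_t sec, int32_t nsec)
{
    union toy_ktime kt;

    assert(nsec >= 0 && nsec < 1000000000);     /* normalized input only */
    kt.tv.sec = sec;
    kt.tv.nsec = nsec;
    return kt;
}

/* Add two normalized values: one 64-bit add, then at most one carry step. */
static union toy_ktime toy_ktime_add(union toy_ktime kt, union toy_ktime delta)
{
    kt.tv64 += delta.tv64;              /* adds both halves at once */
    if (kt.tv.nsec >= 1000000000) {     /* sum is below 2 * 10^9, so it fits */
        kt.tv.nsec -= 1000000000;
        kt.tv.sec++;
    }
    return kt;
}
```

Since each nanosecond field is below 10^9, their sum stays below 2^31 and never spills into the seconds half, which is why the single 64-bit addition is safe.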

64-bit/32-bit compatibility

Posted Jan 21, 2006 18:43 UTC (Sat) by roelofs (guest, #2599)

Ah, I see now--thank you. I almost had the right idea, but I read too much into Jon's "single 64-bit value" comment.

Greg

64-bit/32-bit compatibility

Posted Jan 31, 2006 21:37 UTC (Tue) by efexis (guest, #26355)

heh, you could even use MMX or other SIMD to work on both halves with one instruction ;-)

64-bit/32-bit compatibility

Posted Mar 20, 2006 16:01 UTC (Mon) by jengelh (subscriber, #33263)

And floating point is unfortunately a no-go in kernelspace.