Kernel Summit 2005: RAS tools

[Posted July 20, 2005 by corbet]

From LWN's 2005 Kernel Summit coverage.

Suparna Bhattacharya led a session on RAS (reliability, availability, and serviceability) tools. The state of the art has advanced somewhat in the last year; this session was thus mostly a status report, rather than a place where future work was to be discussed.

The kexec and kdump patches (last covered here in June) have been merged into the mainline. Together these patches enable the creation of a far more reliable crash dump capability than Linux has had in the past. There is still work to be done, however, much of it in user space, to get crash dumps to a point where they can be deployed by vendors.

There are also a few remaining issues. Driver initialization is one of them; after a kernel crash (or any other invocation of a new kernel with kexec) the BIOS initialization will not have been performed. So drivers will have to reset their hardware from an unknown initial state. Getting the frame buffer back into working condition is a challenge in the best of times, and will be made more difficult in a panic situation. It is also important to put an end to any DMA operations which may have been happening when the kernel crash took place. That, in turn, may require a big bus reset, something the kernel normally tries to avoid doing. All of this implies that the kernel needs a flag saying "this is a crash dump kernel" so that it can take the appropriate steps.
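As a rough illustration of the "crash dump kernel" flag idea, the sketch below checks a boot command line for a marker and, if present, lists the extra recovery steps mentioned above. The flag name `crashdump` and both function names are hypothetical, chosen only for this example; the session did not settle on a name or a mechanism, and real kernel code would of course be written in C.

```python
# Hypothetical sketch: "crashdump" is an assumed flag name, not one
# taken from the session or from any kernel release.

def is_crash_dump_boot(cmdline, flag="crashdump"):
    """Return True if the boot command line carries the assumed flag."""
    for token in cmdline.split():
        # Accept both the bare form ("crashdump") and "crashdump=value".
        name = token.split("=", 1)[0]
        if name == flag:
            return True
    return False


def recovery_steps(cmdline):
    """List the extra steps a crash dump boot would take, per the article."""
    if not is_crash_dump_boot(cmdline):
        return []
    return [
        "stop any DMA left running by the crashed kernel",
        "reset devices from an unknown initial state",
    ]
```

A normal boot returns an empty list of steps, so drivers would pay the cost of a heavy reset only when the flag is present.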

Keeping the analysis tools in sync with the kernel will also be a challenge; the high rate of change affects these tools just as much as it affects drivers. Crash dumps are most likely to be used with "enterprise" distribution kernels, however, which do not change very often.

There was some talk of relayfs, a tool for getting large amounts of trace data out of the kernel in a hurry. It turns out that the current relayfs implementation allows the trace data to be read via mmap(), but not via a normal read() call. There was a strange discussion on whether it was appropriate to implement read() until Linus decreed that it was silly not to.
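The difference between the two access styles can be sketched from user space. The example below is only an illustration: it uses an ordinary temporary file as a stand-in for a trace buffer (not an actual relayfs file), and the helper names are made up for this sketch. It reads the same data once with plain read() calls and once through a memory mapping.

```python
import mmap
import os
import tempfile


def read_via_read(path, chunk_size=4096):
    """Collect the file contents with ordinary read() calls."""
    chunks = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            chunks.append(chunk)
    return b"".join(chunks)


def read_via_mmap(path):
    """Collect the file contents by mapping the file into memory."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the file must not be empty.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[:]


def demo():
    """Write stand-in trace data to a temp file and read it both ways."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"event 1\nevent 2\n")
        path = tmp.name
    try:
        return read_via_read(path), read_via_mmap(path)
    finally:
        os.unlink(path)
```

Both paths return the same bytes; the read() version is what simple tools such as cat depend on, which is one reason its absence stands out.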


RAS tools - not meant for just "enterprise" distributions

Posted Jul 21, 2005 12:55 UTC (Thu) by suparna (guest, #7766)

Just a quick observation - first-failure data capture tools are pretty useful in general and not likely to be limited to "enterprise" distributions (even though that is where the tools have been most desperately needed). For example, kexec-based crash dumps have already been used for resolving problems reported by testers during regular kernel development, and now that they are in mainline, wider use for better bug reporting will hopefully help improve the rate of resolution of bugs and hence the quality of kernel development. Likewise with probe handler utilities. This is why keeping the tools in sync is an especially important design issue to be tackled, so that the tools are useful all through.

RAS tools - not meant for just "enterprise" distributions

Posted Sep 17, 2005 11:54 UTC (Sat) by dkumargupta (guest, #25680)

Completely agree with Suparna: keeping the tools in sync with the kernel is a must. Many of the enterprise-class kernels (AIX, Solaris, IRIX) made provision for tools by design rather than as an add-on, which is why maintainability is so easy.

In the case of Linux, "reliable" first crash analysis support would definitely improve the quality of the kernel.

One of my expectations of the summit was a road map for the future, apart from status. Any ideas from anybody would be useful. I still believe that "kernel stress testing" is a big challenge. Creating uncertain scenarios and extreme load conditions is still far beyond LTP or any similar tool I have seen so far.

Comments are welcome..

Best Regards
Deepak Kumar Gupta
Project Leader
System Software Group (OS Domain)
HCL Technologies Ltd
Noida- UP