Topics from the LLVM microconference

Posted Aug 31, 2015 10:44 UTC (Mon) by nix (subscriber, #2304)
Parent article: Topics from the LLVM microconference

BPF for tracing is currently a hot area, Starovoitov said. It is a better alternative to SystemTap and runs two to three times faster than Oracle's DTrace. Part of that speed comes from LLVM's optimizations plus the kernel's internal just-in-time compiler for BPF bytecode.

This claim seems exceptionally unlikely to me. Interpreting DOF is really not an expensive operation: it's just a switch, plus some very simple prologue/epilogue code for shuffling the arguments and return value into place, plus the code needed to actually do what the DOF has asked. Most DTrace uses I've seen (even Brendan's! :) ) have no probes with anything longer than a few hundred opcodes attached to them. Lacking loops, and with only non-nested analogues of conditionals, D is not a language in which one would write something long or complicated enough to need optimization. All of DOF interpretation plus all the buffer management is going to be hugely dominated by the cost of taking a trap (for sdt/usdt) or a ring transition into kernel space (for systrace), so the claimed speedup can only really apply to fbt. If he's tested fbt on Linux I'd be quite astonished, since it only exists on one person's computer so far.

But it may be true! It's possible that LLVM's native code for argument marshalling is better than the handwritten stuff DTrace uses, and it's just barely possible that in some synthetic workloads this dominates. If there's some actual data showing it, particularly if it's relevant outside pure benchmarks, I'd be fascinated to see it.
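
To make the "just a switch" description concrete, here is a minimal sketch of a switch-dispatched bytecode interpreter in C. It is an illustration only: the opcodes, the `struct insn` layout and the `interp()` function are all hypothetical and are not DTrace's (or BPF's) real instruction set or code. The point is just that each opcode costs one dispatch branch plus a handful of register operations, which is the per-instruction overhead a JIT would remove.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes, invented for this sketch (not DTrace's real ones). */
enum op { OP_LOADARG, OP_LOADIMM, OP_ADD, OP_SUB, OP_RET };

/* Hypothetical fixed-width instruction. */
struct insn {
    enum op  op;
    uint8_t  rd;   /* destination register */
    uint8_t  r1;   /* first source register, or argument index for OP_LOADARG */
    uint8_t  r2;   /* second source register */
    uint64_t imm;  /* immediate operand for OP_LOADIMM */
};

#define NREGS 8

/*
 * Run a straight-line program (no loops, no backward jumps) over the
 * probe arguments and return the value named by OP_RET, or 0 if the
 * program falls off the end. A real implementation would validate the
 * program when it is loaded; this sketch only asserts on register bounds.
 */
uint64_t interp(const struct insn *prog, size_t len,
                const uint64_t *args, size_t nargs)
{
    uint64_t regs[NREGS] = {0};

    for (size_t pc = 0; pc < len; pc++) {
        const struct insn *i = &prog[pc];

        assert(i->rd < NREGS && i->r1 < NREGS && i->r2 < NREGS);

        switch (i->op) {            /* the dispatch branch, once per opcode */
        case OP_LOADARG:            /* "prologue": pull an argument into a register */
            regs[i->rd] = (i->r1 < nargs) ? args[i->r1] : 0;
            break;
        case OP_LOADIMM:
            regs[i->rd] = i->imm;
            break;
        case OP_ADD:
            regs[i->rd] = regs[i->r1] + regs[i->r2];
            break;
        case OP_SUB:
            regs[i->rd] = regs[i->r1] - regs[i->r2];
            break;
        case OP_RET:                /* "epilogue": hand the result back */
            return regs[i->r1];
        }
    }
    return 0;
}
```

A four-instruction program that loads argument 0, loads the immediate 5, adds them and returns the sum goes round the loop four times and takes four dispatch branches. That is small next to a trap or a ring transition, which is the argument being made above.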

Posted Aug 31, 2015 16:46 UTC (Mon) by Cyberax (✭ supporter ✭, #52523)

Also consider locality and branch prediction - compiled eBPF code has no need to access other data or to branch (in most cases).

Posted Sep 1, 2015 8:00 UTC (Tue) by nix (subscriber, #2304)

Hm. True enough. Accessing other data is unlikely to be relevant -- it's likely to be in L1 cache (it's almost always parameters of the containing function or other data that the kernel has just touched). Branches, though... the question is, why would branches dominate? The huge number of branches the kernel does anyway is *still* likely to dwarf them in anything but, say, synthetic benchmarks of tracing getpid() or something near-empty like that.

Really, without knowing the benchmark I'm left groping in the dark.

(As for fixing it... branches could definitely be reduced, or predicted, I suppose, at least in the hot spots. We haven't really done much performance optimization of this bit of the system -- the assumption has been that getting into dtrace_probe() would almost always be the expensive part. So there is surely room for improvement here.)