The 2012 Kernel Summit

Posted Sep 10, 2012 19:48 UTC (Mon) by Jonno (subscriber, #49613)
In reply to: The 2012 Kernel Summit by BenHutchings
Parent article: The 2012 Kernel Summit

> [citation needed] What I hear is that the performance hit is really marginal for almost all applications.

http://www.memetic.org/raspbian-benchmarking-armel-vs-armhf
Shows a 5-40% improvement for most applications. Not all of the difference is due to hardfloat; some is due to using an armv6, rather than armv4, instruction set. It is still very relevant to the choice between Debian armel and Raspbian armhf for the Raspberry Pi.

https://wiki.linaro.org/OfficeofCTO/HardFloat/Benchmarks2...
A more "fair" set of armv7 vs armv7 benchmarks showing up to 1400% improvement, though indeed most applications are within the margin of error. The most interesting gains are IMHO in the ffmpeg and gtk benchmarks.

The 2012 Kernel Summit

Posted Sep 11, 2012 3:20 UTC (Tue) by BenHutchings (subscriber, #37955)

> http://www.memetic.org/raspbian-benchmarking-armel-vs-armhf
> Shows a 5-40% improvement for most applications.

I think I wasn't very clear. It is certainly possible to build significantly faster binaries for the RPi processor than are currently available in the armel architecture. But how much of the performance improvement is due to using the hard-float vs soft-float ABI and how much is due to optimising for v6 vs v4? You see, it should be possible to build optimised libraries for v6 processors and then use the dynamic linker's CPU feature checks to select between the v4-compatible and optimised v6 builds (like we do on i386 for libraries that benefit from use of CMOV and SSE2). And those could be added to Debian without any need to fork or rebuild. Has that been tried and compared?
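To make the selection idea concrete, here is a minimal Python sketch. It is not how the dynamic linker is actually implemented. It assumes a hypothetical layout in which an optimised build sits in a subdirectory named after a CPU feature (here `vfp`) next to the baseline build, and it takes the feature flags from the `Features` line of `/proc/cpuinfo`-style text. The names `cpu_features` and `pick_library` are invented for illustration.

```python
import os
import posixpath


def cpu_features(cpuinfo_text):
    """Return the flags on the first 'Features' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            return set(value.split())
    return set()


def pick_library(base_dir, name, features, exists=os.path.exists,
                 preferred=("vfp",)):
    """Pick an optimised build if the CPU supports it, else the baseline build.

    Assumption (hypothetical layout): optimised builds live in
    base_dir/<feature>/name and the baseline build in base_dir/name.
    """
    for feature in preferred:
        candidate = posixpath.join(base_dir, feature, name)
        # Use the optimised build only if the CPU has the feature
        # AND that build is actually installed.
        if feature in features and exists(candidate):
            return candidate
    return posixpath.join(base_dir, name)
```

The point of the sketch is that the baseline build always remains as a fallback, so a single package set can serve both older and newer processors.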

> https://wiki.linaro.org/OfficeofCTO/HardFloat/Benchmarks2...
> A more "fair" set of armv7 vs armv7 benchmarks showing up to 1400% improvement, though indeed most applications are within the margin of error. The most interesting gains are IMHO in the ffmpeg and gtk benchmarks.

I can't help wondering whether the huge gains for two POVray tests (not seen in the Phoronix benchmarks that use POVray) are due to some mistake in building them (e.g. building without using the FPU at all). Also tests on a v7 core aren't necessarily representative of the RPi's v6 core. But this is hopefully somewhat indicative of the difference between hard-float and soft-float, and the results really are quite mixed!

The 2012 Kernel Summit

Posted Sep 11, 2012 3:34 UTC (Tue) by dlang (guest, #313)

> e.g. building without using the FPU at all)

Isn't that what softfloat vs hardfloat is? Softfloat is software floating point (i.e. no FPU), while hardfloat uses the hardware floating point engine (i.e. the FPU).

The 2012 Kernel Summit

Posted Sep 11, 2012 5:23 UTC (Tue) by BenHutchings (subscriber, #37955)

> Isn't that what softfloat vs hardfloat is? softfloat is software floating point (i.e. no FPU) while hardfloat is using the hardware floating point engine (i.e., use the FPU)

Not if you're referring to the ABIs, which are the fundamental difference between what dpkg calls armel and armhf. The soft-float variant of ARM EABI (armel) requires that FP parameters and return values are passed in integer registers, for compatibility with FPU-less processors. Functions built for the soft-float ABI may still use an FPU if present, but they have to move values between integer and FP registers at call boundaries. The benefit of the hard-float ABI (armhf) in terms of code generation is that it uses the FP registers for parameters and return values, reducing the need for register moves.
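As a rough illustration of those moves at call boundaries, the following Python sketch models a 32-bit integer register as a plain integer holding the bit pattern of a single-precision value. It only simulates the idea; real code does this with register-move instructions, and the helper names (`fp_to_core_register`, `core_register_to_fp`, `call_softfloat_abi_style`) are invented for illustration.

```python
import struct


def fp_to_core_register(value):
    """Bit pattern of a single-precision float, as a 32-bit unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def core_register_to_fp(bits):
    """Reinterpret a 32-bit integer bit pattern as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def call_softfloat_abi_style(func, x, y):
    """Model a call where FP arguments and the result travel in integer registers."""
    # Caller: move the FP arguments into integer registers.
    r0, r1 = fp_to_core_register(x), fp_to_core_register(y)
    # Callee: move them back into FP registers and do the real work there.
    result = func(core_register_to_fp(r0), core_register_to_fp(r1))
    # Callee: return the result in an integer register; caller moves it back.
    r0 = fp_to_core_register(result)
    return core_register_to_fp(r0)


print(hex(fp_to_core_register(1.5)))  # 0x3fc00000
```

Under the hard-float ABI the three conversion steps in `call_softfloat_abi_style` simply disappear, because the values stay in FP registers across the call.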

Of course, targeting armhf does mean all code can be built to assume an FPU is present, without the need to provide a fallback for less capable processors. This is simpler in some ways as there is no need to do run-time CPU checks in executables, and no need to build multiple variants of libraries. But looking at this from a higher level again, adding optimised libraries to Debian's armel should mean less work in the long term for Raspbian...

The 2012 Kernel Summit

Posted Sep 11, 2012 8:21 UTC (Tue) by Jonno (subscriber, #49613)

The biggest difference between armhf and armel is that what an armhf compiler would have compiled into a single floating point instruction, an armel compiler will instead make into a function call.

On a floating point capable system that function will only be three instructions long (copy the arguments from integer registers to floating point registers, the actual floating point instruction, copy the result from a floating point register to an integer register), while on other systems the function will contain a complex emulation instead.

Thus, even armel will make use of the floating point capabilities of a processor that has one, but it will do so in an inefficient manner (for every instruction, add a function call and two register copies).

If a single user function contains many floating point operations, it is possible to minimize this cost by compiling two versions of it for armel, one that internally uses floating point instructions directly, and one that internally uses the standard emulation functions. That way you only have to pay the extra cost when crossing user-function-borders, rather than for every instruction. This can either be done on a function-by-function basis by the application developer, or on a library-by-library basis by the distributor (which is what Ben refers to). Adding such an alternative library for every fp-using library in Debian would have netted almost as large a benefit as that of Raspbian, without the need of forking the entire distribution, but I wouldn't want the job of trying to convince ~100 package maintainers and the release team that it was a worthwhile feature to add to wheezy in the middle of the pre-release freeze...
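A back-of-the-envelope version of that trade-off, as a Python sketch. It takes the cost model described above at face value (one call plus two register copies per wrapped operation) and counts each as one unit, which is a crude assumption; real costs depend on the core and the compiler. The function names are invented for illustration.

```python
CALL_COST = 1   # assumed: one unit for the extra function call
COPY_COST = 2   # assumed: two register copies around each wrapped operation


def per_operation_overhead(fp_ops):
    """Extra units when every FP operation goes through a helper function."""
    return fp_ops * (CALL_COST + COPY_COST)


def per_function_overhead(fp_args, fp_results=1):
    """Extra units when only the user function's boundary pays for copies."""
    return fp_args + fp_results


# A user function with 100 FP operations, 2 FP arguments and 1 FP result:
print(per_operation_overhead(100))  # 300
print(per_function_overhead(2))     # 3
```

Under these assumptions the overhead of the per-function variant does not grow with the amount of floating point work inside the function, which is why the gain is largest for FP-heavy routines.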