1000 single bit operations per FLOP

Posted Jun 23, 2008 6:56 UTC (Mon) by jzbiciak (guest, #5246)
In reply to: 1000 single bit operations per FLOP by mikov
Parent article: The Kernel Hacker's Bookshelf: Ultimate Physical Limits of Computation

First, let's assume that each computation is done in the minimum number of steps possible. Many of the transistors spent in functional units today are spent on making a single copy of the operation go faster. When computing at an atomic scale, this probably no longer makes sense. It probably makes more sense to make each operation as simple and economical as possible so you can put down as many copies as you like precisely where they're needed.

With volume computing you'll have so much more connectivity to neighbors that economical implementations would likely win out. For instance, a carry-select adder, which computes two versions of the result speculatively and throws away one, wouldn't make sense since it's much, much larger than a ripple-carry adder. (IIRC, a state of the art 32-bit adder is around 3000 transistors, whereas a 32-bit ripple carry adder should be around 700 or so.)

A 1 bit full-adder—3 inputs, 2 outputs—consists of two pieces: The sum computation, which is two two-input XOR gates, and the carry computation, which is 3 two-input AND gates and 2 two-input OR gates. The logic equations are:

sum_out = a XOR b XOR carry_in
carry_out = (a AND b) OR (a AND c) OR (b AND c)

If you assume a ripple carry implementation, then that's it. 7 operations per bit if you allow the full complement of boolean operations. So, a 32-bit add will be a mere 224 boolean operations. Even if you penalize XOR and count them as 3 bit operations each, that's still only 352 boolean operations. Of course, IEEE-754 floating point isn't built around 32-bit adds. You go through the following major steps for a single-precision floating point addition:

(For now I assume a parallel implementation since it doesn't really affect the number of SBOPs. Also, I assume the more conservative 11 SBOP number for a single bit full adder.)

Subtract the mantissas to get the "alignment difference" between the two numbers. This is an 8 bit subtract, IIRC, so would cost 88 SBOPs.
Shift the numbers to align them. To be conservative, I'll say this comprises the following steps in a simplistic implementation:

Swap the two numbers based on the sign of the difference of the magnitudes. This requires 24 2:1 muxes, and each 2:1 mux costs 3 SBOPs, for a cost of 2 × 24 × 3 = 144 SBOPs.
Shifting the smaller magnitude number to the right by up to 24 positions. If you structure this as 5 layers of 2:1 muxes (which is overkill, since some of the inputs will be zero and so this could be optimized), you get 5 × 24 × 3 = 360 SBOPs.

Apply the sign to both numbers. IEEE-754 is stored in sign-magnitude form. To apply the sign, you merely need to XOR the number with the sign and inject an appropriate carry when you do the addition. Thus, we can estimate this as 48 XORs, which, if you count XOR as 3 SBOPs, is 144 SBOPs.
Actually add the two 24 bit mantissas. If we go with 11 SBOPs per bit, this is 264 SBOPs.
Strip the sign from the result. Again, you could simply XOR with the result's sign bit (which should be the carry out of the last adder). 3 × 24 = 72 SBOPs.
Count the number of leading zeros so that we can renormalize the result. I don't know off the top of my head how many SBOPs this ought to take.
Shift the final result to normalize it. Conservatively, let's use the crummy shifter above which is 5 × 24 × 3 = 360 SBOPs
Update the mantissa. This is another 8-bit add. 8 × 11 = 88 SBOPs.

So where does that put us in this really basic, if perhaps naive implementation? Around 1520 SBOPs plus the cost of the priority encoder used for normalization.

Sure, we haven't handled overflow, underflow, denormals, NaNs and so on. But, I also have assumed clumsy implementations for some of the more expensive bits, such as the shifters. If the shifters were half the cost, for example, the total drops to 1160. Overall, I'd say this quick and dirty analysis validates that Val's SWAG (Scientific Wild Ass Guess) is in the right ball park, and certainly well within an order of magnitude. Since nearly all the rest of the numbers are expressed in orders of magnitude, being off by 50% ain't so bad. The log10 of 1.5 is pretty small.