The Pentium contains a complicated circuit to multiply by three (righto.com)
310 points by Tomte 19 hours ago | 117 comments





Multiplication by 3 is actually a common operation, particularly in address calculations, where a shift and an add can multiply the index by 3. Implementing this naively would add significant latency. But using this circuitry the LEA (Load Effective Address) instruction can do it in a single cycle, so spending this much transistor budget on it was totally a good call.
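To make that concrete, here's a tiny illustration (my own, not from the article) of the shift-and-add identity that LEA exploits; the x86 form in the comment is the usual idiom compilers emit:

    # base + index*3 can be formed with one shift and one add, which is what
    # x86's LEA folds into a single instruction, e.g. lea eax, [ebx + ebx*2].
    def index_times_3(index):
        return (index << 1) + index

    assert index_times_3(7) == 21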

This is very tangential. I did some work in Trinary computer emulation a lifetime ago; there's a cute trick for finding a closed-form translation between division by a power of 3 and a series of bit shifts and additions.

Start by observing that

1/3 - 1/2 = 2/6 - 3/6 = -(1/2)(1/3)

or

1/3 = 1/2 - (1/2)(1/3)

Substitute the equation above into the RHS an infinite number of times and find

1/3 = sum of -(-1/2)^N for N in 1..inf = 1/2 - 1/4 + 1/8 - ...

You can do this with arbitrary pairs of powers of 2 and 3 (and other bases too).

The implication is that you can fairly easily build a fixed-time divide-by-constant circuit out of nothing but adders and subtractors, for divisors that are close to a power of two.
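As a rough illustration of the idea (my own sketch, not from the parent comment): approximating x/3 with the alternating series x/2 - x/4 + x/8 - ... uses nothing but shifts and add/subtract, though truncation can leave the result off by one, which a real circuit would need to correct for.

    def div3_approx(x, terms=32):
        # x/3 ~= x/2 - x/4 + x/8 - ... ; each term is just a right shift.
        acc = 0
        for n in range(1, terms + 1):
            term = x >> n
            acc = acc - term if n % 2 == 0 else acc + term
        return acc

    print(div3_approx(300), 300 // 3)   # 100 100
    print(div3_approx(21), 21 // 3)     # 6 7  (off by one due to truncation)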


The Cinematronics arcade game processor has two 12-bit accumulators. The multiply instruction shifts them right as a single 24-bit value and adds the contents of memory if a 1 came out of the LSB. So you clear the upper half, load one value into the lower, I forget how you set the memory address for the other operand... and execute several 1-bit multiplies in a row. You can get a 24-bit product this way, but most code I saw had runs of 8 multiplies. The most common thing was a 2x2 matrix multiply to do coordinate rotations for game objects.

This was in the mid 1970s using off-the-shelf 7400-series parts and had a peak throughput of 5 MIPS.


Not exactly one cycle for a multiply, it sounds like. That 5 MIPS would be eaten up quickly. I have sometimes had to do fixed-point math in the past 20 years and have had much respect for those programmers who came before me.

>> Not exactly one cycle for a multiply, it sounds like. That 5 MIPS would be eaten up quickly.

Sure, taking 8 instructions for a multiply is a lot, but microprocessors at that time didn't even have multiply and ran below 1 MIPS. I just wanted to bring it up as yet another way someone implemented multiply, and it was pretty effective for its time.

EDIT: too late to edit my original comment, but it added the memory value to the upper half of the 24-bit result.
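Putting the description and the correction together, one step of that multiply might look roughly like this (my own Python sketch of the classic shift-and-add scheme; I haven't checked it against the real hardware):

    def multiply_step(acc24, operand12):
        # If the LSB of the 24-bit accumulator pair is 1, add the memory operand
        # into the upper 12 bits, then shift the whole pair right one bit.
        if acc24 & 1:
            acc24 += operand12 << 12
        return acc24 >> 1

    def multiply_12x12(a, b):
        acc = a & 0xFFF          # multiplier starts in the lower accumulator
        for _ in range(12):      # game code apparently often ran only 8 of these
            acc = multiply_step(acc, b & 0xFFF)
        return acc               # 24-bit product

    assert multiply_12x12(100, 200) == 20000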


> you may have heard of techniques such as carry lookahead, Kogge-Stone addition

Just an aside, that is Peter Kogge, who did his PhD work at Stanford, worked on the space shuttle, is an IBM fellow, and invented the first multi-core cpu.


> invented the first multi-core cpu

The man clearly has many deserved achievements to his name, but that is true without this, and I'm confident the world would be better without this kind of statement.

"A multi-core CPU" isn't much of an invention per se. It's an idea -- one that is fairly obvious and trivial at a certain point of semiconductor history. Getting a multi-core CPU to run is not trivial, but it's not a single invention either, and by the time we got there, development teams were so large that it would be downright insulting to claim that one person solved all these problems by himself. Perhaps Kogge led the development of the first multi-core CPU, perhaps he was even visionary in the sense that he pushed for it before others thought it was feasible (I do not know if that is the case). But either way, he didn't invent it.


Thank you for keeping me honest, I concede the point; I was quoting from his Wikipedia entry and wasn’t particularly critical because I took an architecture class from him in grad school and liked him as a professor.

I thought Kunle Olukotun led the team for the first multi-core CPU.

You may absolutely be right, I don’t know who did it “first”.

I read your comment in exactly this voice https://www.getyarn.io/yarn-clip/6b70f8d0-5706-4e10-a6e9-e61...

(In this scene, Steve Martin’s character Vinnie is trying to get away from Rick Moranis’s character Barney, and he gets a bunch of actors to pretend to be his family (none of whom speak English) to play on the latter’s sympathy and ask to have his handcuffs removed. Vinnie introduces Barney as, among other things, the inventor of the rotary engine, causing one of the actresses to break character and say “I thought Wankel invented the rotary engine”.)


Author here if anyone has questions...

What happened to the dedicated "times three" multiplier in later machines? Did some form of it stick around? Did they change tactics to something making it obsolete?

Obsolete.

You can observe the evolution of multipliers in the MIPS line, which I've been studying, and happen to know.

The R3000A had the same Radix-8 (aka 3-bit per cycle) integer multiplier in 1989.

The R4000 had two Radix-8 multipliers in 1991. One for integer, one for floating point.

The R4400 was an upgraded R4000 in 1992. The integer multiplier was kept at Radix-8, but the FPU was upgraded to a Radix-256 design (aka 8-bits per cycle).

In parallel, MIPS spent a lot of time creating a low-power design for the target market of "Windows NT laptops". The result was the R4200, released in 1993. MIPS published quite a bit of information about the various power-saving optimisations [1], [2]. Instead of separate integer and floating-point data paths, they created a unified data path that did both, allowing them to use the same Radix-8 multiplier for everything. They even unified the multiplier unit into the main adder, rather than using a separate adder like earlier designs.

In 1995, MIPS released the R4300i (aka the CPU found in the Nintendo 64). It was an evolution of the R4200, keeping the unified float/integer datapath. But it gained the Radix-256 multiplier from the R4400 design, so both integer and float instructions complete at 8 bits per cycle.

As far as I can tell, this Radix-256 multiplier doesn't use any fancy tricks. It's just an array of eight 64-bit wide carry-save adders, feeding into a regular carry-lookahead adder.
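(For anyone unfamiliar with the term: a carry-save adder compresses three inputs into a sum word and a carry word without propagating carries, which is why you can stack a whole array of them cheaply and pay for carry propagation only once at the end. A toy sketch of the idea, not the MIPS circuit:)

    def carry_save_add(a, b, c):
        # Per-bit full adder: sum = XOR of the three bits, carry = majority,
        # and the carry word feeds the next-higher bit position.
        return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

    terms = [13, 200, 7, 91, 54]
    s, c = carry_save_add(*terms[:3])
    for t in terms[3:]:
        s, c = carry_save_add(s, c, t)
    assert s + c == sum(terms)   # one real (carry-propagating) addition at the very end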

In 1996, MIPS released the R10000. Transistors were now cheap enough that they could implement a full 52-bit adder for their floating-point data path, allowing them to issue one double-precision multiplication every cycle (though it's pipelined with a 2-cycle latency). I assume it's just 52 stacked adders, though it seems like they probably need to be doing something fancier with carries by the time it's that big.

Most modern CPUs have ended up at the same point. Massive pipelined arrays of adders that can execute one 64-bit multiplication every cycle, with a 3 cycle latency.

[1] https://www.youtube.com/watch?v=nll5MWlG7q4 [2] https://ieeexplore.ieee.org/document/282948/


> Massive pipelined arrays of adders that can execute one 64-bit multiplication every cycle, with a 3 cycle latency.

How do these manage to achieve the 3 cycle latency? If it has to go through a pipeline of 64 adders in a row (I don't know if it's actually that many! but I assume more than 3), how do you still manage to get just a 3 cycle latency, assuming a single adder has a latency of 1?


It seems you misunderstood the design of the multiplier described in the article.

The Pentium multiplier isn't 3 bits per cycles, it's one full multiplication per cycle (though pipelined over multiple cycles).

The part that's 3 bits is the Booth multiplier trick of multiplying 3-bit digits instead of 1-bit digits, but they're all multiplied in parallel.

As such, I suspect (but don't know) that this trick or some variant of it is used even today.


Yeah, you are right... I've misunderstood radix-8 multiplication, missed that this post was only talking about a small part of the Pentium's multiplier, and jumped to conclusions... And annoyingly, Hacker News doesn't allow you to edit comments after a few hours.

On the R3000/R4000/R4200, the 3-bits-per-cycle multipliers do use radix-8 multiplication, but they don't have a dedicated 3x multiplier. Instead the 3x result is latched during the first cycle (by adding (x << 1) + x). For the remaining cycles it can do a 3-bit multiplication with nothing more than a bit of Booth recoding logic and a single 64-bit wide adder.

Then MIPS entirely abandoned this radix-8 encoding for the 8-bit-per-cycle multiplier in the R4400 and R4300, replacing it with a simple array of binary CSA adders, probably because an array of base-2 adders is just much simpler. (Or at least that's what I think I can see on the R4300's die shot; I'm going to need to go back and take a much closer look at the multiplier.)

Anything I say about radix-256 in my first comment is probably nonsense, it's not radix-256 simply because it can do 8-bits in one cycle.

What I missed is there is nothing limiting you to one radix-8 addition per cycle (like the early MIPS designs), you can combine the radix-8 encoding with an array of adders. And you only need 1/3rd the adders that a base-2 multiplier would need. The entire point of using the radix-8 encoding is that there is only one carry every 3 bits.

You are probably right. This trick with the dedicated 3x multiplier is probably still used today.


There were a lot of different algorithms in use back then. I don't know if there is a winner used in modern chips.

No question, but just in general appreciation for the wonderful content you put out there that we get to enjoy!

This is probably a basic question, but is this for floating point multiplies? Isn't the part thats being multiplied less than 64 bits because you also add the exponents?

Yes, this is for floating point multiplies. But internally, the FPU uses x86 extended-precision numbers, 80-bit floats, with 64 bits of significand and 15 exponent bits. So the multiplier is dealing with 64-bit numbers, plus a few more bits to support rounding correctly.

https://en.wikipedia.org/wiki/Extended_precision#x86_extende...


Ken, maybe it's time to publish a book?

That would require me to focus on one subject :-)

Then publish many books?

In ancient times it would have been nice to curl up with a nice scroll, later on with a nice bound book; now we are selecting the background colors on our e-readers. These articles are hard-core good.

I would like to hear much more about the R4200i than the 286.


I have to respect his choice to focus on the 8086 and the Pentium. I would think he considers the 80286 a brain-dead footnote, interesting as to what went wrong and what went terribly wrong.

I used to teach binary math to kids, and at a county fair I was upstaged by an 11 year old girl who demonstrated both multiplication by powers of two and division of powers of two.

"What does your father do for a living?'

"He is the director of the laser fusion project!"

"Oh."


The 286 could process twice as many instructions per clock as an 8086. According to Wikipedia, an uplift similar to the 80486 and the Pentium over their predecessors.

I only tenuously understand this so feel free to ignore the question if it's too asinine but:

> (What about ×5? If you can compute ×3, you can subtract that from ×8 to get ×5.)

Why can't you do the same to subtract x4 from x7 to get x3?


Because x8 is simply shifting 3 bits left, so it's "free" (fixed-size shifts are very simple to implement), whereas generating the x7 would require additional circuitry, which would also be more expensive than the x3 was (since it's another shift+add beyond what x3 required).

x7 is computed by x8 - x, so you get x8 - x - x4 which is 2 subtractions (and 2 delays). You only want 1 delay.

The question would be why isn't it 4x - 1x?

The trick is that Booth's algorithm gives you the ×8 term for free; you multiply by one more in the next base-8 digit. You can put either ×4 or -×1 into a term, but you can't put both of them into one term. So you can do ×8 - ×1 but you can't do ×4 - ×1.
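To make the recoding concrete, the digit table looks something like this (my illustration of the scheme described above, ignoring the carry coming in from the previous digit):

    # Each base-8 digit d of the multiplier becomes (multiple, carry_into_next_digit).
    # Only 0, ±1, ±2, ±3 and 4 ever appear, so ×3 is the one "hard" multiple to
    # precompute; ×5 never shows up because 5 is handled as 8 - 3, 6 as 8 - 2, 7 as 8 - 1.
    RECODE = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 0), 4: (4, 0),
              5: (-3, 1), 6: (-2, 1), 7: (-1, 1)}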

I guess the decimal equivalent would be multiplying by 10 vs multiplying by 5.

Or for hexadecimal: multiplying by 0x10 (16) or by 2, 4 or 8.


I suspect it's logically equivalent, but one bit wider in hardware.

Binary 4x is left shift by 2, rather than 2x which is by one.

Adding or subtracting in 2's complement space is the same operation for the bits of the word.


Not really a question, just one phrase that left me scratching my head a bit:

> The downside to radix-8 multiplication is that multiplying by a number from 0 to 7 is much more complicated than multiplying by 0 or 1, which is almost trivial.

I think you can leave out the "almost" there - multiplying by 0 is 0 and multiplying by 1 is the number itself, as shown in the example, so I would definitely call it trivial.


"This ×3 multiplier contains roughly 9000 transistors, a bit more than an entire Z80 microprocessor (1976). Keep in mind that the ×3 multiplier is a small part of the floating-point multiplier, which is part of the floating-point unit in the Pentium. Thus, this small piece of a feature is more complicated than an entire microprocessor from 17 years earlier, illustrating the incredible growth in processor complexity."

That's the pace of performance growth that led software to become bloated today; next year's performance improvements would cover up most of the sins of failing to think critically about algorithms and data flow context/locality.

Today (as far as I've read) we're at the practical limit for what's reasonable to do with silicon semiconductor technology and our present understanding of physics. The pendulum needs to swing the other direction; it's time for computers to work smarter not harder.


> Today (as far as I've read) we're at the practical limit for what's reasonable to do with silicon semiconductor technology and our present understanding of physics

We’ve been at that limit for decades.


The fabs propped up the corpse of Moore's Law by throwing mountains of cash at expanding transistors into the third dimension: FinFET, GAA, CFET, etc. That has kept the party going a little while longer than it would have lasted, but it's a one-time deal since there are no more dimensions to expand into.

…but that's how it's always worked. Moore's law is dead, we're at the limit of everything, oh hey, Moore's lawn limps by again because someone did something clever.

This is the kind of comment that will keep me laughing for weeks. Moore's lawn is in fact dead. We need to go back to calling what Moore said an observational insight.

That's never how it worked. Moore's law was never dead. People are just endlessly confused about what Moore's law is.

What ended was Dennard scaling around 2006. Roughly that frequency would keep going up as feature size went down. But because so many people are confused about what is what, you see a crappy muddled message.

Moore's law has been going strong. It must end eventually, current predictions are that it will be in a decade or two.


It's starting to get a bit old that whenever I see Moore's law mentioned, I'll usually also run into a spiel about how people have the wrong idea about what it actually refers to, and that it's holding up just fine. This is despite the gen-on-gen and year-on-year performance improvements of computer hardware very clearly tapering off in recent memory.

Maybe debating what always-somehow-wrong law to cite should not be the focus? Like it's very clear to me that being technically correct about what Moore's law or the Dennard scaling refers to is leaps and bounds less important than the actual, practical computing performance trends that have been observable in the market.


What we see in the market is caused by software bloat. Chips are gaining performance faster than ever in absolute terms.

I think Moore’s law should be avoided altogether when discussing progress in this area, because it’s hard to understand the effects of doubling intuitively. Rice grains on chessboards and all that.

One might think ”Moore’s law is slowing down” means progress was faster before and slower now, when it is in fact completely opposite.

If you consider the 20 years between the intel 286 and the pentium 3, transistor count went from about 150 thousand to 10 million.

Today (using the ryzen 5950 and 7950 as examples), we got 5 Billion more transistors in just 2 years.

So in 2 years we added 500 times more transistors to our cpus than the first 20 years of “prime Moore’s law” did.

This enormous acceleration of progress is increasingly unnoticed due to even faster increases in software bloat, and the fact that most users aren’t doing things with their computers where they can notice any improvements in performance.


> Chips are gaining performance faster than ever in absolute terms.

But this is not what I as a consumer end up seeing at all. Consider the RTX 5090. Gen-on-gen (so, compared to the 4090), for 20-30% more money, using 20-30% more power, you get 20-30% more raster performance. Meaning the generational improvement is 0, software notwithstanding.

> If you consider the 20 years between the intel 286 and the pentium 3, transistor count went from about 150 thousand to 10 million. Today (using the ryzen 5950 and 7950 as examples), we got 5 Billion more transistors in just 2 years.

Why would you bring absolute values into comparison with a relative value? Why compare the 286 and the P3 and span 20 years when you can match the 2 year timespan of your Ryzen comparison, and pit the P2 ('97) against the P3 ('99) instead? Mind you, that would reveal a generational improvement of 7.5M -> 28M transistors, a relative difference of +273%! Those Ryzens went from 8.3B to 13.2B, a +59% difference. But even this is misleading, because we're not considering die area or any other parameter.


>But this is not what I as a consumer end up seeing at all. Consider the RTX 5090. Gen-on-gen (so, compared to the 4090), for 20-30% more money, using 20-30% more power, you get 20-30% more raster performance. Meaning the generational improvement is 0, software notwithstanding.

The 4090 and 5090 are the same generation in reality, using the same process node. The 5090 is a bit larger but only has about 20% more transistors of the same type compared to the 4090. Which of course explains the modest performance boosts.

Nvidia could have made the 5090 on a more advanced node but they are in a market position where they can keep making the best products on an older (cheaper) node this time.

>Why would you bring absolute values into comparison with a relative value? Why compare the 286 and the P3 and span 20 years when you can match the 2 year timespan of your Ryzen comparison, and pit the P2 ('97) against the P3 ('99) instead? Mind you, that would reveal a generational improvement of 7.5M -> 28M transistors, a relative difference of +273%!

That was my point though, to highlight how relative differences in percentages represent vastly different actual performance jumps. It quickly becomes meaningless since it's not the percentages that matter, it is the actual number of transistors.

To put it another way - If you take the first pentium with about 3 million transistors as a baseline, you can express performance increases in "how many pentiums are we adding" instead of using percentages, and note that we are adding orders of magnitude more "pentiums of performance" per generation now than we did 10 years ago.


> That was my point though, to highlight how relative differences in percentages represent vastly different actual performance jumps. It quickly becomes meaningless since it's not the percentages that matter, it is the actual number of transistors.

But it is the percentage that matters? If I have 10B transistors and I add 1B to it, the speedup I can expect is 10%. If I have 1B transistors and add 1B to it, the speedup I can expect is 100%. Tremendous difference. Why would I ever care about the absolute number of transistors added?


>But it is the percentage that matters? If I have 10B transistors and I add 1B to it, the speedup I can expect is 10%. If I have 1B transistors and add 1B to it, the speedup I can expect is 100%. Tremendous difference. Why would I ever care about the absolute number of transistors added?

Because it's the transistors that actually matter for performance.

If a unit of work (number of files compiled, number of triangles generated or whatever else) takes 1 billion transistors to complete in 1 second, you have gained the same amount of work per second by adding 10% to the latter as you gained by adding 100% to the former.

How much performance you need to feel a difference in a given workload is a separate point, and note that usually the workload changes with the hardware. Compiling the linux kernel in 2025 is a different workload than it was in 2005, for example, and running quake 3 is a different workload than cyberpunk 2077.

If you play a modern game, you don't notice a difference between a 10 year old GPU and a 12 year old GPU - even though one might be twice as fast, they might both be in the single digits of FPS which feels equally useless.

So we gain more from the hardware than we used to, but since the software is doing more work we're not noticing it as much.


I legitimately just do not see the utility of framing the topic this way.

Benchmarking efforts usually involve taking the same amount or type of work, and comparing the runtime durations or throughputs in turn for different hardware.

Rasterization performance benchmarks for the 5090 revealed exactly the same +20% difference we see in transistor count. This is why I do not see the utility in remarking that in absolute terms we're adding more transistors than ever, because this is basically never what matters in practice. I have a set workload and I want it to go some amount faster.

Software sprawl is an issue no doubt, but that on its own is a separate discussion. It bears light relation with the absolute vs. relative differences discussion we're having here.

> How much performance you need to feel a difference in a given workload is a separate point

It was exactly the point I said at the start should be the point of focus. Maybe we're talking past one another, I don't know.


>It was exactly the point I said at the start should be the point of focus. Maybe we're talking past one another, I don't know.

I think we do - we agree that the symptom is that we don't experience the same gains now as we used to, and that is a problem.

My issue is the notion that this is caused by a slowdown in performance gains from the hardware side, when this is clearly not the case. A common complaint is along the lines of "we only got 30% when last time we got 50%", which completely ignores that the latter 30% is way more actual new performance than the previous 50%.

>I legitimately just do not see the utility of framing the topic this way.

IMO it's always useful to identify the actual reason for a problem and think about the fundamentals.

If the problem is that we're not experiencing the performance gains, we should be asking ourselves "Why does software feel slower today despite the hardware having 10x more performance".

Instead we complain about the hardware for not managing to add the equivalent of all previous performance gains every 2 years, because Moore's law observed that it did so in the beginning of the chessboard (so to speak).

Instead of wondering whether Moore's law is dying or not, we should question why Wirth's law seems to be immortal! ;)


You seem to be completely confused about why absolute vs relative matters. Moore's law literally states: "[Moore's law] is the observation that the number of transistors in an integrated circuit (IC) doubles about every two years".

This is literally a relative measurement. You cannot reason about Moore's Law in absolute changes.

The other poster has laid it out for you in the simplest terms: a 1 billion transistor increase could mean anything. It could be a 1000% improvement - which is absolutely humongous - or a 10% improvement, which is basically irrelevant. If you want to measure how impactful an increase is, you have to look at relative change. 1 billion transistors on its own means nothing. It is only interesting with respect to the number of transistors in the previous generation - which is a relative change measurement.

Say we are at Generation 1 with 100 billion transistors. By your reasoning, if we add 1 more billion of transistors to this, that's big. 1 billion transistors are a lot. But this is absolutely incorrect! Because we started out with 100 billion transistors, the change is actually irrelevant.


Take density per mm^2 going from 122.9 to 125.3: that is a 2% increase in a little over 2 years, which does not bode well for needing to double in the same period.

Moore’s law never mentioned die area, and not because Gordon thought die sizes would never grow.

> This enormous acceleration of progress is increasingly unnoticed due to even faster increases in software bloat

Can't resist throwing in this quote (from one of Wirth's students, I think):

"Software gets slower faster than hardware gets faster."


It is ultimately a market effect, the technical specifics are not really important and are even conflated by industry insiders. See my sibling comment.

Moore's law ends when the whole universe is a computer (which it already is).

https://hasler.ece.gatech.edu/Published_papers/Technology_ov...

Some view it as a doubling every 18 months, or in terms of cost per transistor (this has gone up with the smallest nodes).

It is roughly an exponential curve in the number of transistors we can use to make a "thing" with.

It is both a capability (can we make things of a certain number of transistors) and is it economically viable to build things of that size.

You could stay at the current node size and halve the cost of that wafer every 18 months and you would still be on the same curve. But it is easier in our economic system to decrease the node size, keeping the rest of the fixed wafer costs the same, and get 2x or 4x the density on the same lines.

If I get nerd sniped, I'd find the two video presentations one by Krste and another by Jim Keller where they unambiguously explain Dennard Scaling and Moore's Law in a way that is congruent with what I just said.


> Moore's law ends when the whole universe is a computer (which it already is).

I find "Moore's Second Law" interesting. At least the version I'm familiar with says that the cost of a semiconductor chip fabrication plant doubles every four years. See https://en.wikipedia.org/wiki/Moore%27s_second_law

It's interesting to contrast that trajectory with global GDP. At some point, either global economic growth has to accelerate dramatically to even produce one fab, or we have to leave the 'globe', i.e. go into space (but that's still universal GDP exploding), or that law has to break down.

It would be exceedingly funny (to me) if one of the first two possibilities held true, and the law accurately predicted either an AI singularity or some golden space age.


Also, chip fabs keep getting more expensive and taping out a chip design for those new fabs keeps getting more expensive. That makes today's semiconductor industry work quite differently from how things worked 30 years ago and means some segments see reduced or delayed benefits from the continued progression of Moore's Law. See eg. the surprisingly long-lasting commercial relevance of 28nm, 14nm, and 7nm nodes, all of which were used in leading products like desktop GPUs and CPUs for more years than Moore's Law would lead you to expect.

Moore's law will only end when we have practical optical computing, as it fills a critical technological niche.

It's not impossible, but I think quantum computing will only become practical later than optical.


> into the third dimension

Does this actually work? At some point, and this has been the case for a while, you're limited by thermals. You can't stack more layers without adding more cooling.


He's talking about how they've moved from planar transistors, where layers are just deposited on top of each other, to transistors with increasingly complex 3D structures[1] such as FinFET and Gate-All-Around, the latter having multiple nanowires passing through the gate like an ordered marble cake[2].

[1]: https://www.asml.com/en/news/stories/2022/what-is-a-gate-all...

[2]: https://anysilicon.com/the-ultimate-guide-to-gate-all-around...


There are also cooling and conduction paths to take into account. It was discussed in the design of the Xeon version of the i9, which had me considering clocking down the performance-core communication while throttling up the performance cores.

Your sources are excellent. ( Thank you so much for the links. )


Moore's law doesn't say anything about you having to power all your transistors for them to count.

I'm only half-joking: the brain gets a lot of its energy efficiency out of most of its parts not working all that hard most of the time; and we are seeing some glimpses of that in mobile processors, too.


Depends, there could be 11 dimensions to expand into.

Assuming those extra dimensions really exist (it is unproven), I think we are centuries or even millennia away from being able to make technological use of them, if we ever will be at all.

Expand into the time dimension, evaluate infinite execution paths by reversing the pipeline, rerunning, and storing the results in a future accumulator.

You give a whole new meaning to branch prediction. I knew you were going there, and I still did not avoid the brain twang.

> since there are no more dimensions to expand into.

Quantum computing is next, right?


That’s not how quantum computing works.

To elaborate:

Apart from the very niche application of factoring integers into prime numbers, there's scarcely any application known where quantum computers would even theoretically outperform classical computers. And even integer factoring is only remotely useful until people completely switch away from cryptography that relies on it.

The one useful application of quantum computing that I know of is simulating quantum systems. That's less useless than it sounds: a quantum computer can simulate not just itself (trivially), but also other quantum systems. In any case, the real-world use case for that is accelerating progress in materials science, not something you or I would use every day.


Depends on what limit exactly you are thinking about: the size of a transistor is still shrinking.

The speed of software bloat matching hardware improvement is known as Wirth's law: https://en.wikipedia.org/wiki/Wirth%27s_law

I think software bloat is growing faster, though.


I think the software bloat is being more affected by the speed of light. If only every damn thing didn't need internet access with its associated lag - often variable in timing...

Speed of light is very relevant for actual performance sensitive software, not just shitty overly chatty trash.

While in practice the latency is over an order of magnitude larger, the speed-of-light round trip between two sockets on the same motherboard is likely on the scale of 1-2 nanoseconds, which is already several cycles. Once you're talking about warehouse-sized datacenters, the speed-of-light delay can get into the microseconds even in a straight line through vacuum/air.
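Back-of-the-envelope, assuming roughly 20 cm of board between sockets:

    c = 299_792_458          # speed of light in m/s
    round_trip_m = 0.4       # ~20 cm each way between sockets (assumed)
    t_ns = round_trip_m / c * 1e9
    print(t_ns)              # ~1.3 ns, i.e. several cycles at 3-4 GHz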


Two of the brightest minds, Rear Admiral Grace Hopper and Seymour Cray, considered this, as has almost every silicon designer in the last 15 years. Feel free to point it out often.

The internet has largely replaced CD Roms, Floppies and DVDs.

The stuff you use a computer hard drive or flash drive for has remained consistent. Some items may have moved to the internet, but most games, OS utilities and whatnot are stored on local HDDs or SSDs. Always have been and likely will remain so.

I remember when data interchange meant burning a disc and walking a disc over to my buddy. XML will standardize that and simplify my workflow. /s


> Some items may have moved to the internet, but most games, OS utilities and whatnot are stored on local HDDs or SSDs. Always have been and likely will remain so.

Lots of Facebook and flash games are stored 'in the cloud' (but probably cached locally).

For many games, a big part of the game logic runs on some server somewhere, especially multiplayer games.


On the other hand, the multiplier is far more regular in structure than a Z80. The datapath on the Pentium is several times wider too.

A bit surprisingly, the Z80's datapath is 4 bits wide. The Pentium's FPU is more than 64 bits wide (extra bits for rounding and stuff). So, yes, there is a big difference in the datapath widths.

History of function calls:

- goto/jmp to instr

- vtable lookup

- hash and lookup in a dictionary

- run a Large Language Model


Luckily there is plenty room for improvement in most applications.

But why would we waste thousands of hours of programmers' time to optimize stuff, if we can instead waste millions if not billions of hours of users' time?

/s


you know it’s funny, the first time I ever read a book about compiled languages- I think it might have even been K&R C… it talked about the fact that code is run significantly more often than it’s compiled, and read more often than it’s written. Thus we should optimise for runtime speed and afterwards readability.

Yet on hacker news, the exact opposite sentiment is true.

Employee hours are expensive, so optimise as much as you possibly can and push as much time onto the user as they will tolerate. Focus primarily on readability to the detriment of all other factors. Heck, don't even bother compiling your program. Use an interpreter on the server.

Keep your local cost down, your remote cost up… after all, remote costs are not your costs, especially if every company is doing the same thing.

It’s ironic that I could read a book and it be so wrong, yet that book is the foundation Bible for many people’s programming journey.


I've been told it depends on management's goals.

If you are a startup (many HN readers are), employee hours are relatively expensive. The goal is to get a product out as fast as possible, and most of the time you fail at making a working product anyways. Meanwhile, you only have to spend money on servers once you have a successful product. Deferring the problem makes sense here.

On the other hand, I personally write code that'll be near guaranteed to run millions of times in some factory somewhere, and there's little benefit if I ship my code before the factory starts making stuff. I'll take the time to write high-performance and high-quality software because there's now a financial benefit to doing so.

But ultimately, it's my boss who gives me insight into what I should be doing. Not a book, because the book doesn't have context that [company] will lose tens of thousands of dollars every day that I don't fix [issue].


Well, programming is a tool, right? There are clearly incorrect ways to program, but that doesn't mean there's one correct way to program - because it depends on what you're trying to make.

It's incorrect to make a bookshelf that can't hold the weight of books, but IKEA isn't incorrect to make a crappy-but-adequate bookshelf that will get given away when you move, and a master carpenter isn't incorrect to make an expensive-but-beautiful bookshelf that will be used for many decades.

A seed-stage startup trying to see if anyone cares about their product at all should probably accept it being 100x slower than it theoretically could be. And the Linux maintainers should be working hard to shave milliseconds.


It feels like companies care a lot less about delivering quality.

This is just the difference between rational design and capitalist design. Capital must externalize costs to achieve profits, but it's not a rational approach to system design. This appears everywhere in the economy as capitalist pressures override rationality.

Because it saves lives. https://folklore.org/Saving_Lives.html:

"Well, let's say you can shave 10 seconds off of the boot time. Multiply that by five million users and thats 50 million seconds, every single day. Over a year, that's probably dozens of lifetimes. So if you make it boot ten seconds faster, you've saved a dozen lives. That's really worth it, don't you think?”


That assumes booting is a synchronous action instead of asynchronous, giving time to fetch a cup of coffee

By that logic, Electron apps kills people. Makes sense.

Somewhere along the causal chain of "what went wrong leading to this death", almost certainly there will be some cases of a slow app indirectly (maybe even directly in some cases) causing deaths. It's a question of what we mean by "cause".

That's not even counting loss of quality-adjusted life years from time spent staring at a screen waiting, like being stuck in traffic.


Billions of person-hours wasted... Windows Update.

Which is how it should be. There's a limited amount of resources in developing a chip, best to take advantage of the least costly route given those constraints.

Jonathan Blow has entered the chat...

Interesting choice of a radix-8 Booth multiplier that needs the x3 circuit. Seems like an area/perf tradeoff in order to push the fmax, which means they had a latency (cycles) constraint, since the same could be achieved via more pipelining.

Yes, it's a tradeoff. Many other FPUs at the time used radix-4 instead because it avoids the extra ×3 circuit. Pipelining is tricky because there isn't a nice place to break the multiplication array into two pieces.

Sure it takes a few iterations, but there is nothing inherently complex about it.

You feed the partial products into a carry-save adder tree and choose a level which minimizes the delay and the extra flop area/power. I am not sure about the Pentium technology, but whatever the area, it might be comparable to the x3 multiplier.

There is an implicit add in the x3 multiplier circuit that increases the delay, which was deemed more acceptable than the radix-4 delay. Considering all this, I strongly believe latency was a hard constraint (maybe for SW perf).


https://github.com/EI2030/Low-power-E-Paper-OS/blob/master/P...

8086: 29,000; 386: 275,000; 486: 1.2 million; Pentium: 3.1 million. The NSA entered the game sometime after 2000, I recall.


I'm missing something:

> The downside to radix-8 multiplication is that multiplying by a number from 0 to 7 is much more complicated than multiplying by 0 or 1, which is almost trivial. Fortunately, there are some shortcuts. Note that multiplying by 2 is the same as shifting the number to the left by 1 bit position, which is very easy in hardware—you wire each bit one position to the left. Similarly, to multiply by 4, shift the multiplicand two bit positions to the left.

> Multiplying by 7 seems inconvenient, but there is a trick, known as Booth's multiplication algorithm. Instead of multiplying by 7, you add 8 times the number and subtract the number, ending up with 7 times the number. You might think this requires two steps, but the trick is to multiply by one more in the digit to the left, so you get the factor of 8 without an additional step.

> Thus, you can get the ×7 by subtracting. Similarly, for a ×6 term, you can subtract a ×2 multiple and add ×8 in the next digit.

> Thus, the only difficult multiple is ×3. (What about ×5? If you can compute ×3, you can subtract that from ×8 to get ×5.)

If we can easily compute ×2 for the purposes of exploiting 6x = 8x - 2x, and we can easily compute ×4 for the purposes of exploiting 4x = 4x...

why is it more difficult than that to compute 3x as the sum of 2x and 1x, or the difference of 4x and 1x?

Also, if we can easily compute ×6 by any means, why not compute ×3 as a right shift of that value? That is admittedly an extra step, but the extra step is a shift.


For the 64-bit multiplication, you're adding 22 terms, one for each base-8 digit. (Think of grade-school multiplication.) Each term needs to be trivial to compute, so you can do a shift and/or a negation to get the term, but you can't do another addition.

The point is that ×3 is precomputed once, so you can then simply feed it into any of the 22 terms as needed. You can't do ×2 and ×1 into a term to get ×3 because every term would need another adder. In other words, you want one circuit to compute ×3 rather than 22 circuits.

For your ×6 question, this value is computed by putting negative ×2 into the term and conceptually adding one more to the next digit to get ×8. You can't right-shift this ×8 value since it's part of a completely different term.

Hopefully this makes sense. There are a lot of digits and sums flying around.
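A software analogue might help make the digit bookkeeping concrete. This is my own toy sketch of radix-8 recoding with a single precomputed ×3, for non-negative integers (an illustration of the idea, not the Pentium's circuit, which handles all the digits in parallel):

    def booth_radix8_multiply(a, b):
        triple_a = (a << 1) + a                  # the one "hard" multiple, computed once
        easy = {0: 0, 1: a, 2: a << 1, 3: triple_a, 4: a << 2}
        result, carry, shift = 0, 0, 0
        while b or carry:
            digit = (b & 0b111) + carry          # base-8 digit plus carry from recoding
            if digit >= 5:                       # 5..8 become -(8 - digit) here, +1 in the next digit
                result -= easy[8 - digit] << shift
                carry = 1
            else:
                result += easy[digit] << shift
                carry = 0
            b >>= 3
            shift += 3
        return result

    assert booth_radix8_multiply(123456789, 987654321) == 123456789 * 987654321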


How do you do a negation without doing an addition? Can you do better than -x = ~x + 1?

If you are negating a single term you can't.

If you are adding n terms using an adder tree, the 1 feeds into the carry-in of the next level (it is simply a concatenation, since the carry term is shifted left by 1). Only the final carry-propagate adder will have the add with 1.


I think that the +1 is accomplished simply by setting carry-in on the adder for that term, so you don't need a separate adder. (Disclaimer: I haven't looked at that part of the circuit yet.)

Another interesting question is how sign extension is implemented for a negative term. The problem is that you're adding 64-bit shifted terms to create a 128-bit result. When negating a 64-bit term, all the unused bits to the left need to be set to 1. But the chip doesn't do this. (Among other things, the adders would need to be much wider than 64 bits.) So it must be using a trick for the sign extension.
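In arithmetic terms, the carry-in trick is just folding the "+1" of two's-complement negation into an addition that is happening anyway. A quick numeric check (illustrative only; 64-bit width assumed):

    MASK = (1 << 64) - 1

    def add_or_sub(a, x, negate_x):
        # One 64-bit adder: to subtract, feed in the bitwise complement of x
        # and set the adder's carry-in to 1 instead of using a separate adder.
        operand = (~x & MASK) if negate_x else x
        return (a + operand + (1 if negate_x else 0)) & MASK

    assert add_or_sub(1000, 42, True) == 958
    assert add_or_sub(1000, 42, False) == 1042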


TLDR: 3a = (a << 1) + a

From the article:

> Multiplying a number by three is straightforward in binary: add the number to itself, shifted to the left one position. (As mentioned above, shifting to the left is the same as multiplying by two and is easy in hardware.) Unfortunately, using a simple adder is too slow.

So no, not really. In fact the entire point of article is arguably explaining why it is not that simple.


Can AI come up with this circuit?

Almost certainly, insofar as it’s already known and described in a number of places, including Wikipedia.

Though it would be interesting to run an experiment where you somehow remove all of that from the training data.

Perhaps in 10 years training runs will be cheap enough that on a whim you could just train a new AI model on data from only before the Pentium was released, and then see if it could come up with it?


A long time ago on the 6502 I was trying to figure out how to quickly multiply by 3 in assembly and I came up with:

1. Shift-left

2. Add original value

For multibyte values use rotate left which carries, and do the rotates from LSB to MSB. Then clear carry, then do the adds from LSB to MSB with carries. Should work for any word size.

Ofc I was a teenager but this seemed clever and simple to me.

Was (am) I just missing something?
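For what it's worth, in Python terms the multibyte routine you're describing looks something like this (my own sketch of the same shift-then-add-with-carry idea, little-endian byte order assumed):

    def times3_multibyte(value_bytes):
        # Pass 1: shift the whole value left one bit, LSB byte to MSB byte,
        # carrying the top bit of each byte into the next (the ASL/ROL chain).
        shifted, carry = [], 0
        for b in value_bytes:
            s = (b << 1) | carry
            shifted.append(s & 0xFF)
            carry = s >> 8
        shifted.append(carry)                    # room for the value to grow
        # Pass 2: clear carry, then add the original back in with carries (CLC/ADC chain).
        result, carry = [], 0
        for s, o in zip(shifted, value_bytes + [0]):
            t = s + o + carry
            result.append(t & 0xFF)
            carry = t >> 8
        return result

    print(times3_multibyte([0xF4, 0x01]))        # 0x01F4 * 3 = 0x05DC -> [0xDC, 0x05, 0x00]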


This is covered in the article. See the section "Implementing a fast ×3 circuit with carry lookahead"

Thank you. I stopped reading after the octal discussion.

For most assembly, I'm not sure how this has an advantage over two adds.

Shift add used to execute much faster than multiplication in 8088 days and people would use it when they had to multiply an int by a known scalar (shift took 4 clocks and add took 12).

Compilers still prefer LEA to multiply by 3, 5 and 9, which performs a shift+add, I believe.

A shift is faster than an add.

The M68020 added a dedicated barrel shifter, which could execute much faster than the M68000.

On the 6502 they are both 2 cycles. On the vast majority of processors I'm aware of, "simple" ALU operations don't vary in execution time.

On 8086 or ARM you can get a shift "for free" with your add, although on the 8086 you have to use the LEA instruction.

ADC #immediate is 2 cycles, as is ASL. Are you really multiplying a constant by 3 at runtime? Probably not. ADC zp is 3 cycles. Plus 2 cycles for the CLC.


