
Solid-state storage devices and the block layer

By Jonathan Corbet
October 4, 2010
Over the last few years, it has become clear that one of the most pressing scalability problems faced by Linux is being driven by solid-state storage devices (SSDs). The rapid increase in performance offered by these devices cannot help but reveal any bottlenecks in the Linux filesystem and block layers. What has been less clear, at times, is what we are going to do about this problem. In his LinuxCon Japan talk, block maintainer Jens Axboe described some of the work that has been done to improve block layer scalability and offered a view of where things might go in the future.

While workloads will vary, Jens says, most I/O patterns are dominated by random I/O and relatively small requests. Thus, getting the best results requires being able to perform a large number of I/O operations per second (IOPS). With a high-end rotating drive (running at 15,000 RPM), the maximum rate possible is about 500 IOPS. Most real-world drives, of course, will have significantly slower performance and lower I/O rates.

SSDs, by eliminating seeks and rotational delays, change everything; we have gone from hundreds of IOPS to hundreds of thousands of IOPS in a very short period of time. A number of people have said that the massive increase in IOPS means that the block layer will have to become more like the networking layer, where every bit of per-packet overhead has been squeezed out over time. But, as Jens points out, time is not in great abundance. Networking technology went from 10Mb/s in the 1980s to 10Gb/s now, the better part of 30 years later. SSDs have forced a similar jump (three orders of magnitude) in a much shorter period of time - and every indication suggests that devices with IOPS rates in the millions are not that far away. The result, says Jens, is "a big problem."

This problem pops up in a number of places, but it usually comes down to contention for shared resources. Locking overhead which is tolerable at 500 IOPS is crippling at 500,000. There is contention at the hardware level as well; vendors of storage controllers have been caught by surprise by SSDs and are having to scramble to get their performance up to the required levels. The growth of multicore systems naturally makes things worse; such systems can create contention problems throughout the kernel, and the block layer is no exception. So much of the necessary work comes down to avoiding contention.
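To put those rates in perspective, here is a back-of-the-envelope sketch (Python, purely illustrative; the helper name is invented for this example) that converts an IOPS rate into the time available per request if requests were handled strictly one at a time:

```python
def per_request_budget_us(iops: float) -> float:
    """Microseconds available per I/O if requests are handled one at a time."""
    return 1_000_000 / iops


for rate in (500, 500_000, 1_000_000):
    budget = per_request_budget_us(rate)
    print(f"{rate:>9} IOPS -> {budget:8.1f} us per request")
```

At 500 IOPS each request has two milliseconds to itself; at 500,000 it has two microseconds, and at one million just one. Any fixed per-request cost, such as bouncing a lock's cache line between processors, goes from being lost in the noise to consuming a large share of that budget.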

Before that, though, some work had to be done just to get the block layer to recognize that it is dealing with an SSD and react accordingly. Traditionally, the block layer has been driven by the need to avoid head seeks; the use of quite a bit of CPU time could be justified if it managed to avoid a single seek. SSDs - at least the good ones - care a lot less about seeks, so expending a bunch of CPU time to avoid them no longer makes sense. There are various ways of detecting SSDs in the hardware, but they don't always work, especially with the lower-quality devices. So the block layer exports a flag under

    /sys/block/<device>/queue/rotational

which can be used to override the system's notion of what kind of storage device it is dealing with.
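As a minimal sketch of how an administrator's script might inspect or override that flag, assuming the usual convention that the file holds "1" for rotating storage and "0" otherwise (the function names and the `sysfs_root` parameter are inventions for this example; writing the flag requires root):

```python
from pathlib import Path


def rotational_path(device: str, sysfs_root: str = "/sys") -> Path:
    """Build the path to the queue/rotational attribute for a block device."""
    return Path(sysfs_root) / "block" / device / "queue" / "rotational"


def is_rotational(device: str, sysfs_root: str = "/sys") -> bool:
    """Return True if the block layer treats the device as rotating storage."""
    return rotational_path(device, sysfs_root).read_text().strip() == "1"


def set_rotational(device: str, rotational: bool, sysfs_root: str = "/sys") -> None:
    """Override the kernel's notion of the device type (needs root on real sysfs)."""
    rotational_path(device, sysfs_root).write_text("1" if rotational else "0")
```

A call like `set_rotational("sda", False)` would tell the block layer to treat a hypothetical device `sda` as an SSD; the `sysfs_root` argument exists only so the helpers can be exercised against a scratch directory.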

Improving performance with SSDs can be a challenging task. There is no single big bottleneck which is causing performance problems; instead, there are numerous small things to fix. Each fix yields a bit of progress, but it mostly serves to highlight the next problem. Additionally, performance testing is hard; results are often not reproducible and can be perturbed by small changes. This is especially true on larger systems with more CPUs. Power management can also get in the way of the generation of consistent results.

One of the first things to address on an SSD was queue plugging. On a rotating disk, the first I/O operation to show up in the request queue will cause the queue to be "plugged," meaning that no operations will actually be dispatched to the hardware. The idea behind plugging is that, by allowing a little time for additional I/O requests to arrive, the block layer will be able to merge adjacent requests (reducing the operation count) and sort them into an optimal order, increasing performance. Performance on SSDs tends not to benefit from this treatment, though there is still a little value to merging requests. Dropping (or, at least, reducing) plugging not only eliminates a needless delay; it also reduces the need to take the queue lock in the process.
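The sorting and merging that plugging makes possible can be illustrated with a toy model. This is not the kernel's algorithm, just a sketch under simplifying assumptions: requests are (start sector, sector count) pairs, and only exactly contiguous requests are merged.

```python
def merge_adjacent(requests):
    """Sort (start_sector, sector_count) requests and merge contiguous ones."""
    merged = []
    for start, count in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == start:
            # This request begins exactly where the previous one ends.
            prev_start, prev_count = merged[-1]
            merged[-1] = (prev_start, prev_count + count)
        else:
            merged.append((start, count))
    return merged


# Four requests accumulated while the queue was plugged.
plugged = [(108, 8), (0, 8), (100, 8), (8, 8)]
print(merge_adjacent(plugged))  # prints [(0, 16), (100, 16)]
```

Four operations become two, and they are issued in ascending sector order. On a rotating drive that saves seeks; on an SSD only the reduced operation count is of much use.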

Then, there is the issue of request timeouts. Like most I/O code, the block layer needs to notice when an I/O request is never completed by the device. That detection is done with timeouts. The old implementation involved a separate timeout for each outstanding request, but that clearly does not scale when the number of such requests can be huge. The answer was to go to a per-queue timer, reducing the number of running timers considerably.
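The idea behind a per-queue timer can be sketched in a few lines. This is a simplified model rather than the kernel's implementation: the class and method names are invented, time is passed in explicitly, and every request is assumed to share the same timeout, so submission order is also deadline order.

```python
from collections import OrderedDict


class QueueTimeouts:
    """One timer per queue instead of one timer per outstanding request."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.pending = OrderedDict()  # request id -> deadline, oldest first

    def submit(self, req_id, now):
        self.pending[req_id] = now + self.timeout

    def complete(self, req_id):
        self.pending.pop(req_id, None)

    def next_expiry(self):
        """When the single per-queue timer should fire, or None if idle."""
        return next(iter(self.pending.values()), None)

    def expire(self, now):
        """Called when the timer fires: return the requests that timed out."""
        expired = [r for r, deadline in self.pending.items() if deadline <= now]
        for r in expired:
            del self.pending[r]
        return expired
```

However many requests are in flight, only one timer is ever armed: the one for the oldest outstanding request.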

Block I/O operations, due to their inherently unpredictable execution times, have traditionally contributed entropy to the kernel's random number pool. There is a problem, though: the necessary call to add_timer_randomness() has to acquire a global lock, causing unpleasant systemwide contention. Some work was done to batch these calls and accumulate randomness on a per-CPU basis, but, even when batching 4K operations at a time, the performance cost was significant. On top of it all, it's not really clear that using an SSD as an entropy source makes a lot of sense. SSDs lack mechanical parts moving around, so their completion times are much more predictable. Still, for the moment, SSDs contribute to the entropy pool by default; administrators who would like to change that behavior can do so by changing the queue/add_random sysfs variable.
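The batching approach mentioned above can be modeled roughly as follows. This is only a sketch of the mechanism, not the kernel code: thread-local buffers stand in for per-CPU data, a Python list stands in for the entropy pool, and all names are invented.

```python
import threading


class BatchedEntropy:
    """Accumulate samples locally; take the global lock once per batch."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.global_lock = threading.Lock()
        self.pool = []               # stand-in for the global entropy pool
        self.lock_acquisitions = 0
        self.local = threading.local()  # stand-in for per-CPU storage

    def add_sample(self, sample):
        buf = getattr(self.local, "buf", None)
        if buf is None:
            buf = self.local.buf = []
        buf.append(sample)
        if len(buf) >= self.batch_size:
            self.flush()

    def flush(self):
        buf = getattr(self.local, "buf", None)
        if not buf:
            return
        with self.global_lock:
            self.lock_acquisitions += 1
            self.pool.extend(buf)
        buf.clear()
```

Batching cuts the number of global lock acquisitions by a factor of the batch size; as noted above, though, even large batches were not enough to make the cost disappear in practice.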

There are other locking issues to be dealt with. Over time, the block layer has gone from being protected by the big kernel lock to a block-level lock, then to a per-disk lock, but lock contention is still a problem. The I/O scheduler adds contention of its own, especially if it is performing disk-level accounting. Interestingly, contention for the locks themselves is not usually the problem; it's not that the locks are being held for too long. The big problem is the cache-line bouncing caused by moving the lock between processors. So the traditional technique of dropping and reacquiring locks to reduce lock contention does not help here - indeed, it makes things worse. What's needed is to avoid taking the lock altogether.

Block requests enter the system via __make_request(), which is responsible for getting a request (represented by a BIO structure) onto the queue. Two lock acquisitions are required to do this job - three if the CFQ I/O scheduler is in use. Those two acquisitions are the result of a lock split done to reduce contention in the past; that split, when the system is handling requests at SSD speeds, makes things worse. Eliminating it led to a roughly 3% increase in IOPS with a reduction in CPU time on a 32-core system. It is, Jens says, a "quick hack," but it demonstrates the kind of changes that need to be made.

The next step for this patch is to drop the I/O request allocation batching - a mechanism added to increase throughput on rotating drives by allowing the simultaneous submission of multiple requests. Jens also plans to drop the allocation accounting code, which tracks the number of requests in flight at any given time. Counting outstanding I/O operations requires global counters and the associated contention, but it can be done without most of the time. Some accounting will still be done at the request queue level to ensure that some control is maintained over the number of outstanding requests. Beyond that, there is some per-request accounting which can be cleaned up and, Jens thinks, request completion can be made completely lockless. He hopes that this work will be ready for merging into 2.6.38.

Another important technique for reducing contention is keeping processing on the same CPU as often as possible. In particular, there are a number of costs which are incurred if the CPU which handles the submission of a specific I/O request is not the CPU which handles that request's completion. Locks are bounced between CPUs in an unpleasant way, and the slab allocator tends not to respond well when memory allocated on one processor is freed elsewhere in the system. In the networking layer, this problem has been addressed with techniques like receive packet steering, but, unlike some networking hardware, block I/O controllers are not able to direct specific I/O completion interrupts to specific CPUs. So a different solution was required.

That solution took the form of smp_call_function(), which performs fast cross-CPU calls. Using smp_call_function(), the block I/O completion code can direct the completion of specific requests to the CPU where those requests were initially submitted. The result is a relatively easy performance improvement. A dedicated administrator who is willing to tweak the system manually can do better, but that takes a lot of work and the solution tends to be fragile. This code - which was merged back in 2.6.27 and made the default in 2.6.32 - is an easier way that takes away a fair amount of the pain of cross-CPU contention. Jens noted with pride that the block layer was not chasing the networking code with regard to completion steering - the block code had it first.
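A toy model of completion steering, with the cross-CPU call reduced to appending to a per-CPU list (the class and its methods are hypothetical and bear no relation to the kernel's data structures):

```python
from collections import defaultdict


class CompletionSteering:
    """Route each completion back to the CPU that submitted the request."""

    def __init__(self):
        self.submit_cpu = {}                    # request id -> submitting CPU
        self.per_cpu_done = defaultdict(list)   # CPU -> completions queued for it

    def submit(self, req_id, cpu):
        self.submit_cpu[req_id] = cpu

    def on_interrupt(self, req_id, irq_cpu):
        """Interrupt arrived on irq_cpu; hand the completion to its home CPU."""
        home = self.submit_cpu.pop(req_id)
        self.per_cpu_done[home].append(req_id)
        return home != irq_cpu  # True if a cross-CPU call was needed
```

The point of the exercise is that the request's data, and the memory it was allocated from, are touched again by the same processor that touched them at submission time, wherever the interrupt happened to land.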

On the other hand, the blk-iopoll interrupt mitigation code was not just inspired by the networking layer - some of the code was "shamelessly stolen" from there. The blk-iopoll code turns off completion interrupts when I/O traffic is high and uses polling to pick up completed events instead. On a test system, this code reduced 20,000 interrupts/second to about 1,000. Jens says that the results are less conclusive on real-world systems, though.
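The mitigation scheme can be simulated with a small model. It is an assumption-laden sketch rather than a description of blk-iopoll itself: time is divided into poll intervals, the first completion raises an interrupt, the handler then polls with a fixed budget per interval, and interrupts are re-enabled once a poll comes up short of its budget.

```python
def count_interrupts(arrivals, budget):
    """Count interrupts taken; arrivals[i] = completions arriving in interval i."""
    interrupts = 0
    polling = False
    backlog = 0
    for n in arrivals:
        backlog += n
        if not polling:
            if backlog == 0:
                continue
            interrupts += 1   # first completion raises one interrupt
            polling = True    # handler masks interrupts and starts polling
        handled = min(backlog, budget)
        backlog -= handled
        if handled < budget:  # ran dry: go back to interrupt mode
            polling = False
    return interrupts


print(count_interrupts([8] * 5, budget=4))      # busy device
print(count_interrupts([1, 0, 1, 0], budget=4))  # mostly idle device
```

Under sustained load the model takes a single interrupt for forty completions; when traffic is light, it falls back to one interrupt per completion, which is the behavior one wants in both cases.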

An approach which "has more merit" is "context plugging," a rework of the queue plugging code. Currently, queue plugging is done implicitly on I/O submission, with an explicit unplug required at a later time. That has been the source of a lot of bugs; forgetting to unplug queues is a common mistake to make. The plan is to make plugging and unplugging fully implicit, but give I/O submitters a way to inform the block layer that more requests are coming soon. It makes the code more clear and robust; it also gets rid of a lot of expensive per-queue state which must be maintained. There are still some problems to be solved, but the code works, is "tasty on many levels," and yields a net reduction of some 600 lines of code. Expect a merge in 2.6.38 or 2.6.39.
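The shape of such an interface might look something like the following sketch, in which a Python context manager plays the role of the "more requests are coming" hint. The names are invented, nesting is not handled, and none of this reflects the actual patch.

```python
from contextlib import contextmanager


class Queue:
    """Toy request queue where plugging is scoped to the submitter's context."""

    def __init__(self):
        self.dispatched = []  # batches sent to the "hardware"
        self._plug = None

    def submit(self, request):
        if self._plug is not None:
            self._plug.append(request)        # held back: more is coming
        else:
            self.dispatched.append([request])  # dispatched immediately

    @contextmanager
    def plugged(self):
        self._plug = []
        try:
            yield
        finally:
            # Leaving the context always unplugs, even on error.
            batch, self._plug = self._plug, None
            if batch:
                self.dispatched.append(batch)


q = Queue()
q.submit("a")
with q.plugged():
    q.submit("b")
    q.submit("c")
q.submit("d")
print(q.dispatched)  # prints [['a'], ['b', 'c'], ['d']]
```

Because the unplug happens when the submitter's context ends, there is no separate call to forget, and the batch lives with the submitter rather than as state in the queue.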

Finally, there is the "weird territory" of a multiqueue block layer - an idea which, once again, came from the networking layer. The creation of multiple I/O queues for a given device will allow multiple processors to handle I/O requests simultaneously with less contention. It's currently hard to do, though, because block I/O controllers do not (yet) have multiqueue support. That problem will be fixed eventually, but there will be some other challenges to overcome: I/O barriers will become significantly more complicated, as will per-device accounting. All told, it will require some major changes to the block layer and a special I/O scheduler. Jens offered no guidance as to when we might see this code merged.
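No design details were given in the talk, but the general idea can be sketched as a set of per-CPU submission queues feeding a smaller set of hardware queues. Everything here, including the modulo mapping from CPUs to hardware queues, is an assumption made for illustration.

```python
class MultiQueue:
    """Per-CPU software queues mapped onto a smaller set of hardware queues."""

    def __init__(self, nr_cpus, nr_hw_queues):
        self.nr_hw_queues = nr_hw_queues
        self.sw_queues = [[] for _ in range(nr_cpus)]

    def submit(self, cpu, request):
        # Each CPU appends only to its own queue, so no shared lock is needed.
        self.sw_queues[cpu].append(request)

    def hw_queue_for(self, cpu):
        return cpu % self.nr_hw_queues

    def dispatch(self):
        """Drain every software queue into its hardware queue."""
        hw = [[] for _ in range(self.nr_hw_queues)]
        for cpu, queue in enumerate(self.sw_queues):
            hw[self.hw_queue_for(cpu)].extend(queue)
            queue.clear()
        return hw
```

The sketch also hints at why barriers and per-device accounting get harder: once requests are spread over several independent queues, there is no longer a single ordered stream or a single counter to consult.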

The conclusion which comes from this talk is that the Linux block layer is facing some significant challenges driven by hardware changes. These challenges are being addressed, though, and the code is moving in the necessary direction. By the time most of us can afford a system with one of those massive, 1 MIOPS arrays on it, Linux should be able to use it to its potential.

Index entries for this article
Kernel: Block layer/Solid-state storage devices
Kernel: Solid-state storage devices
Conference: LinuxCon Japan/2010



Solid-state storage devices and the block layer

Posted Oct 4, 2010 23:38 UTC (Mon) by nix (subscriber, #2304) [Link] (30 responses)

Still, for the moment, SSDs still contribute to the entropy pool by default; administrators who would like to change that behavior can do so by changing the queue/add_random sysfs variable.
Well, yes, but this isn't in any released kernel yet.

bogus random entropy sources

Posted Oct 5, 2010 5:24 UTC (Tue) by shemminger (subscriber, #5739) [Link] (29 responses)

Actually many types of devices also bogusly report that they provide
entropy when they do not. For example, Xen drivers are purely virtual
and therefore deterministic.

bogus random entropy sources

Posted Oct 5, 2010 6:03 UTC (Tue) by butlerm (subscriber, #13312) [Link] (2 responses)

"For example, Xen drivers are purely virtual and therefore deterministic."

What does that matter, if they ultimately connect to underlying physical devices which are not?

bogus random entropy sources

Posted Oct 5, 2010 6:42 UTC (Tue) by smurf (subscriber, #17840) [Link] (1 responses)

_If_. They might not.

bogus random entropy sources

Posted Oct 6, 2010 17:01 UTC (Wed) by drag (guest, #31333) [Link]

I know that I have had problems with ssh hanging on new nodes on xen due to lack of entropy. But I think this is no longer a problem.

bogus random entropy sources

Posted Oct 5, 2010 10:25 UTC (Tue) by nix (subscriber, #2304) [Link] (25 responses)

And many devices which often do provide entropy (e.g. the network) were specified to never provide any, because an attacker can sometimes control *some* of the packets on it. I never understood that, and now my headless and virtual systems have next to no entropy at all (or did, until I got an entropy key, and now I don't care where the kernel gets its entropy sources from :) )

bogus random entropy sources

Posted Oct 5, 2010 15:51 UTC (Tue) by jzbiciak (guest, #5246) [Link] (21 responses)

Off-topic rant:

I don't understand why more processors don't include a proper hardware random number generator. It's a classic case of something that is significantly easier to do in hardware, I'd think.

I mean, sure, you could try to derive a few bits of entropy here, a few bits there from what is otherwise a deterministic system. It's maddeningly frustrating, though, and you have to apply new thought and new techniques every time your system assumptions change. Your case is just such a case, and it sounds like you just punted to a dedicated hardware solution.

Modern CPUs have accelerators for all sorts of things as standard equipment. Why not random numbers? We spend countless millions of transistors on ever larger caches and datapaths. Surely they could spare a few for a really high quality true random number generator.

bogus random entropy sources

Posted Oct 5, 2010 17:09 UTC (Tue) by strappe (guest, #53440) [Link] (10 responses)

All VIA x86-compatible processors since the C3 (introduced 2003?) have included a hardware random number generator based on quantum effects; it produces millions of random bits each second, and is accessible with a non-privileged instruction. AFAIK, their opcode choice is unused by either AMD or Intel, so those companies could implement similar functionality (an infinitesimal bit of silicon) and we would have a standard solution at least across the x86 architecture going forward.

bogus random entropy sources

Posted Oct 5, 2010 17:22 UTC (Tue) by jzbiciak (guest, #5246) [Link] (9 responses)

Yeah, I was aware of VIA's accelerator. It boggles me that Intel and AMD bothered to put AES acceleration on their chips without getting something more basic and generic like random numbers on there too. Is it a verification issue? What's holding them back?

bogus random entropy sources

Posted Oct 5, 2010 18:24 UTC (Tue) by ejr (subscriber, #51652) [Link] (3 responses)

It's not as easy as it seems. You can generate random bits, but they are highly skewed, with different skews depending on the temperature, etc. You need to extract a more regular randomness from them, and extractors can require a good bit of space. The extractors I know (theory, not actual architecture) also must be running continually, sucking power.

bogus random entropy sources

Posted Oct 5, 2010 19:10 UTC (Tue) by jzbiciak (guest, #5246) [Link] (2 responses)

VIA's approach on the C3 doesn't sound too unwieldy. This white paper analyzing the generator's output makes for an informative read. The punch line is that it looks like a pretty reasonable source of entropy as long as you do appropriate post processing. The random numbers it generates aren't caveat free, but they're a heckuva lot better than disk seeks and keypresses.

bogus random entropy sources

Posted Oct 6, 2010 8:40 UTC (Wed) by pcampe (guest, #28223) [Link] (1 responses)

I don't understand why they didn't follow the guidelines in NIST Standard 800-22 (rev 1a), "A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications".

bogus random entropy sources

Posted Oct 6, 2010 13:56 UTC (Wed) by jzbiciak (guest, #5246) [Link]

Probably because they didn't have a time machine. ;-) The document you reference was written this year. The white paper I reference was written in 2003. And if you meant Rev 1, that didn't come out until 2008.

Maybe you meant the original 800-22? That one came out in 2001.

(Dates came from here.)

bogus random entropy sources

Posted Oct 5, 2010 18:26 UTC (Tue) by mpr22 (subscriber, #60784) [Link] (1 responses)

The AES accelerator probably lets them tick a required-feature box for some government programme or other.

bogus random entropy sources

Posted Oct 5, 2010 18:46 UTC (Tue) by jzbiciak (guest, #5246) [Link]

If anything, it would make it harder for them to export the chips outside of the United States without getting special approval from the Feds. Cryptographic hardware is a munition under ITAR.

I remember there was some concern a while back when we put our AES implementation in ROM on some devices, because it calculated AES "too quickly" for some people's taste. We ended up making that part of the ROM protected (i.e. not user accessible) so that it was only used for boot authentication.

bogus random entropy sources

Posted Oct 6, 2010 11:27 UTC (Wed) by intgr (subscriber, #39733) [Link] (2 responses)

> without getting something more basic and generic like random numbers on there too.

The solution has always been obvious to cryptographers. Use a solid cryptographical pseudorandom RNG; as long as there is _some_ truly random data in its input -- 128 or so bits worth -- the output will always be irreversible. As long as this randomness exists, it doesn't matter that the attacker can predict all other input.

In fact, hardware RNGs should _never_ be used directly, because there may be manufacturing flaws or deliberate sabotage. And unlike deterministic algorithms like AES, non-deterministic hardware RNG sources are almost impossible to verify completely. Also it's really quite easy to replace the hw RNG with a deterministic PRNG that passes all randomness tests, yet whose output is entirely predictable to its designer.

So at most, the hw RNG is just one of several randomness sources on any system. As such cryptographers in general don't consider it worthwhile -- only on diskless embedded systems where there really aren't any entropy sources.

Unfortunately /dev/random is a poor legacy choice in Linux that goes against this concept.

bogus random entropy sources

Posted Oct 7, 2010 12:24 UTC (Thu) by nix (subscriber, #2304) [Link] (1 responses)

"Diskless embedded systems" of course includes "all virtual machines". So there are a lot of them.

bogus random entropy sources

Posted Oct 7, 2010 12:48 UTC (Thu) by intgr (subscriber, #39733) [Link]

For virtual machines you already have a paravirtual RNG device called 'virtio-rng' (CONFIG_HW_RANDOM_VIRTIO).

But in general, virtual machine disk I/O still reaches a physical disk sooner or later, so entropy can be successfully gathered from interrupt timings. In some virtualization scenarios, you wouldn't want the VM to access host-CPU-specific features anyway.

bogus random entropy sources

Posted Oct 5, 2010 19:01 UTC (Tue) by patrick_g (subscriber, #44470) [Link] (3 responses)

>>> I don't understand why more processors don't include a proper hardware random number generator. It's a classic case of something that is significantly easier to do in hardware, I'd think.

I think Intel is working on this.
See this link: http://www.technologyreview.com/computing/25670/

bogus random entropy sources

Posted Oct 6, 2010 3:36 UTC (Wed) by PaulWay (guest, #45600) [Link] (2 responses)

Purely an anecdote, but the other day I had the occasion to use shred to shred two disks at once. The machine was a modern Intel Core Quad system, and the disks were writing at 60MBytes/sec with 3% CPU load. Since modern shred just writes a number of layers of pure random data from /dev/urandom, I have to assume that there was either hardware crypto or randomness generation going on there. Who knew?!

Have fun,

Paul

bogus random entropy sources

Posted Oct 6, 2010 3:47 UTC (Wed) by jzbiciak (guest, #5246) [Link] (1 responses)

Well, /dev/urandom doesn't block when the kernel entropy pool runs out. The hardware crypto acceleration may've been getting used, but that's orthogonal to the question of gathering entropy.

bogus random entropy sources

Posted Oct 6, 2010 19:34 UTC (Wed) by paulj (subscriber, #341) [Link]

Hehe, so shred was using entropy collected from the disk controllers, collected from shred writing to disks..

bogus random entropy sources

Posted Oct 5, 2010 21:58 UTC (Tue) by nowster (subscriber, #67) [Link] (2 responses)

> I don't understand why more processors don't include a proper hardware random number generator.

It's actually a hard problem to provide a cheap reliable hardware random number generator. If you look at the effort that a device like Simtec's Entropy Key takes to ensure that each chunk of randomness it delivers is truly random, you'll see why a random number generator is not something that a CPU designer should drop on a spare corner of a CPU die last thing on a Friday afternoon. Semiconductor junction noise generators can be affected by environmental influences: an RNG on a CPU die running hot might have a bias compared with the same one when the CPU is idle and cooler.

bogus random entropy sources

Posted Oct 6, 2010 3:51 UTC (Wed) by jzbiciak (guest, #5246) [Link] (1 responses)

I linked this whitepaper above on the technique VIA used on its C3. They used multiple free-running oscillators to gather entropy. The resulting output varies in quality, from 0.75 to 0.99 bits of entropy per output bit, depending on the decimation factor used and whether or not you enable von Neumann whitening.

Given that it generates entropy in the megabits/second range, this is several orders better than you can get from hard disk seeks and user keystrokes, even if you have to throw most of the numbers away. And, given the high apparent entropy of the raw bits, you don't really need to throw many away at all.
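For reference, von Neumann whitening is simple enough to sketch in a few lines of Python (the function name is invented here). Bits are taken in non-overlapping pairs: 01 yields 0, 10 yields 1, and 00 or 11 yield nothing.

```python
def von_neumann(bits):
    """Debias a bit stream: 01 -> 0, 10 -> 1; 00 and 11 are discarded."""
    out = []
    for a, b in zip(bits[0::2], bits[1::2]):
        if a != b:
            out.append(a)
    return out


print(von_neumann([0, 1, 1, 0, 1, 1, 0, 0, 1, 0]))  # prints [0, 1, 1]
```

The technique removes bias only if the input bits are independent of one another; it does nothing about correlation between bits, and it always throws away at least half of the input.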

bogus random entropy sources

Posted Oct 7, 2010 12:28 UTC (Thu) by nix (subscriber, #2304) [Link]

From all accounts I've read, the entropy of the numbers derived from the C3's RNG hardware sucks rather badly, probably because there are so many sources of regular noise in a CPU that it's hard to stop some of them leaking in. The figures I've heard are *well* below 0.75, more like 0.4 if you're lucky. And IIRC the C3 doesn't bother to validate them either (certainly from the description in the whitepaper they don't), and because the pair of oscillators comprise a single system, if it breaks down or becomes coupled to something external you *also* cannot tell.

bogus random entropy sources

Posted Feb 6, 2012 21:33 UTC (Mon) by tconnors (guest, #60528) [Link] (2 responses)

> Modern CPUs have accelerators for all sorts of things as standard equipment. Why not random numbers? We spend countless millions of transistors on ever larger caches and datapaths. Surely they could spare a few for a really high quality true random number generator.

Because random number generators are only used for cryptography, and only terrorists use cryptography. Are you a terrorist?

bogus random entropy sources

Posted Feb 6, 2012 21:40 UTC (Mon) by dlang (guest, #313) [Link]

some chips do have high quality random number generators built in.

bogus random entropy sources

Posted Feb 7, 2012 7:50 UTC (Tue) by cladisch (✭ supporter ✭, #50193) [Link]

The Windows 8 Hardware Certification Requirements demand that "Connected Standby"-capable devices (i.e., mobile ones) have encryption acceleration and a RNG.

> Business Justification:
> Core cryptographic functions are used in Windows to provide platform integrity as well as protection of user data.
(note the priorities)

In completely unrelated news, all recent AMD and Intel processors support AES-NI, and Intel has announced that Ivy Bridge processors will have a RNG.

bogus random entropy sources

Posted Oct 7, 2010 14:34 UTC (Thu) by BenHutchings (subscriber, #37955) [Link]

Most network controllers now implement interrupt moderation (deferring interrupts so that multiple packets can be handled at once). With a high enough packet rate, they will interrupt at regular and predictable intervals.

Getting more entropy

Posted Oct 10, 2010 11:55 UTC (Sun) by kleptog (subscriber, #1183) [Link] (1 responses)

A while back just for the fun of it I wrote a kernel driver whose goal was to extract entropy from the timer interrupt. After all, if anything is predictable, then it'd have to be the timer interrupt.

The point is that while the interrupt is predictable, between the time that the interrupt fires and the driver finally gets run you have cache misses at various levels, PCI bus transfers, DRAM refresh cycles and even just hyperthreading making things very unpredictable. Conclusion: if there's predictability here, I couldn't find it (there's a toolkit for estimating randomness, it concluded that the output was indistinguishable from real random data).

The basic idea was to just use the last few bits of the cycle counter, don't worry about the high order bits. The last bit was enough, but even taking the last four bits didn't show any patterns. It might be worth making such a driver for the purpose of giving otherwise entropy starved machines something to work with. I imagine within VMs the cycle counter becomes even more variable, due to contention with things outside the VM.

Getting more entropy

Posted Oct 10, 2010 21:56 UTC (Sun) by man_ls (guest, #15091) [Link]

I guess that the problem is to prove that an attacker cannot influence the timers so that the result is predictable. For example a guy on a different VM doing odd things with the same CPU. As it is hard to prove a negative statement of this kind, then people may tend to distrust such a source of entropy, even if it sounds really interesting.

Solid-state storage devices and the block layer

Posted Oct 5, 2010 1:19 UTC (Tue) by dgc (subscriber, #6611) [Link]

Interesting read. It looks like we are finally getting towards the sort of infrastructure I thought was necessary to support high end SSDs. From the presentation I gave at the 2008 Filesystems and IO workshop (cut-n-paste from my slides, so please excuse lack of formatting):

The IOPS Challenge

o SSDs
  - Ready for 50,000 IOPS/s per disk?
    + >200,000 ctxsw/s per disk
    + 50,000 intr/s per disk
    + Does not scale to many disks
  - Raw IOP capacity per HBA
    + will be a limiting factor
    + driver design will need to focus on IOPS optimisations,
      not achieving max bandwidth
  - CPU overhead will be high

o Looks more like the network problem
  - similar packet rates to gigabit ethernet per disk
    many, many more interfaces than a typical network stack
  - HBAs with multiple disks will have to handle packet rates
    closer to 10Gb ethernet
  - similar interrupt scaling tricks will be needed
    + MSI-X directed interrupts
    + one vector per disk behind the HBA?
    + polling rather than interrupt driven

o Will require both hardware and software to evolve
o Not going to happen overnight
o Two orders of magnitude increase in performance is a big
  disconnect
o Optimisations being made for current (cheap) SSDs have a
  short life
  - random write performance is not a limiting factor at
    the high end....

I think this shows the value we have been getting from these workshops - cross pollination of ideas, challenges, techniques, etc across the wider community. We might not see results immediately, but they are eventually appearing...

Solid-state storage devices and the block layer

Posted Oct 5, 2010 8:59 UTC (Tue) by marcH (subscriber, #57642) [Link] (4 responses)

> Networking technology went from 10Mb/s in the 1980's to 10Gb/s now, the better part of 30 years later. SSDs have forced a similar jump (three orders of magnitude) in a much shorter period of time - and every indication suggests that devices with IOPS rates in the millions are not that far away.

Probably the main reason why such an unfortunate IOPS jump has been forced in networking is backward compatibility. Jumbo frames? Fail because of backward compatibility. Evolving TCP/IP to ease hardware assistance? Fail because of backward compatibility. Etc.

That is because the backward compatibility requirement is nowhere as strong as in networking. You can easily upgrade your PC. It is even reasonably easy to upgrade your company-wide software. But good luck trying to upgrade the Internet. Or even just Ethernet. See IPv6 for instance: it comes as a brand new feature practically not touching anything already in place, but even such a smooth "upgrade" is a hard sell!

One of the unfortunate consequences is that transferring a DVD image on the network requires millions of IOPS all across the path.

In comparison, the need for backward compatibility in storage is basically inexistent. So this network/storage analogy must stop somewhere. Please someone from the storage camp tell us where exactly. Surely reading or writing a DVD image to disk does not/will not require millions of IOPS. Or will it still?

Solid-state storage devices and the block layer

Posted Oct 5, 2010 11:13 UTC (Tue) by axboe (subscriber, #904) [Link]

Sequential IO will of course use larger IO sizes. The IOPS quest is largely for the mainly randomized IO workloads.

Solid-state storage devices and the block layer

Posted Oct 5, 2010 18:21 UTC (Tue) by angdraug (subscriber, #7487) [Link] (1 responses)

See IPv6 for instance: it comes as a brand new feature practically not touching anything already in place, but even such a smooth "upgrade" is a hard sell!

Have you seen this article at ArsTechnica? It goes to some lengths to explain the problems with IPv6 transition. If it's to be believed, IPv6 transition is quite far from "smooth".

Solid-state storage devices and the block layer

Posted Oct 5, 2010 23:30 UTC (Tue) by marcH (subscriber, #57642) [Link]

> If it's to be believed, IPv6 transition is quite far from "smooth".

Yes but it would have been much worse (read: impossible) if IPv6 deployment ever required substantial changes to IPv4.

This is an interesting article. Except they are wrong when they pretend it is easy to break backward-compatibility with Ethernet or TCP. It is not easy but only "less impossible" than breaking IPv4 backward compatibility.

Note: the focus of the article is obviously neither on Ethernet nor on TCP.

Solid-state storage devices and the block layer

Posted Oct 8, 2010 23:48 UTC (Fri) by giraffedata (guest, #1954) [Link]

Probably the main reason why such an unfortunate IOPS jump has been forced in networking is backward compatibility.

...

In comparison, the need for backward compatibility in storage is basically inexistent.

Well, the whole reason SSDs exist is backward compatibility with rotating media, and it does slow things down considerably. If not for backward compatibility, we wouldn't use SCSI or even Linux block devices to access solid state storage. Write amplification by read-modify-write wouldn't be a problem if the device weren't trying to emulate a 512-byte-sectored disk drive.

Existence of SSDs tells me people aren't willing to replace the entire system at once -- they want to replace just the disk drives.

Not knowing the network issues, though, I can believe that backward compatibility hinders performance less in storage than for ethernet.

Solid-state storage devices and the block layer

Posted Oct 5, 2010 10:24 UTC (Tue) by mjthayer (guest, #39183) [Link] (4 responses)

It is rather nice that most of the performance work looks like undoing tricks to make rotational media work faster.

Solid-state storage devices and the block layer

Posted Oct 5, 2010 10:49 UTC (Tue) by hmh (subscriber, #3838) [Link] (3 responses)

Unless you're the owner of rotational media, and the optimizations are being permanently undone (instead of being just disabled for non-rotational media).

Solid-state storage devices and the block layer

Posted Oct 5, 2010 10:57 UTC (Tue) by mjthayer (guest, #39183) [Link] (2 responses)

> Unless you're the owner of rotational media, and the optimizations are being permanently undone (instead of being just disabled for non-rotational media).
Is that really the case? I have trouble imagining that they are willing to drop support for rotational media quite this fast.

Solid-state storage devices and the block layer

Posted Oct 5, 2010 11:13 UTC (Tue) by hmh (subscriber, #3838) [Link]

I don't know. I sure hope it isn't the case, we will need fast, optimized support for rotational media for a few years yet...

Solid-state storage devices and the block layer

Posted Oct 5, 2010 11:15 UTC (Tue) by axboe (subscriber, #904) [Link]

Optimizations for rotating media are not dropped. Some of the early SSD work was centered around detecting them properly; that gives you a way to make informed decisions on these optimizations.

Solid-state storage devices and the block layer

Posted Oct 5, 2010 15:54 UTC (Tue) by jzbiciak (guest, #5246) [Link] (10 responses)

Out of curiosity: Will we ever see SSDs start looking more like RAM and less like disks? Already mmap() and friends blur the line between what's RAM and what's disk. What if this relationship became shallower?

Solid-state storage devices and the block layer

Posted Oct 5, 2010 17:29 UTC (Tue) by strappe (guest, #53440) [Link] (9 responses)

A "universal" memory technology has been the holy grail for decades: fast as SRAM, density and non-volatility of Flash, and cost of DRAM. There are various technologies that combine at least some of these characteristics: Magneto-resistive (MRAM), ferroelectric (FRAM), phase-change memory (PCM), programmable metalization cell (PMC) and resistive (RRAM). Whether any of these will be commercially viable is still unknown.

I can easily imagine that flash will displace hard drives in most laptops and desktops, but server farms are still going to need massive amounts of cheap storage. Rotating media still has a huge lead in $/bit (100X) so I don't think it will be displaced there any time soon.

Solid-state storage devices and the block layer

Posted Oct 5, 2010 18:04 UTC (Tue) by jzbiciak (guest, #5246) [Link] (8 responses)

I was thinking more in terms of treating flash specifically as less like an "I/O" device and more like a slow memory. I have no doubt that spinning rust will be around for awhile--a decade or more at least. It just seems like wrapping the flash behind a "disk drive" abstraction in hardware puts some artificial upper limits on how well it can perform. It's acceptable with spinning rust because the electronics are so much faster. When you go all solid-state, it just feels like a bottleneck.

Imagine what would happen if the immense creativity of the kernel crowd were unleashed on the problem of load balancing writes, erases and reads across a parallel array of raw flash modules?

Approaches such as UBI/UBIFS sound rather promising. I generally like the idea of owning the problem in kernel space, where it seems like we ought to be able to do much more deliberate and proactive scheduling.

Solid-state storage devices and the block layer

Posted Oct 5, 2010 18:36 UTC (Tue) by dlang (guest, #313) [Link] (7 responses)

the thing is that flash is not random access memory.

the requirement to do bulk deletes makes it far more like spinning disks than ram.

Solid-state storage devices and the block layer

Posted Oct 5, 2010 19:27 UTC (Tue) by jzbiciak (guest, #5246) [Link] (6 responses)

It certainly is random access. I can generally send a command for address X followed by a command for address Y to the same chip, where the response time is not a function of the distance between X and Y, except when they overlap. Instead, the performance is most strongly determined by what commands I sent[*]. Reads are much faster than writes, and both are much, much faster than sector erase.

The opposite is generally true of disks. There, the cost of an operation is more strongly determined by whether it triggered a seek (and how far the seek went) than if the operation was a read or a write. Both reads and writes require getting the head to a particular position on the platter, ignoring any cache that might be built into the drive. Also, under normal operation, spinning-rust drives don't really have an analog to "sector erase." (Yes, there's the old "low-level format" commands, but those aren't generally used during normal filesystem operation.)


[*] Ok, so that's not 100% true, but essentially true in the current context. NAND flash has a notion of "sequential page read" versus "random page read". If you're truly reading random bytes a'la DRAM w/out cache, you'll see noticeably slower performance if the two reads are in different pages. But, if you're doing block transfers, such as 512-byte sector reads, you're reading the whole page. Hopping between any two sectors always costs about the same. Here, read a data sheet! For this particular flash, a random sector read is 10us, sector write is 250us, and page erase is 2ms. The whole page-open/page-close architecture makes it look much more like modern SDRAM than disk.

Solid-state storage devices and the block layer

Posted Oct 5, 2010 19:42 UTC (Tue) by dlang (guest, #313) [Link] (4 responses)

flash allows for random access reads, but is much more limited for writes.

Solid-state storage devices and the block layer

Posted Oct 5, 2010 20:38 UTC (Tue) by jzbiciak (guest, #5246) [Link] (3 responses)

You can do random writes to random empty sectors. Again, that's nothing like how a hard disk works. I'm still strenuously disagreeing with your earlier statement that flash's properties make it more like a disk than like RAM. It's really an entirely different beast worthy of separate consideration, which is why I think wrapping it up in an SSD limits its potential.

With flash, you need entirely new strategies that apply neither to disks nor RAM to get the full benefit from the technology. Much of the effort spent on disks revolves (no pun intended) around eliminating seeks. No such effort is required with RAM or with flash. Flash does require you to think about how you pool your free sectors, though, and how you schedule writing versus erasing. I won't deny that. Rather, I say it only further invalidates your original conjecture that it makes flash more like disks. (I will agree it makes it less like RAM though.)

Because seeks are "free", I could totally see load balancing algorithms of the form "write this block to the youngest free sector on the first available flash device", so that a new write doesn't get held up by devices busy with block erases. That looks nothing like what you'd want to do with a disk. It takes advantage of the "free seek" property of the flash while helping to hide the block erase penalty it imposes. Neither property is a property of a disk drive. Of course, neither property is a property of RAM, either.

Am I splitting hairs over semantics here? Let me step back and summarize, and see if you agree: Raw flash's random access capability and relatively low access time can make it much more like RAM than disk, especially in terms of bandwidth and latency. Raw flash's limitations on writes, however, require the OS to have flash-specific write strategies. They prevent the OS from treating flash identically to RAM, and will require careful thought to be handled correctly. This is similar to how we had to put careful thought into disk scheduling algorithms, even if flash requires entirely different algorithms to address its unique properties.

Solid-state storage devices and the block layer

Posted Oct 9, 2010 14:10 UTC (Sat) by joern (guest, #22392) [Link] (2 responses)

> Flash does require you to think about how you pool your free sectors, though, and how you schedule writing versus erasing.

Intriguing. Can you elaborate a bit? What difference does it make vs. the naïve approach of erasing before writing?

Solid-state storage devices and the block layer

Posted Oct 9, 2010 14:55 UTC (Sat) by dlang (guest, #313) [Link]

the issue is that you have to erase large chunks (on the order of 128K bytes). if you are then writing in small chunks (say the 512 byte sectors that are the default, or even the 4K byte filesystem blocks) you can't simply erase just before writing.

you also have the problem that erasing takes a significant amount of time and power to accomplish, so you don't want to wait until you need to erase to do so and you don't want to erase when you don't need to and are on battery
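
To put rough numbers on the size mismatch described above, here is a small Python sketch. It assumes the worst case, where updating one small chunk in place forces the whole erase block to be rewritten; the function name is made up for illustration, and real flash translation layers try hard to avoid this case by remapping writes.

```python
def worst_case_write_amplification(erase_block_bytes, write_bytes):
    """Bytes physically rewritten per byte the host asked to write,
    assuming a naive in-place rewrite of the whole erase block."""
    if write_bytes <= 0 or write_bytes > erase_block_bytes:
        raise ValueError("write size must be between 1 and the erase block size")
    return erase_block_bytes / write_bytes

ERASE_BLOCK = 128 * 1024  # erase block size mentioned above

sector_amp = worst_case_write_amplification(ERASE_BLOCK, 512)     # 512-byte sector
fs_block_amp = worst_case_write_amplification(ERASE_BLOCK, 4096)  # 4K filesystem block
print(sector_amp, fs_block_amp)
```

Under that assumption a single 512-byte update costs 256 times its own size in flash writes, and a 4K update costs 32 times.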

Solid-state storage devices and the block layer

Posted Oct 9, 2010 15:03 UTC (Sat) by jzbiciak (guest, #5246) [Link]

Note: I'm not an expert. Please do not mistake me for one. :-) Here are my observations, though, along with things I've read elsewhere.

Flash requires wear leveling in order to maximize its life. For the greatest effect, you want to wear level across the entire device, which means picking up and moving otherwise quiescent data so that each sector sees approximately the same number of erasures. That's one aspect.
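
As a rough illustration of the allocation half of that idea, here is a toy Python sketch that always hands out the free block with the fewest erasures. The class and method names are hypothetical, and it deliberately leaves out the harder part mentioned above, which is relocating quiescent data.

```python
import heapq

class WearLeveler:
    """Toy allocator: always hand out the free block with the fewest erases."""

    def __init__(self, num_blocks):
        # Min-heap of (erase_count, block_number) for erased, free blocks.
        self.free = [(0, b) for b in range(num_blocks)]
        heapq.heapify(self.free)
        self.erase_count = [0] * num_blocks

    def allocate(self):
        """Return the least-worn free block (raises IndexError if none are free)."""
        _, block = heapq.heappop(self.free)
        return block

    def erase_and_free(self, block):
        """Erase a block (one more wear cycle) and return it to the free pool."""
        self.erase_count[block] += 1
        heapq.heappush(self.free, (self.erase_count[block], block))

wl = WearLeveler(4)
for _ in range(8):
    b = wl.allocate()
    wl.erase_and_free(b)
print(wl.erase_count)
```

With four blocks and eight allocate/erase cycles, each block ends up erased twice instead of one block taking all eight erasures.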

Another aspect is that erase blocks are generally much larger than write sectors. So, when you do erase, you end up erasing quite a lot. Furthermore erasure is about an order of magnitude slower than writing, and writing is about an order of magnitude slower than reading. For a random flash device whose data sheet I just pulled up, a random read takes 25us, page program takes 300us, and block erase takes 2ms. Pages are 2K bytes, whereas erase blocks are 128K bytes.
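
Plugging those data sheet figures into a quick Python calculation shows why erase scheduling matters. This is only a sketch: it assumes a naive in-place update that reads every page of the erase block, erases it, and programs every page back, and it compares that with programming a single page into a block that was erased ahead of time.

```python
READ_US, PROGRAM_US, ERASE_US = 25, 300, 2000      # timings quoted above
PAGE_BYTES, ERASE_BLOCK_BYTES = 2 * 1024, 128 * 1024

pages_per_block = ERASE_BLOCK_BYTES // PAGE_BYTES

# Naive in-place update: read all pages, erase the block, program all pages.
naive_update_us = (pages_per_block * READ_US
                   + ERASE_US
                   + pages_per_block * PROGRAM_US)

# Update redirected to a page that was already erased in the background.
pre_erased_update_us = PROGRAM_US

print(pages_per_block, naive_update_us, pre_erased_update_us)
```

With these numbers the naive path takes 22,800us against 300us for the pre-erased path, a factor of 76.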

(Warning: This is where I get speculative!) And finally, if you have multiple flash devices (or multiple independent zones on the same flash device), you can take advantage of that fact and the fact that "seeks are free" by redirecting writes to idle flash units if others are busy. That's probably the most interesting area to explore algorithmically, IMO. Given that an erase operation can take a device out of commission for 2ms, picking which device to start an erase operation on and when to do it can have a pretty big impact on performance. If you can do background erase on idle devices, for example, then you can hide the cost.
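
In the same speculative spirit, here is a minimal Python sketch of "send the write to whichever unit is idle". The function names are invented for illustration, the timings are the ones quoted above, and the model ignores everything else a real controller would have to track.

```python
PROGRAM_US = 300   # page program time quoted above
ERASE_US = 2000    # block erase time quoted above

def pick_device(busy_until_us, now_us):
    """Return the first device idle at now_us, else the one that frees up soonest."""
    idle = [i for i, t in enumerate(busy_until_us) if t <= now_us]
    if idle:
        return idle[0]
    return min(range(len(busy_until_us)), key=lambda i: busy_until_us[i])

def submit_write(busy_until_us, now_us):
    """Queue one page write; return (device index, latency seen by the writer)."""
    dev = pick_device(busy_until_us, now_us)
    start = max(now_us, busy_until_us[dev])
    busy_until_us[dev] = start + PROGRAM_US
    return dev, busy_until_us[dev] - now_us

# One device that is busy erasing until t = 2000us.
single_result = submit_write([ERASE_US], now_us=0)

# Two devices: the first is erasing, the second is idle.
pair_result = submit_write([ERASE_US, 0], now_us=0)

print(single_result, pair_result)
```

In this toy model the lone device makes the write wait out the erase (2,300us in total), while the second, idle device completes it in 300us.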

Solid-state storage devices and the block layer

Posted Oct 7, 2010 12:38 UTC (Thu) by nix (subscriber, #2304) [Link]

> NAND flash has a notion of "sequential page read" versus "random page read". If you're truly reading random bytes a'la DRAM w/out cache, you'll see noticeably slower performance if the two reads are in different pages.

That sounds just like normal RAM: if you don't have to specify the row *and* column, you save on one CAS/RAS select cycle. Of course this is hidden behind the MMU and CPU cache management code and so on, so we don't often notice it, but it is there.

application impact

Posted Oct 5, 2010 18:07 UTC (Tue) by wingo (guest, #26929) [Link] (6 responses)

I wonder what these capabilities mean for user-space. I spend some time optimizing disk access in my programs, and for what?

I asked Michael Meeks a couple of Fosdems ago about how his iogrind disk profiler was coming, and he said that he totally dropped it, because ssds will kill all these issues. Sounds easier than fixing OpenOffice.org^WLibreOffice issues in code...

Is the "best practice" going to shift away from implementing things like GTK's icon cache and other purely seek-avoiding caches?

application impact

Posted Oct 5, 2010 22:29 UTC (Tue) by zlynx (guest, #2285) [Link] (5 responses)

I sure hope not.

GTK applications' current "best practice" of "ignore the RAM use, they can buy more" has already destroyed the usefulness of old hardware with a modern Linux software stack.

application impact

Posted Oct 6, 2010 0:16 UTC (Wed) by mpr22 (subscriber, #60784) [Link] (3 responses)

Eight Megabytes And Constantly Swapping. This is not a new phenomenon.

application impact

Posted Oct 6, 2010 1:23 UTC (Wed) by dlang (guest, #313) [Link] (2 responses)

the problem is that system resources have increased by 1000x (or close to it) and people trying to do very similar work find themselves in almost the same situation.

yes we are doing more with our systems, but nowhere near that much more.

application impact

Posted Oct 6, 2010 9:23 UTC (Wed) by marcH (subscriber, #57642) [Link] (1 responses)

I doubt that hard drive performance (as considered in this article) has increased 1000x. Has it? The memory hierarchy looks more and more stretched.

(Here I am ignoring SSDs, still too new to be part of The History)

application impact

Posted Oct 6, 2010 11:04 UTC (Wed) by dlang (guest, #313) [Link]

it depends on what you are measuring

in terms of size, drives have grown at least 1000x

in terms of sequential I/O speeds they have improved drastically (I don't think quite 1000x, but probably well over 100x, so I think it's in the ballpark)

in terms of seek time, they've barely improved 10x or so

this is ignoring things like SSDs, high-end raid controllers (with battery backed NVRAM caches) and so on which distort performance numbers upwards.

but yes, the performance difference between the CPU registers and disk speeds is being stretched over time.

just the difference in speed between the registers and ram is getting stretched to the point where people are seriously saying that it may be a good idea to start thinking of ram as a block device, accessed in blocks of 128-256 bytes (the cache line size for the CPU). right now the CPU hides this from you by 'transparently' moving the blocks in and out of the cache of the various processors for you so that if you choose to you can ignore this.

but when you are really after performance, a high end system starts looking very strange. You have several sets of processors that share a small amount of high-speed storage (L2/L3 cache) and have a larger amount of lower speed storage (the memory directly connected to that CPU), plus a network to access the lower speed storage connected to other CPUs. Then you have a lower speed network to talk to the southbridge chipset to interact with the outside world (things like your monitor/keyboard/disk drives, PCI-e cards, etc).

This is a rough description of NUMA and the types of things that you can run into on large multi-socket systems, but the effect starts showing up on surprisingly small systems (which is why per-cpu variables and such things are used so frequently)

application impact

Posted Oct 14, 2010 19:29 UTC (Thu) by Wol (subscriber, #4433) [Link]

That's fine until they're on a system like mine ...

Three slots, max capacity 256Mb per slot, three 256Mb chips in the machine.

"That's no problem, they can just buy a new machine ..."

Cheers,
Wol

Solid-state storage devices and the block layer

Posted Oct 5, 2010 18:48 UTC (Tue) by jmm82 (guest, #59425) [Link]

I enjoyed the article and most of the ideas seem very logical. Are there any plans to generalize any of the code between the networking and block io or would that be too complex to maintain?

iSCSI, Solid-state storage devices and the block layer

Posted Oct 5, 2010 21:17 UTC (Tue) by jhhaller (guest, #56103) [Link]

Is using iSCSI to use Solid State disks mounted on a file server part of the testing and improvement plan? I can imagine this stresses both worlds, namely network interrupt steering along with block devices, all interacting in less than obvious ways.

Solid-state storage devices and the block layer

Posted Oct 6, 2010 22:05 UTC (Wed) by eds (guest, #69511) [Link]

Good article.

At the extreme high end of PCIe SSDs, a system trying to do lots of small (4k) reads with high parallelism will be limited by having any queue locking at all. Running without a request queue remains an attractive option for these devices.

Another future improvement to watch out for is MSI-X interrupts. With MSI-X, it is possible to statically assign an interrupt to a single CPU core in such a way that an I/O retirement could interrupt the originating CPU directly; over about 600K IOPS it becomes important to spread out the interrupt/retirement workload as much as possible.
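
For a feel of the time budget involved, here is a back-of-the-envelope Python sketch. It assumes completions are spread perfectly evenly across cores, which is an idealization, and the 8-core figure is just an example value, not something taken from a real system.

```python
def per_io_budget_us(iops, cpus=1):
    """Average CPU time available per I/O completion, in microseconds,
    if the completion work is split evenly over `cpus` cores."""
    return cpus * 1_000_000 / iops

one_core = per_io_budget_us(600_000)             # everything lands on one core
eight_cores = per_io_budget_us(600_000, cpus=8)  # hypothetical 8-core spread
print(round(one_core, 2), round(eight_cores, 2))
```

At 600K IOPS a single core has well under 2us per completion, so any lock contention or cache miss on that path quickly eats the whole budget.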

Solid-state storage devices: most I/O patterns

Posted Oct 9, 2010 0:00 UTC (Sat) by giraffedata (guest, #1954) [Link]

While workloads will vary, Jens says, most I/O patterns are dominated by random I/O and relatively small requests.

There are so many ways to count "most" that this fact is pretty useless. Jens should just say, "some important I/O patterns are ...," which is reason enough to do this work.

I see a lot of thought wasted prioritizing things based on arbitrary "mosts": Most I/Os are reads, most files are under 4K, most computers are personal workstations.

Solid-state storage devices and the block layer

Posted Oct 15, 2010 17:42 UTC (Fri) by jmy3056 (guest, #70648) [Link] (1 responses)

I think the analogy presented misses the mark. Instead of equating block IO with Network improvements consider this.

Media that stores electronic information that used to spin but now doesn't is a closer parallel with RAM. Optimizations for "disk" IO need to follow a similar path as OS/kernels when dealing with RAM.

Solid-state storage devices and the block layer

Posted Oct 22, 2010 22:04 UTC (Fri) by eds (guest, #69511) [Link]

There are many good reasons to treat NAND flash storage more like disk than like DRAM.

1. Addressing: DRAM is byte/word addressable; NAND flash is not. NAND flash pages are currently 4KB in size and must be read/written as whole pages.
2. Flash management: flash sucks. It has long erase times, needs wear-leveling, needs lots of ECC and redundancy to be reliable. Dealing with flash requires a lot of careful management that nobody's going to want on a DRAM-like path.
3. Speed: flash is a lot faster than disk. But it's still a lot slower than DRAM (a write to a busy NAND part may have to wait up to 1ms).
4. Size: it's very expensive to try to address a terabyte of DRAM. 64-bit CPUs don't actually implement a full 64-bit address space. It's much cheaper to just address huge storage devices in blocks, like a disk.
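
To make the addressing point in items 1 and 4 concrete, here is a small Python sketch comparing how many address bits a terabyte needs when addressed by byte versus by 4KB block. The helper name is made up, and the sizes are the ones used in the list above.

```python
def bits_needed(count):
    """Smallest number of bits that can index `count` distinct items."""
    return max(1, (count - 1).bit_length())

TERABYTE = 2 ** 40   # bytes
BLOCK = 4 * 1024     # 4KB flash page size from item 1

byte_address_bits = bits_needed(TERABYTE)            # byte-addressable, like DRAM
block_address_bits = bits_needed(TERABYTE // BLOCK)  # block-addressable, like a disk
print(byte_address_bits, block_address_bits)
```

Byte addressing needs 40 bits for one terabyte, while 4KB block addressing needs only 28.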

If in a few more years phase-change memory becomes big and cheap enough to give NAND flash a run for its money, then it may be time to start treating nonvolatile memory sort of like DRAM. But that day isn't quite here yet.


Copyright © 2010, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds