Solid-state storage devices and the block layer
While workloads will vary, Jens says, most I/O patterns are dominated by random I/O and relatively small requests. Thus, getting the best results requires being able to perform a large number of I/O operations per second (IOPS). With a high-end rotating drive (running at 15,000 RPM), the maximum rate possible is about 500 IOPS. Most real-world drives, of course, will have significantly slower performance and lower I/O rates.
SSDs, by eliminating seeks and rotational delays, change everything; we have gone from hundreds of IOPS to hundreds of thousands of IOPS in a very short period of time. A number of people have said that the massive increase in IOPS means that the block layer will have to become more like the networking layer, where every bit of per-packet overhead has been squeezed out over time. But, as Jens points out, time is not in great abundance. Networking technology went from 10Mb/s in the 1980s to 10Gb/s now, the better part of 30 years later. SSDs have forced a similar jump (three orders of magnitude) in a much shorter period of time - and every indication suggests that devices with IOPS rates in the millions are not that far away. The result, says Jens, is "a big problem."
This problem pops up in a number of places, but it usually comes down to contention for shared resources. Locking overhead that is tolerable at 500 IOPS is crippling at 500,000. There are problems with contention at the hardware level too; vendors of storage controllers have been caught by surprise by SSDs and are having to scramble to get their performance up to the required levels. The growth of multicore systems naturally makes things worse; such systems can create contention problems throughout the kernel, and the block layer is no exception. So much of the necessary work comes down to avoiding contention.
Before that, though, some work had to be done just to get the block layer to recognize that it is dealing with an SSD and react accordingly. Traditionally, the block layer has been driven by the need to avoid head seeks; the use of quite a bit of CPU time could be justified if it managed to avoid a single seek. SSDs - at least the good ones - care a lot less about seeks, so expending a bunch of CPU time to avoid them no longer makes sense. There are various ways of detecting SSDs in the hardware, but they don't always work, especially with the lower-quality devices. So the block layer exports a flag under
/sys/block/<device>/queue/rotational
which can be used to override the system's notion of what kind of storage device it is dealing with.
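For illustration only (this example is not from the talk), a misdetected device can be marked as non-rotational from user space by writing to that file; the device name "sda" here is a placeholder:

```c
/* Sketch: mark a (hypothetical) device "sda" as non-rotational so
 * the block layer stops optimizing for seek avoidance. Needs root. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/block/sda/queue/rotational";
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return 1;
    }
    fputs("0", f);   /* 0 = non-rotational (SSD), 1 = rotational */
    return fclose(f) ? 1 : 0;
}
```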
Improving performance with SSDs can be a challenging task. There is no single big bottleneck which is causing performance problems; instead, there are numerous small things to fix. Each fix yields a bit of progress, but it mostly serves to highlight the next problem. Additionally, performance testing is hard; results are often not reproducible and can be perturbed by small changes. This is especially true on larger systems with more CPUs. Power management can also get in the way of the generation of consistent results.
One of the first things to address on an SSD was queue plugging. On a rotating disk, the first I/O operation to show up in the request queue will cause the queue to be "plugged," meaning that no operations will actually be dispatched to the hardware. The idea behind plugging is that, by allowing a little time for additional I/O requests to arrive, the block layer will be able to merge adjacent requests (reducing the operation count) and sort them into an optimal order, increasing performance. Performance on SSDs tends not to benefit from this treatment, though there is still a little value to merging requests. Dropping (or, at least, reducing) plugging not only eliminates a needless delay; it also reduces the need to take the queue lock in the process.
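To see why plugging pays off on rotating media, consider the merge test: two requests can be combined when one ends exactly where the next begins. A simplified sketch of that adjacency check follows; the names are illustrative, not the kernel's actual structures:

```c
/* Simplified sketch of a request merge test; the kernel's request
 * structure carries far more state than this. */
#include <stdbool.h>
#include <stdint.h>

struct io_request {
    uint64_t sector;       /* starting sector on the device */
    uint32_t nr_sectors;   /* length of the request in sectors */
};

/* A back-merge is possible when 'next' begins exactly where 'prev'
 * ends; the two can then be dispatched as a single operation. */
static bool can_back_merge(const struct io_request *prev,
                           const struct io_request *next)
{
    return prev->sector + prev->nr_sectors == next->sector;
}
```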
Then, there is the issue of request timeouts. Like most I/O code, the block layer needs to notice when an I/O request is never completed by the device. That detection is done with timeouts. The old implementation involved a separate timeout for each outstanding request, but that clearly does not scale when the number of such requests can be huge. The answer was to go to a per-queue timer, reducing the number of running timers considerably.
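The scheme can be sketched as follows, again with illustrative names rather than the kernel's: each request carries a deadline, requests are kept in submission order, and one timer per queue is armed for the oldest outstanding request.

```c
#include <stdio.h>
#include <time.h>

/* Illustrative sketch only - names do not match the kernel's. */
struct request {
    time_t deadline;            /* when this request times out */
    struct request *next;       /* FIFO order: oldest first */
};

struct queue {
    struct request *oldest;     /* head of the pending-request FIFO */
    time_t timer_expires;       /* the single per-queue timer */
};

static void handle_timed_out_request(struct request *rq)
{
    fprintf(stderr, "request %p timed out\n", (void *)rq);
}

/* Run when the one per-queue timer fires: expire every request whose
 * deadline has passed, then re-arm the timer for the new oldest one.
 * One timer covers any number of outstanding requests. */
static void queue_timeout_fn(struct queue *q)
{
    time_t now = time(NULL);

    while (q->oldest && q->oldest->deadline <= now) {
        struct request *rq = q->oldest;
        q->oldest = rq->next;
        handle_timed_out_request(rq);
    }
    if (q->oldest)
        q->timer_expires = q->oldest->deadline;
}
```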
Block I/O operations, due to their inherently unpredictable execution times, have traditionally contributed entropy to the kernel's random number pool. There is a problem, though: the necessary call to add_timer_randomness() has to acquire a global lock, causing unpleasant systemwide contention. Some work was done to batch these calls and accumulate randomness on a per-CPU basis, but, even when batching 4K operations at a time, the performance cost was significant. On top of it all, it's not really clear that using an SSD as an entropy source makes a lot of sense. SSDs lack mechanical parts moving around, so their completion times are much more predictable. Still, for the moment, SSDs contribute to the entropy pool by default; administrators who would like to change that behavior can do so by changing the queue/add_random sysfs variable.
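The batching approach might look something like this user-space sketch, with per-thread state standing in for per-CPU state and a toy (non-cryptographic) mixing function; all names are illustrative:

```c
#include <pthread.h>
#include <stdint.h>

#define BATCH 4096   /* the talk mentions batching ~4K operations */

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t entropy_pool;    /* stand-in for the global pool */

/* Per-thread batch state: no shared data is touched until full. */
static _Thread_local uint64_t batch_mix;
static _Thread_local unsigned batch_count;

void add_completion_randomness(uint64_t timestamp)
{
    /* Toy mixer (an LCG step), not a cryptographic construction. */
    batch_mix = batch_mix * 6364136223846793005ULL + timestamp;
    if (++batch_count < BATCH)
        return;                       /* fast path: no lock taken */

    pthread_mutex_lock(&pool_lock);   /* slow path: once per batch */
    entropy_pool ^= batch_mix;
    pthread_mutex_unlock(&pool_lock);
    batch_count = 0;
}
```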
There are other locking issues to be dealt with. Over time, the block layer has gone from being protected by the big kernel lock to a block-level lock, then to a per-disk lock, but lock contention is still a problem. The I/O scheduler adds contention of its own, especially if it is performing disk-level accounting. Interestingly, contention for the locks themselves is not usually the problem; it's not that the locks are being held for too long. The big problem is the cache-line bouncing caused by moving the lock between processors. So the traditional technique of dropping and reacquiring locks to reduce lock contention does not help here - indeed, it makes things worse. What's needed is to avoid taking the lock altogether.
Block requests enter the system via __make_request(), which is responsible for getting a request (represented by a BIO structure) onto the queue. Two lock acquisitions are required to do this job - three if the CFQ I/O scheduler is in use. Those two acquisitions are the result of a lock split done to reduce contention in the past; that split, when the system is handling requests at SSD speeds, makes things worse. Eliminating it led to a roughly 3% increase in IOPS with a reduction in CPU time on a 32-core system. It is, Jens says, a "quick hack," but it demonstrates the kind of changes that need to be made.
The next step for this patch is to drop the I/O request allocation batching - a mechanism added to increase throughput on rotating drives by allowing the simultaneous submission of multiple requests. Jens also plans to drop the allocation accounting code, which tracks the number of requests in flight at any given time. Counting outstanding I/O operations requires global counters and the associated contention, but it can be done without most of the time. Some accounting will still be done at the request queue level to ensure that some control is maintained over the number of outstanding requests. Beyond that, there is some per-request accounting which can be cleaned up and, Jens thinks, request completion can be made completely lockless. He hopes that this work will be ready for merging into 2.6.38.
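The "can be done without most of the time" point is the key one: per-CPU counters make the common operations cheap and contention-free, at the cost of a slower, approximate global sum. A user-space sketch of the idea, with illustrative names:

```c
/* User-space sketch of per-CPU in-flight counters. The kernel can
 * disable preemption around the update; user space would need
 * atomics, since a thread may migrate between the sched_getcpu()
 * call and the increment. */
#define _GNU_SOURCE
#include <sched.h>

#define MAX_CPUS 64

struct inflight_counter {
    long count;
    char pad[64 - sizeof(long)];   /* one cache line per counter */
};

static struct inflight_counter inflight[MAX_CPUS];

static void request_submitted(void) { inflight[sched_getcpu()].count++; }
static void request_completed(void) { inflight[sched_getcpu()].count--; }

/* The global total needs a sweep of all counters, so it is slower
 * and only approximately current - acceptable for keeping the number
 * of outstanding requests under control. */
static long inflight_total(void)
{
    long sum = 0;
    for (int i = 0; i < MAX_CPUS; i++)
        sum += inflight[i].count;
    return sum;
}
```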
Another important technique for reducing contention is keeping processing on the same CPU as often as possible. In particular, there are a number of costs which are incurred if the CPU which handles the submission of a specific I/O request is not the CPU which handles that request's completion. Locks are bounced between CPUs in an unpleasant way, and the slab allocator tends not to respond well when memory allocated on one processor is freed elsewhere in the system. In the networking layer, this problem has been addressed with techniques like receive packet steering, but, unlike some networking hardware, block I/O controllers are not able to direct specific I/O completion interrupts to specific CPUs. So a different solution was required.
That solution took the form of smp_call_function(), which performs fast cross-CPU calls. Using smp_call_function(), the block I/O completion code can direct the completion of specific requests to the CPU where those requests were initially submitted. The result is a relatively easy performance improvement. A dedicated administrator who is willing to tweak the system manually can do better, but that takes a lot of work and the solution tends to be fragile. This code - which was merged back in 2.6.27 and made the default in 2.6.32 - is an easier way that takes away a fair amount of the pain of cross-CPU contention. Jens noted with pride that the block layer was not chasing the networking code with regard to completion steering - the block code had it first.
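A kernel-flavored sketch of the idea: record the submitting CPU in the request and, at completion time, bounce the work over with a cross-CPU call if the interrupt landed elsewhere. Only smp_call_function_single() is a real kernel function here; the request structure and helpers are hypothetical.

```c
#include <linux/smp.h>

struct my_request {
    int submit_cpu;   /* recorded when the request was submitted */
    /* ... */
};

static void finish_request(struct my_request *rq);  /* completion work */

static void complete_on_submit_cpu(void *data)
{
    /* Runs on the submitting CPU: locks and slab frees stay local. */
    finish_request(data);
}

/* Called from the completion interrupt, possibly on the "wrong" CPU. */
static void steer_completion(struct my_request *rq)
{
    if (rq->submit_cpu == smp_processor_id())
        finish_request(rq);
    else
        smp_call_function_single(rq->submit_cpu,
                                 complete_on_submit_cpu, rq, 0);
}
```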
On the other hand, the blk-iopoll interrupt mitigation code was not just inspired by the networking layer - some of the code was "shamelessly stolen" from there. The blk-iopoll code turns off completion interrupts when I/O traffic is high and uses polling to pick up completed events instead. On a test system, this code reduced 20,000 interrupts/second to about 1,000. Jens says that the results are less conclusive on real-world systems, though.
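The pattern, familiar from NAPI in the networking stack, looks roughly like this; the names are illustrative and are not the blk-iopoll API itself:

```c
/* Generic sketch of the interrupt-mitigation pattern. */
#define POLL_BUDGET 64

struct hw_device;                                   /* opaque device */

/* Provided by the driver in a real implementation: */
void disable_completion_irq(struct hw_device *dev);
void enable_completion_irq(struct hw_device *dev);
void schedule_poll(struct hw_device *dev);  /* defer poll_completions() */
int  process_completed_requests(struct hw_device *dev, int budget);

/* Hard interrupt handler: under load, take one interrupt, then switch
 * the device to polled mode instead of fielding an IRQ per completion. */
void completion_irq_handler(struct hw_device *dev)
{
    disable_completion_irq(dev);
    schedule_poll(dev);
}

/* Polling loop, run later from softirq-like context. */
void poll_completions(struct hw_device *dev)
{
    int done = process_completed_requests(dev, POLL_BUDGET);

    if (done < POLL_BUDGET)
        enable_completion_irq(dev);   /* quiet again: back to IRQs */
    else
        schedule_poll(dev);           /* still busy: keep polling */
}
```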
An approach which "has more merit" is "context plugging," a rework of the queue plugging code. Currently, queue plugging is done implicitly on I/O submission, with an explicit unplug required at a later time. That has been the source of a lot of bugs; forgetting to unplug a queue is a common mistake. The plan is to make plugging and unplugging fully implicit, but give I/O submitters a way to inform the block layer that more requests are coming soon. That makes the code clearer and more robust; it also gets rid of a lot of expensive per-queue state which must be maintained. There are still some problems to be solved, but the code works, is "tasty on many levels," and yields a net reduction of some 600 lines of code. Expect a merge in 2.6.38 or 2.6.39.
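As a sketch of what such submitter hints look like in practice: the rework as eventually merged provides an on-stack plug that brackets a batch of submissions. blk_start_plug() and blk_finish_plug() are the names it ended up with; submit_one_bio() below is a hypothetical stand-in for the real submission path.

```c
#include <linux/blkdev.h>

static void submit_one_bio(struct bio *bio);   /* hypothetical helper */

/* The plug lives on the stack, so there is no per-queue plug state
 * to maintain and no explicit unplug to forget. */
static void submit_batch(struct bio **bios, int count)
{
    struct blk_plug plug;
    int i;

    blk_start_plug(&plug);       /* hint: a batch of I/O follows */
    for (i = 0; i < count; i++)
        submit_one_bio(bios[i]);
    blk_finish_plug(&plug);      /* batch complete; dispatch it */
}
```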
Finally, there is the "weird territory" of a multiqueue block layer - an idea which, once again, came from the networking layer. The creation of multiple I/O queues for a given device will allow multiple processors to handle I/O requests simultaneously with less contention. It's currently hard to do, though, because block I/O controllers do not (yet) have multiqueue support. That problem will be fixed eventually, but there will be some other challenges to overcome: I/O barriers will become significantly more complicated, as will per-device accounting. All told, it will require some major changes to the block layer and a special I/O scheduler. Jens offered no guidance as to when we might see this code merged.
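The core idea can be sketched simply: give each CPU its own submission queue so that submitters need never share a lock. All names here are hypothetical:

```c
#define _GNU_SOURCE
#include <sched.h>

/* Illustrative only: one hardware submission queue per CPU (or per
 * group of CPUs) lets processors queue I/O without a shared lock. */
#define NR_QUEUES 64

struct hw_queue {
    /* per-queue lock, ring of pending commands, doorbell, ... */
    int pending;
};

static struct hw_queue queues[NR_QUEUES];

/* Each CPU submits to its own queue; ideally completions would be
 * steered back to the same queue's CPU (see above). */
static struct hw_queue *queue_for_submission(void)
{
    return &queues[sched_getcpu() % NR_QUEUES];
}
```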
The conclusion which comes from this talk is that the Linux block layer is
facing some significant challenges driven by hardware changes. These
challenges are being addressed, though, and the code is moving in the
necessary direction. By the time most of us can afford a system with one
of those massive, million-IOPS arrays on it, Linux should be able to use
it to its full potential.
| Index entries for this article | |
| --- | --- |
| Kernel | Block layer/Solid-state storage devices |
| Kernel | Solid-state storage devices |
| Conference | LinuxCon Japan/2010 |
Posted Oct 4, 2010 23:38 UTC (Mon)
by nix (subscriber, #2304)

> Still, for the moment, SSDs contribute to the entropy pool by default; administrators who would like to change that behavior can do so by changing the queue/add_random sysfs variable.

Well, yes, but this isn't in any released kernel yet.
Posted Oct 5, 2010 5:24 UTC (Tue)
by shemminger (subscriber, #5739)

bogus random entropy sources

Some drivers are credited with providing entropy when they do not. For example, Xen drivers are purely virtual
and therefore deterministic.
Posted Oct 5, 2010 6:03 UTC (Tue)
by butlerm (subscriber, #13312)
What does that matter, if they ultimately connect to underlying physical devices which are not?
Posted Oct 5, 2010 15:51 UTC (Tue)
by jzbiciak (guest, #5246)
Off-topic rant:
I don't understand why more processors don't include a proper hardware random number generator. It's a classic case of something that is significantly easier to do in hardware, I'd think.
I mean, sure, you could try to derive a few bits of entropy here, a few bits there from what is otherwise a deterministic system. It's maddeningly frustrating, though, and you have to apply new thought and new techniques every time your system assumptions change. Your case is just such a case, and it sounds like you just punted to a dedicated hardware solution.
Modern CPUs have accelerators for all sorts of things as standard equipment. Why not random numbers? We spend countless millions of transistors on ever larger caches and datapaths. Surely they could spare a few for a really high quality true random number generator.
Posted Oct 5, 2010 17:09 UTC (Tue)
by strappe (guest, #53440)

All VIA x86-compatible processors since the C3 (introduced 2003?) have included a hardware random number generator based on quantum effects; it produces millions of random bits each second, and is accessible with a non-privileged instruction. AFAIK, their opcode choice is unused by either AMD or Intel, so those companies could implement similar functionality (an infinitesimal bit of silicon) and we would have a standard solution at least across the x86 architecture going forward.
Posted Oct 5, 2010 19:10 UTC (Tue)
by jzbiciak (guest, #5246)
VIA's approach on the C3 doesn't sound too unwieldy. This white paper analyzing the generator's output makes for an informative read. The punch line is that it looks like a pretty reasonable source of entropy as long as you do appropriate post processing. The random numbers it generates aren't caveat free, but they're heckuva lot better than disk seeks and keypresses.
Posted Oct 6, 2010 13:56 UTC (Wed)
by jzbiciak (guest, #5246)
Probably because they didn't have a time machine. ;-) The document you reference was written this year. The white paper I reference was written in 2003. And if you meant Rev 1, that didn't come out until 2008.
Maybe you meant the original 800-22? That one came out in 2001.
(Dates came from here.)
Posted Oct 5, 2010 18:26 UTC (Tue)
by mpr22 (subscriber, #60784)

The AES accelerator probably lets them tick a required-feature box for some government programme or other.
Posted Oct 5, 2010 18:46 UTC (Tue)
by jzbiciak (guest, #5246)
If anything, it would make it harder for them to export the chips outside of the United States without getting special approval from the Feds. Cryptographic hardware is a munition under ITAR.
I remember there was some concern awhile back when we put our AES implementation in ROM on some devices, because it calculated AES "too quickly" for some peoples' taste. We ended up making that part of the ROM protected (ie. not user accessible) so that it was only used for boot authentication.
Posted Oct 6, 2010 11:27 UTC (Wed)
by intgr (subscriber, #39733)
The solution has always been obvious to cryptographers. Use a solid cryptographic pseudorandom generator; as long as there is _some_ truly random data in its input -- 128 or so bits' worth -- the output will always be irreversible. As long as this randomness exists, it doesn't matter that the attacker can predict all other input.
In fact, hardware RNGs should _never_ be used directly, because there may be manufacturing flaws or deliberate sabotage. And unlike deterministic algorithms like AES, non-deterministic hardware RNG sources are almost impossible to verify completely. Also it's really quite easy to replace the hw RNG with a deterministic PRNG that passes all randomness tests, yet whose output is entirely predictable to its designer.
So at most, the hw RNG is just one of several randomness sources on any system. As such, cryptographers in general don't consider it worthwhile -- except on diskless embedded systems where there really aren't any other entropy sources.
Unfortunately /dev/random is a poor legacy choice in Linux that goes against this concept.
Posted Oct 7, 2010 12:48 UTC (Thu)
by intgr (subscriber, #39733)
But in general, virtual machine disk I/O still reaches a physical disk sooner or later, so entropy can be successfully gathered from interrupt timings. In some virtualization scenarios, you wouldn't want the VM to access host-CPU-specific features anyway.
Posted Oct 5, 2010 19:01 UTC (Tue)
by patrick_g (subscriber, #44470)

> I don't understand why more processors don't include a proper hardware random number generator. It's a classic case of something that is significantly easier to do in hardware, I'd think.

I think Intel is working on this. See this link: http://www.technologyreview.com/computing/25670/
Posted Oct 6, 2010 3:47 UTC (Wed)
by jzbiciak (guest, #5246)
Well, /dev/urandom doesn't block when the kernel entropy pool runs out. The hardware crypto acceleration may've been getting used, but that's orthogonal to the question of gathering entropy.
Posted Oct 5, 2010 21:58 UTC (Tue)
by nowster (subscriber, #67)
It's actually a hard problem to provide a cheap reliable hardware random number generator. If you look at the effort that a device like Simtec's Entropy Key takes to ensure that each chunk of randomness it delivers is truly random, you'll see why a random number generator is not something that a CPU designer should drop on a spare corner of a CPU die last thing on a Friday afternoon. Semiconductor junction noise generators can be affected by environmental influences: an RNG on a CPU die running hot might have a bias compared with the same one when the CPU is idle and cooler.
Posted Oct 6, 2010 3:51 UTC (Wed)
by jzbiciak (guest, #5246)
I linked this whitepaper above on the technique VIA used on its C3. They used multiple free-running oscillators to gather entropy. The resulting output varies in quality, from 0.75 to 0.99 bits of entropy per output bit, depending on the decimation factor used and whether or not you enable von Neumann whitening. Given that it generates entropy in the megabits/second range, this is several orders of magnitude better than you can get from hard disk seeks and user keystrokes, even if you have to throw most of the numbers away. And, given the high apparent entropy of the raw bits, you don't really need to throw many away at all.
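For reference, the von Neumann whitening mentioned there is simple enough to show in full; this is the textbook extractor, not VIA's actual implementation:

```c
/* Textbook von Neumann whitening: examine raw bits in pairs; a 01
 * pair yields 0, a 10 pair yields 1, and 00/11 pairs are discarded.
 * This removes bias (at the cost of dropping at least half of the
 * bits) as long as the raw bits are independent. */
static int von_neumann_extract(int first, int second)
{
    if (first == second)
        return -1;        /* biased pair: discard */
    return first;         /* 01 -> 0, 10 -> 1 */
}
```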
Posted Feb 6, 2012 21:33 UTC (Mon)
by tconnors (guest, #60528)
Because random number generators are only used for cryptography, and only terrorists use cryptography. Are you a terrorist?
Posted Feb 7, 2012 7:50 UTC (Tue)
by cladisch (✭ supporter ✭, #50193)
> Business Justification:
> Core cryptographic functions are used in Windows to provide platform integrity as well as protection of user data.

(note the priorities)

In completely unrelated news, all recent AMD and Intel processors support AES-NI, and Intel has announced that Ivy Bridge processors will have an RNG.
Posted Oct 10, 2010 11:55 UTC (Sun)
by kleptog (subscriber, #1183)
The point is that while the interrupt is predictable, between the time that the interrupt fires and the driver finally gets run you have cache misses at various levels, PCI bus transfers, DRAM refresh cycles and even just hyperthreading making things very unpredictable. Conclusion: if there's predictability here, I couldn't find it (there's a toolkit for estimating randomness, it concluded that the output was indistinguishable from real random data).
The basic idea was to just use the last few bits of the cycle counter, don't worry about the high order bits. The last bit was enough, but even taking the last four bits didn't show any patterns. It might be worth making such a driver for the purpose of giving otherwise entropy starved machines something to work with. I imagine within VMs the cycle counter becomes even more variable, due to contention with things outside the VM.
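A sketch of what such a driver's sampling step might look like on x86, using the TSC via a compiler intrinsic; the masking choice follows the comment above, everything else is illustrative:

```c
#include <stdint.h>
#include <x86intrin.h>

/* Sketch of the idea described above: after an I/O completion, keep
 * only the low-order bits of the cycle counter, which are the ones
 * perturbed by cache misses, bus traffic, DRAM refresh, and so on. */
static unsigned int sample_completion_jitter(void)
{
    uint64_t tsc = __rdtsc();
    return (unsigned int)(tsc & 0xF);   /* low four bits only */
}
```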
Posted Oct 10, 2010 21:56 UTC (Sun)
by man_ls (guest, #15091)

I guess that the problem is to prove that an attacker cannot influence the timers so that the result is predictable. For example a guy on a different VM doing odd things with the same CPU. As it is hard to prove a negative statement of this kind, people may tend to distrust such a source of entropy, even if it sounds really interesting.
Posted Oct 5, 2010 1:19 UTC (Tue)
by dgc (subscriber, #6611)
The IOPS Challenge

o SSDs
  - Ready for 50,000 IOPS per disk?
    + >200,000 ctxsw/s per disk
    + 50,000 intr/s per disk
    + Does not scale to many disks
  - Raw IOP capacity per HBA
    + will be a limiting factor
    + driver design will need to focus on IOPS optimisations,
      not achieving max bandwidth
  - CPU overhead will be high
o Looks more like the network problem
  - similar packet rates to gigabit ethernet per disk,
    many, many more interfaces than a typical network stack
  - HBAs with multiple disks will have to handle packet rates
    closer to 10Gb ethernet
  - similar interrupt scaling tricks will be needed
    + MSI-X directed interrupts
    + one vector per disk behind the HBA?
    + polling rather than interrupt driven
o Will require both hardware and software to evolve
o Not going to happen overnight
o Two orders of magnitude increase in performance is a big disconnect
o Optimisations being made for current (cheap) SSDs have a short life
  - random write performance is not a limiting factor at the high end....

I think this shows the value we have been getting from these workshops - cross pollination of ideas, challenges, techniques, etc across the wider community. We might not see results immediately, but they are eventually appearing...
Posted Oct 5, 2010 8:59 UTC (Tue)
by marcH (subscriber, #57642)
Probably the main reason why such an unfortunate IOPS jump has been forced in networking is backward compatibility. Jumbo frames? Fail because of backward compatibility. Evolving TCP/IP to ease hardware assistance? Fail because of backward compatibility. Etc.
That is because nowhere else is the backward compatibility requirement as strong as in networking. You can easily upgrade your PC. It is even reasonably easy to upgrade your company-wide software. But good luck trying to upgrade the Internet. Or even just Ethernet. See IPv6 for instance: it comes as a brand new feature practically not touching anything already in place, but even such a smooth "upgrade" is a hard sell!
One of the unfortunate consequences is that transferring a DVD image on the network requires millions of IOPS all across the path.
In comparison, the need for backward compatibility in storage is basically nonexistent. So this network/storage analogy must stop somewhere. Would someone from the storage camp please tell us where exactly? Surely reading or writing a DVD image to disk does not/will not require millions of IOPS. Or will it still?
Posted Oct 5, 2010 18:21 UTC (Tue)
by angdraug (subscriber, #7487)

> See IPv6 for instance: it comes as a brand new feature practically not touching anything already in place, but even such a smooth "upgrade" is a hard sell!

Have you seen this article at ArsTechnica? It goes to some lengths to explain the problems with IPv6 transition. If it's to be believed, IPv6 transition is quite far from "smooth".
Posted Oct 5, 2010 23:30 UTC (Tue)
by marcH (subscriber, #57642)
Yes but it would have been much worse (read: impossible) if IPv6 deployment ever required substantial changes to IPv4.
This is an interesting article. Except they are wrong when they claim it is easy to break backward compatibility with Ethernet or TCP. It is not easy, but only "less impossible" than breaking IPv4 backward compatibility.
Note: the focus of the article is obviously neither on Ethernet nor on TCP.
Posted Oct 8, 2010 23:48 UTC (Fri)
by giraffedata (guest, #1954)
> ...
> In comparison, the need for backward compatibility in storage is basically nonexistent.
Well, the whole reason SSDs exist is backward compatibility with rotating media, and it does slow things down considerably. If not for backward compatibility, we wouldn't use SCSI or even Linux block devices to access solid state storage. Write amplification by read-modify-write wouldn't be a problem if the device weren't trying to emulate a 512-byte-sectored disk drive.
Existence of SSDs tells me people aren't willing to replace the entire system at once -- they want to replace just the disk drives.
Not knowing the network issues, though, I can believe that backward compatibility hinders performance less in storage than for Ethernet.
Posted Oct 5, 2010 10:24 UTC (Tue)
by mjthayer (guest, #39183)

Is that really the case? I have trouble imagining that they are willing to drop support for rotational media quite this fast.
Posted Oct 5, 2010 17:29 UTC (Tue)
by strappe (guest, #53440)

A "universal" memory technology has been the holy grail for decades: fast as SRAM, density and non-volatility of Flash, and cost of DRAM. There are various technologies that combine at least some of these characteristics: magneto-resistive (MRAM), ferroelectric (FRAM), phase-change memory (PCM), programmable metalization cell (PMC) and resistive (RRAM). Whether any of these will be commercially viable is still unknown.

I can easily imagine that flash will displace hard drives in most laptops and desktops, but server farms are still going to need massive amounts of cheap storage. Rotating media still has a huge lead in $/bit (100X), so I don't think it will be displaced there any time soon.
Posted Oct 5, 2010 18:04 UTC (Tue)
by jzbiciak (guest, #5246)
I was thinking more in terms of treating flash specifically as less like an "I/O" device and more like a slow memory. I have no doubt that spinning rust will be around for awhile--a decade or more at least. It just seems like wrapping the flash behind a "disk drive" abstraction in hardware puts some artificial upper limits on how well it can perform. It's acceptable with spinning rust because the electronics are so much faster. When you go all solid-state, it just feels like a bottleneck. Imagine what would happen if the immense creativity of the kernel crowd were unleashed on the problem of load balancing writes, erases and reads across a parallel array of raw flash modules? Approaches such as UBI/UBIFS sound rather promising. I generally like the idea of owning the problem in kernel space, where it seems like we ought to be able do much more deliberate and proactive scheduling.
Posted Oct 5, 2010 18:36 UTC (Tue)
by dlang (guest, #313)
The requirement to do bulk deletes makes it far more like spinning disks than RAM.
Posted Oct 5, 2010 19:27 UTC (Tue)
by jzbiciak (guest, #5246)
It certainly is random access. I can generally send a command for address X followed by a command for address Y to the same chip, where the response time is not a function of the distance between X and Y, except when they overlap. Instead, the performance is most strongly determined by what commands I sent[*]. Reads are much faster than writes, and both are much, much faster than sector erase.

The opposite is generally true of disks. There, the cost of an operation is more strongly determined by whether it triggered a seek (and how far the seek went) than if the operation was a read or a write. Both reads and writes require getting the head to a particular position on the platter, ignoring any cache that might be built into the drive. Also, under normal operation, spinning-rust drives don't really have an analog to "sector erase." (Yes, there's the old "low-level format" commands, but those aren't generally used during normal filesystem operation.)

[*] Ok, so that's not 100% true, but essentially true in the current context. NAND flash has a notion of "sequential page read" versus "random page read". If you're truly reading random bytes a'la DRAM w/out cache, you'll see noticeably slower performance if the two reads are in different pages. But, if you're doing block transfers, such as 512-byte sector reads, you're reading the whole page. Hopping between any two sectors always costs about the same. Here, read a data sheet! For this particular flash, a random sector read is 10us, sector write is 250us, and page erase is 2ms. The whole page-open/page-close architecture makes it look much more like modern SDRAM than disk.
Posted Oct 5, 2010 20:38 UTC (Tue)
by jzbiciak (guest, #5246)
You can do random writes to random empty sectors. Again, that's nothing like how a hard disk works. I'm still strenuously disagreeing with your earlier statement that flash's properties make it more like a disk than like RAM. It's really an entirely different beast worthy of separate consideration, which is why I think wrapping it up in an SSD limits its potential. With flash, you need entirely new strategies that apply neither to disks nor RAM to get the full benefit from the technology.

Much of the effort spent on disks revolves (no pun intended) around eliminating seeks. No such effort is required with RAM or with flash. Flash does require you to think about how you pool your free sectors, though, and how you schedule writing versus erasing. I won't deny that. Rather, I say it only further invalidates your original conjecture that it makes flash more like disks. (I will agree it makes it less like RAM though.)

Because seeks are "free", I could totally see load balancing algorithms of the form "write this block to the youngest free sector on the first available flash device", so that a new write doesn't get held up by devices busy with block erases. That looks nothing like what you'd want to do with a disk. It takes advantage of the "free seek" property of the flash while helping to hide the block erase penalty it imposes. Neither property is a property of a disk drive. Of course, neither property is a property of RAM, either.

Am I splitting hairs over semantics here? Let me step back and summarize, and see if you agree:

- Raw flash's random access capability and relatively low access time can make it much more like RAM than disk, especially in terms of bandwidth and latency.
- Raw flash's limitations on writes, however, require the OS to have flash-specific write strategies. They prevent the OS from treating flash identically to RAM, and will require careful thought to be handled correctly. This is similar to how we had to put careful thought into disk scheduling algorithms, even if flash requires entirely different algorithms to address its unique properties.
Posted Oct 9, 2010 14:10 UTC (Sat)
by joern (guest, #22392)
Intriguing. Can you elaborate a bit? What difference does it make vs. the naïve approach of erasing before writing?
Posted Oct 9, 2010 14:55 UTC (Sat)
by dlang (guest, #313)
You also have the problem that erasing takes a significant amount of time and power to accomplish, so you don't want to wait until you need a sector before erasing it - but you also don't want to erase when you don't need to while running on battery.
Posted Oct 9, 2010 15:03 UTC (Sat)
by jzbiciak (guest, #5246)
Note: I'm not an expert. Please do not mistake me for one. :-) Here are my observations, though, along with things I've read elsewhere.

Flash requires wear leveling in order to maximize its life. For the greatest effect, you want to wear level across the entire device, which means picking up and moving otherwise quiescent data so that each sector sees approximately the same number of erasures. That's one aspect.

Another aspect is that erase blocks are generally much larger than write sectors. So, when you do erase, you end up erasing quite a lot. Furthermore, erasure is about an order of magnitude slower than writing, and writing is about an order of magnitude slower than reading. For a random flash device whose data sheet I just pulled up, a random read takes 25us, page program takes 300us, and block erase takes 2ms. Pages are 2K bytes, whereas erase blocks are 128K bytes.

(Warning: This is where I get speculative!) And finally, if you have multiple flash devices (or multiple independent zones on the same flash device), you can take advantage of that fact and the fact that "seeks are free" by redirecting writes to idle flash units if others are busy. That's probably the most interesting area to explore algorithmically, IMO. Given that an erase operation can take a device out of commission for 2ms, picking which device to start an erase operation on and when to do it can have a pretty big impact on performance. If you can do background erase on idle devices, for example, then you can hide the cost.
Posted Oct 7, 2010 12:38 UTC (Thu)
by nix (subscriber, #2304)

> NAND flash has a notion of "sequential page read" versus "random page read". If you're truly reading random bytes a'la DRAM w/out cache, you'll see noticeably slower performance if the two reads are in different pages.

That sounds just like normal RAM: if you don't have to specify the row *and* column, you save on one CAS/RAS select cycle. Of course this is hidden behind the MMU and CPU cache management code and so on, so we don't often notice it, but it is there.
Posted Oct 5, 2010 18:07 UTC (Tue)
by wingo (guest, #26929)
I asked Michael Meeks a couple of FOSDEMs ago about how his iogrind disk profiler was coming, and he said that he had totally dropped it, because SSDs will kill all these issues. Sounds easier than fixing OpenOffice.org^WLibreOffice issues in code...
Is the "best practice" going to shift away from implementing things like GTK's icon cache and other purely seek-avoiding caches?
Posted Oct 5, 2010 22:29 UTC (Tue)
by zlynx (guest, #2285)
GTK applications' current "best practice" of "ignore the RAM use, they can buy more" has already destroyed the usefulness of old hardware with a modern Linux software stack.
Posted Oct 6, 2010 0:16 UTC (Wed)
by mpr22 (subscriber, #60784)

Eight Megabytes And Constantly Swapping. This is not a new phenomenon.
Posted Oct 6, 2010 1:23 UTC (Wed)
by dlang (guest, #313)
yes we are doing more with our systems, but nowhere near that much more.
Posted Oct 6, 2010 9:23 UTC (Wed)
by marcH (subscriber, #57642)
(Here I am ignoring SSDs, still too new to be part of The History)
Posted Oct 6, 2010 11:04 UTC (Wed)
by dlang (guest, #313)
in terms of size, drives have grown at least 1000x
in terms of sequential I/O speeds they have improved drastically (I don't think quite 1000x, but probably well over 100x, so I think it's in the ballpark)
in terms of seek time, they've barely improved 10x or so
this is ignoring things like SSDs, high-end raid controllers (with battery backed NVRAM caches) and so on which distort performance numbers upwards.
But yes, the performance difference between the CPU registers and disk speeds is being stretched over time.
Even just the difference in speed between the registers and RAM is getting stretched to the point where people are seriously suggesting that it may be a good idea to start thinking of RAM as a block device, accessed in blocks of 128-256 bytes (the cache line size of the CPU). Right now the CPU hides this from you by 'transparently' moving the blocks in and out of the caches of the various processors, so that you can ignore it if you choose.
But when you are really after performance, a high end system starts looking very strange. You have several sets of processors that share a small amount of high-speed storage (L2/L3 cache) and have a larger amount of lower speed storage (the memory directly connected to that CPU), plus a network to access the lower speed storage connected to other CPUs. Then you have a lower speed network to talk to the southbridge chipset to interact with the outside world (things like your monitor/keyboard/disk drives, PCI-e cards, etc).
This is a rough description of NUMA and the types of things that you can run into on large multi-socket systems, but the effect starts showing up on surprisingly small systems (which is why per-cpu variables and such things are used so frequently)
Posted Oct 14, 2010 19:29 UTC (Thu)
by Wol (subscriber, #4433)
Three slots, max capacity 256Mb per slot, three 256Mb chips in the machine.
"That's no problem, they can just buy a new machine ..."
Cheers,
Wol
Posted Oct 6, 2010 22:05 UTC (Wed)
by eds (guest, #69511)
At the extreme high end of PCIe SSDs, a system trying to do lots of small (4k) reads with high parallelism will be limited by having any queue locking at all. Running without a request queue remains an attractive option for these devices.
Another future improvement to watch out for is MSI-X interrupts. With MSI-X, it is possible to statically assign an interrupt to a single CPU core in such a way that an I/O retirement could interrupt the originating CPU directly; over about 600K IOPS it becomes important to spread out the interrupt/retirement workload as much as possible.
Posted Oct 9, 2010 0:00 UTC (Sat)
by giraffedata (guest, #1954)
> While workloads will vary, Jens says, most I/O patterns are dominated by random I/O and relatively small requests.

There are so many ways to count "most" that this fact is pretty useless. Jens should just say, "some important I/O patterns are ...," which is reason enough to do this work.
I see a lot of thought wasted prioritizing things based on arbitrary "mosts": Most I/Os are reads, most files are under 4K, most computers are personal workstations.
Posted Oct 15, 2010 17:42 UTC (Fri)
by jmy3056 (guest, #70648)
Media that stores electronic information that used to spin but now doesn't is a closer parallel with RAM. Optimizations for "disk" IO need to follow a similar path as OS/kernels when dealing with RAM.
Posted Oct 22, 2010 22:04 UTC (Fri)
by eds (guest, #69511)
1. Addressing: DRAM is byte/word addressable; NAND flash is not. NAND flash pages are currently 4KB in size and must be read/written as a unit.

2. Flash management: flash sucks. It has long erase times, needs wear-leveling, needs lots of ECC and redundancy to be reliable. Dealing with flash requires a lot of careful management that nobody's going to want on a DRAM-like path.

3. Speed: flash is a lot faster than disk. But it's still a lot slower than DRAM (a write to a busy NAND part may have to wait up to 1ms).

4. Size: it's very expensive to try to address a terabyte of DRAM. 64-bit CPUs don't actually implement a full 64-bit address space. It's much cheaper to just address huge storage devices in blocks, like a disk.

If in a few more years phase-change memory becomes big and cheap enough to give NAND flash a run for its money, then it may be time to start treating nonvolatile memory sort of like DRAM. But that day isn't quite here yet.