NFS: the early years

June 20, 2022

This article was contributed by Neil Brown

I recently had cause to reflect on the changes to the NFS (Network File System) protocol over the years and found that it was a story worth telling. It would be easy for such a story to become swamped by the details, as there are many of those, but one idea does stand out from the rest. The earliest version of NFS has been described as a "stateless" protocol, a term I still hear used occasionally. Much of the story of NFS follows the growth in the acknowledgment of, and support for, state. This article looks at the evolution of NFS (and its handling of state) during the early part of its life; a second installment will bring the story up to the present.

By "state" I mean any information that is remembered by both the client and the server, and that can change on one side, thus necessitating a change on the other. As we will see, there are many elements of state. One simple example is file content when it is cached on the client, either to eliminate read requests or to combine write requests. The client needs to know when cached data must be flushed or purged so that the client and server remain largely synchronized. Another obvious form of state is file locks, for which the server and client must always agree on what locks the client holds at any time. Each side must be able to discover when the other has crashed so that locks can be discarded or recovered.

NFSv2 — the first version

Presumably there was a "version 1" of NFS developed inside Sun Microsystems, but the first to be publicly available was version 2, which appeared in 1984. The protocol is described in RFC 1094, though this is not seen as an authoritative document; rather, the implementation from Sun defined the protocol. There were other network filesystems being developed around the same time, such as AFS (the Andrew File System), and RFS (Remote File Sharing). One distinctive difference that NFS had, when compared to these, is that it was simple. One might argue that it was too simple, as it could not correctly implement some POSIX semantics. However, this simplicity meant that it could provide good performance for a lot of common workloads.

The early 1980s was the time of the "3M Computer" which suggested a goal for personal workstations of one megabyte of memory, one MIPS of processing power, and one megapixel (monochrome) of display. This seems almost comically underpowered by today's standard, particularly when one considers that a price tag of a mega-penny ($10,000) was thought to be acceptable. But this was the sort of hardware on which NFSv2 had to run — and had to run well — in order to be accepted. History suggests that it was adequate to the task.

Consequences of being "stateless"

The NFSv2 protocol has no explicit support for any state management. There is no concept of "opening" a file, no support for locking, nor any mention of caching in the RFC. There are only simple, self-contained access requests, all of which involve file handles.

The "file handle" is a central unifying feature of NFSv2: it is an opaque, 32-byte identifier for a file that is stable and unique within a given NFS server across all time. NFSv2 allows the client to look up the file handle for a given name in a given directory (identified by some other file handle), to inspect and change attributes (ownership, size, timestamps, etc.) given a file handle, and to read and write blocks of data at a given offset of a given file handle.

As far as possible, the operations chosen for NFSv2 are idempotent, so that, if any request were repeated, it would have the same result on the second or third attempt as it had on the first. This is necessary for true stateless operation over an imperfect network. NFS was originally implemented over UDP, which does not guarantee delivery, so the client had to be prepared to resend a request if it didn't get a reply. The client cannot know if it was the request or the reply that was lost, and a truly stateless server cannot remember if any given request has been seen already so that it can suppress a repeat. Consequently, when the client resends a request, it might repeat an operation that has already been performed, so idempotent operations are best.

Unfortunately, not all filesystem operations under POSIX can be idempotent. A good example is MKDIR, which should make a directory if the given name is not in use, or return an error if the name is already used, even if it is used for a directory. This means that repeating the request can result in an incorrect error result. Standard practice for minimizing this problem is to implement a Duplicate Request Cache (DRC) on the server. This is a record of recent, non-idempotent requests that have been handled, along with the result that was returned. Effectively, this means that both the client (which must naturally track requests that have not yet received a reply) and the server maintain a list of outstanding requests that changes over time. These lists match our definition of "state", so the original NFSv2 was not actually stateless in practice, even if it was according to the specification.

As the server cannot know when the client sees a reply, it cannot know when a request is no longer outstanding, so it must use some heuristics to discard old cache entries. It will inevitably remember many requests that it doesn't need to, and may discard some that will soon be needed. While this is clearly not ideal, experience suggests that it is reasonably effective for normal workloads.

Maintaining this cache requires that the server knows which client each request came from, so it needs some reliable way to identify clients. This is a need that we will see repeated as state management becomes more explicit with the development of the protocol. For the DRC, the client identifier used is derived from the client's IP address and port number. When TCP support was added to NFS, the protocol type needed to be included together with the host address and port number. As TCP provides reliable delivery, it might seem that the DRC is not needed, but this isn't entirely true. It is possible for a TCP connection to "break" if a network problem causes the client and server to be unable to communicate for an extended period of time. NFS is prepared to wait indefinitely, but TCP is not. If TCP does break the connection, the client cannot know the status of any outstanding requests, so it must retransmit them on a new connection, and the server might still see duplicate requests. To make sure this works, NFS clients are careful to reconnect using the same source port as the earlier connection.

A duplicate request cache is not perfect, partly because the heuristic may discard entries before the client has actually received the reply, and partly because it is not preserved across server reboots, so a request might be acted upon both immediately before and after a server crash. In many cases, this is an occasional inconvenience but not a huge problem; will anyone really suffer if "mkdir" occasionally returns EEXIST when it shouldn't? But there is one situation that turned out to be quite problematic and isn't handled by the DRC at all. That is exclusive create.

Before Unix had any concept of file locks (as it didn't in Edition 7 Unix, which became the base for BSD), it was common to use lock files. If exclusive access was required to some file, such as /usr/spool/mail/neilb, the convention was that the application must first create a lock file with a related name, such as /usr/spool/mail/neilb.lock. This must be an "exclusive" creation using the flags O_CREAT|O_EXCL, which would fail if the file already existed. An application that found that it couldn't create the file because some other application had done so already would wait and try again.

Exclusive create is not an idempotent operation — by design — and NFSv2 has no support for it at all. Clients could perform a lookup and, if that reported no existing file, they could then create the file. This two-step sequence is clearly susceptible to races, so it is not reliable. This failing of NFS does not appear to have decreased its popularity, but certainly resulted in a lot of cursing over the years. It also resulted in some innovation; there are other ways to create lock files.

One way is to generate a string that will be unique across all clients — possibly with host name, process ID, and timestamp — and then create a temporary file with this string as both name and content. This file is then (hard) linked to the name for the lock file. If the hard-link succeeds, the lock has been claimed. If it fails because the name already exists, then the application can read the content of that file. If it matches the generated unique string, then the error was due to a retransmit and again the lock has been claimed. Otherwise the application needs to sleep and try again.

Another unfortunate consequence of avoiding state management involves files that are unlinked while they are still open. POSIX is perfectly happy with these unlinked-but-open files and assures that the file will continue to behave normally until it is finally closed, at which point it will cease to exist. An NFS server, since it does not know which files are open on which client, finds it difficult to be so accommodating, so NFS client implementations don't rely on help from the server. Instead, when handling a request to unlink (remove) a file that is open, the client will instead rename the file to something obscure and unique, like .nfs-xyzzy, and will then unlink this name when the file is finally closed. This relieves the server from needing to track the state of the client, but is an occasional inconvenience to the client. If an application opens the only file in some directory, unlinks the file, then tries to remove the directory, that last step will fail as the directory is not empty but contains an obscure .nfs-XX name — unless the client moves the obscure name into a parent or converts the RMDIR into another rename operation. In practice this sequence of operations is so rare that NFS clients don't bother to make it work.

The NFS ecosystem

When I said above that NFSv2 didn't support file locking, that is only half the story — it is accurate but not complete. NFS was, in fact, part of a suite of protocols that could be used together to provide a more complete service. NFS didn't support locks, but there was another protocol that did. The protocols that could be used with NFS include:

NLM (the Network Lock Manager). This allows the client to request a byte-range lock on a given file (identified using an NFS file handle), and allows the server to grant it (or not), either immediately or later. Naturally this is an explicitly stateful protocol, as both the client and server must maintain the same list of locks for each client.
STATMON (the Status Monitor). When a node — whether client or server — crashes or otherwise reboots, any transient state, such as file locks, is lost, so its peer needs to respond. A server will purge the locks held by that client, while a client will try to reclaim the locks that were lost. The chosen method with NLM is to have each node record a list of peers in stable storage, and to notify them all when it reboots; they can then clean up. This task of recording and then notifying peers was the job of STATMON. Of course, if a client crashed while holding a lock and never rebooted, the server would never know that the lock was no longer held. This could, at times, be inconvenient.
MOUNT. When mounting an NFSv2 filesystem, you need to know the file handle for the root of the filesystem, and NFS has no way to provide this. The task is handled instead by the MOUNT protocol. This protocol expects the server to keep track of which clients have mounted which filesystems, so this useful information can be reported. However, as MOUNT doesn't interact with STATMON, clients can reboot and effectively unmount filesystems without telling the server. While implementations do still record the list of active mounts, nobody trusts them.

In later versions, MOUNT also handled security negotiations. A server might require some sort of cryptographic security (such as Kerberos) for accessing some filesystems, and this requirement is communicated to the client using the MOUNT protocol.
RQUOTA (remote quotas). NFS can report various attributes of files and of filesystems, but one attribute that is not supported is quotas — possibly because these are attributes of users, not of files. To fill this gap for people who need it to be filled, there exists the RQUOTA protocol.
NFSACL (POSIX draft ACLs). Much as we have RQUOTA for quotas, we have NFSACL for access control lists. This allows both examining the ACLs and (unlike RQUOTA) setting them.

Beyond these, there are other protocols that are only loosely connected, such as "Yellow Pages", also known as the Network Information Server (NIS), which helped a collection of machines have consistent username-to-UID mappings; "rpc.ugid", which tried to help out when they didn't; and maybe even NTP which ensures that an NFS client and server had the same idea of the current time. These aren't really part of NFS in any meaningful sense, but are part of the ecosystem that allowed NFS to flourish.

NFSv3 — bigger is better.

NFSv3 came along about ten years later (1995). By this time, workstations were faster (and more colorful) and disk drives were bigger. 32 Bits were no longer enough to represent the number of bytes in a file, blocks in a filesystem, or inodes in a filesystem, and 32 bytes were no longer enough for a file handle, so these sizes were all doubled. NFSv3 also gained the READDIRPLUS operation to receive the names in a directory together with file attributes, so that ls -l could be implemented more efficiently. Note that deciding when to use READDIRPLUS and when to use the simpler READDIR is far from trivial. The Linux NFS client is still, in 2022, receiving improvements to the heuristics.

There were two particular areas of change that relate to state management, one which addressed the exclusive-create problem discussed above, and one which helped with maintaining a cache of data on the client. The first of these extended the CREATE operation.

In NFSv3, a CREATE request can indicate whether the request is UNCHECKED, GUARDED, or EXCLUSIVE. The first of these allows the operation to succeed whether the file already exists or not. The second must fail if the file exists, but it is like MKDIR in that a retransmission may result in an error where there shouldn't be one, so it is not particularly helpful. EXCLUSIVE is more interesting.

The EXCLUSIVE create request is accompanied by eight bytes of unique client identification (our recurring theme) called a "verifier". The RFC (RFC 1813) suggests that "perhaps" this verifier could contain the client's IP address or some other unique data. The Linux NFS client uses four bytes of the jiffies internal timer and four bytes of the requesting process's process ID number. The server is required to store this verifier to stable storage atomically while creating the file. If the server is later asked to create a file which already exists, the stored client identifier must be compared with that in the request and, if they match, the server must report a successful exclusive creation on the assumption that this is a resend of an earlier request.

The Linux NFS server stores this verifier in the mtime and atime fields of the file it creates. The NFSv3 protocol acknowledges this possibility and requires that, once the client receives the reply indicating successful creation, it must issue a SETATTR request to set correct values for any file attributes that the server might have overloaded to store the verifier. This SETATTR step acknowledges to the server the completion of some non-idempotent request — exactly what we thought might have been helpful for the DRC implementation.

Client-side caching and close-to-open cache consistency

The NFSv2 RFC did not describe client-side caching, but that doesn't mean that implementations didn't do any. They had to be careful though. It is only safe to cache data if there is good reason to expect that the data hasn't changed on the server. NFS practice provides two ways for the client to convince itself that cached data is safe to use.

The NFS server can report various attributes of a file, particularly size and last-change time. If neither of these change from previously seen values, it might be reasonable to assume that the file content hasn't changed. NFSv2 allows the change timestamp to be reported to the closest microsecond, but that doesn't guarantee that the server maintains that level of precision. Even twenty years after NFSv2 was first used, there were important Linux filesystems that could only report one-second granularity for time stamps. So, if an NFS client sees a timestamp that is at least one second in the past, and then reads data, it is safe to cache that data until it sees the timestamp change. If it sees a timestamp that is within one second of "now", then it is much less safe to make assumptions.

NFSv3 introduced an FSINFO request that allowed the server to report various limits and preferences, and included a "time_delta", which is the time granularity that can be assumed for change time and other timestamps. This allows client-side caching to be a little more precise.

As noted above, it is considered safe to use cached data for a file until its attributes are seen to change. The client could choose never to look at the file attributes again and, thus, never see a change, but that is not permitted. The way to affirm data safety consists of two rules about when the client must check the attributes.

The first rule is simple: check occasionally. The protocol doesn't specify minimum or maximum timeouts but most implementations allow these to be configured. Linux defaults to a three-second timeout which is extended exponentially as long as nothing appears to be changing, to a maximum of one minute. This means that the client might provide data from cache that is up to 60 seconds old, but no longer. The second rule builds on an assumption that multiple applications never open the same file at the same time, unless they use locking or they are all read-only.

When a client opens a file, it must verify any cached data (by checking timestamps) and discard any that it cannot be confident of. As long as the file remains open, the client can assume that no changes will happen on the server that it doesn't request itself. When it closes the file, the client must flush all changes to the server before the close completes. If each client does this, then any application that opens a file will see all changes made by any other application on any client that closed the file before this open happened, so this model is sometimes referred to as "close-to-open consistency".

When byte-range locking is used, the same basic model applies, but the open operation becomes the moment when the client is granted a lock, and the close is when it releases the lock. After being granted a lock, the client must revalidate or purge any cached data in the range of the lock and, before releasing a lock, it must flush cached changes in this region to the server.

As the above relies on the change time to validate the cache, and as the change time updates whenever any client writes to the file, the logical implication is that, when a client writes to a file, it must purge its own cache since the timestamp has changed. It is quite justified to maintain the cache until the file is closed (or the region is unlocked), but not beyond. This need is particularly visible when byte-range locking is used. One client might lock one region, write to it, and unlock. Another client might lock, write, and unlock a different region, with the write requests happening at exactly the same time. There is no way that either client can tell if another client wrote to the file or not, as the timestamp covers the whole file, not just one range. So they must both purge their whole cache before the next time the file is opened or locked.

At least, there was no way to tell before NFSv3 introduced weak cache consistencies (wcc) attributes. The reply to an NFSv3 WRITE request allows the server to report some attributes — size and time stamps — both before and after the write request, and requires that, if it does report them, then no other write happened between the two sets of attributes. A client can use this information to detect when a change in timestamps was due purely to its own writes, and when they were due to some other client. It can, thus, determine whether it is the only client writing to a file (a fairly common situation) and, when so, preserve its cache even though the timestamp is changing. Wcc attributes are also available in replies to SETATTR and to requests that modify a directory, such as CREATE or REMOVE, so a client can also tell if it is the sole actor in a directory, and manage its cache accordingly.

This is "weak" cache consistency, as it still requires the client to check the timestamps occasionally. Strong cache consistency requires the server to explicitly tell the client that change is imminent, and we don't get that until a later version of the protocol. Despite being weak, it is still a clear step forward in allowing the client to maintain knowledge about the state of the server, and so another nail in the coffin of the fiction of a stateless protocol.

As an aside, the Linux NFS server doesn't provide these wcc attributes for writes to files. To do this, it would need to hold the file lock while collecting attributes and performing the write. Since Linux 2.3.7, the underlying filesystem has been responsible for taking the lock during a write, so nfsd cannot provide the attributes atomically. Linux NFS does provide wcc attributes for changes to directories, though.

NFS — the next generation

These early versions of NFS were developed within Sun Microsystems. The code was made available for other Unix vendors to include in their offerings and, while these vendors were able to tweak the implementation as needed, they were not able to change the protocol; that was controlled by Sun.

As the new millennium approached, interest in NFS increased and independent implementations appeared. This resulted in a wider range of developers with opinions — well-informed opinions — on how NFS could be improved. To satisfy these developers without risking dangerous fragmentation, a process was needed for those opinions to be heard and answered. The nature of this process and the changes that appeared in subsequent versions of the NFS protocol will be the subject of a forthcoming conclusion to this story.

Index entries for this article
Kernel	Filesystems/NFS
GuestArticles	Brown, Neil

NFS: the early years

Posted Jun 20, 2022 23:27 UTC (Mon) by willy (subscriber, #9762) [Link] (5 responses)

In some ways, NFS is an example of worse-is-better. While UDP isn't reliable, in practice NFS sucked too much over a WAN, and didn't support federated UIDs anyway, so it was used on LANs, which meant a reliable Ethernet physical layer.

The very notion of a stateless filesystem is ridiculous. Filesystems exist to store state. How tightly coupled the client & server are and how much the client and server trust each other are legitimate areas for discussion.

For those who don't know, I once wrote an NFS 2/3 server in ARM assembler. It was the nineties ...

NFS: the early years

Posted Jun 21, 2022 11:03 UTC (Tue) by nix (subscriber, #2304) [Link] (1 responses)

Well, it's not a stateless filesystem, is it. It's a stateless *protocol* (implementing a filesystem which is of course stateful). This was useful back in the day when you had one fileserver and about ten thousand clients on massively collision-prone university thicknet networks that ran at the speed of a degraded dog[1]; the fileserver could never have managed to keep track of anything nontrivial about all those clients, so it was damn good it didn't have to. These days that sort of environment is more or less history...

[1] I remember quite late in this era, in 1995, finding that 97% of all packets on my university comp sci lab's 10Mb/s Ethernet appeared to be retransmissions... that network was *unusable*.

NFS: the early years

Posted Jun 22, 2022 10:17 UTC (Wed) by donaldh (subscriber, #151569) [Link]

Ah yes, my comp-sci lab was similarly hosed. Not helped at all by the diskless Sun Sparc SLCs that had root fs and swap mounted via NFS.

NFS: the early years

Posted Jun 21, 2022 12:43 UTC (Tue) by ballombe (subscriber, #9523) [Link]

In 1995, stateless means that any client or server node could reboot without corrupting the others,
but also that 'append' was emulated by truncate + rewrite, which was very inefficient.
In practice, the directory cache would fill the server memory about once a week at my site, so a cronjob was setup to reboot it every night. This was a good remainder to go to sleep since otherwise, you would
get "NFS server not responding still trying" for about 20 minutes...

But still, NFS in 1995 was pretty neat. The fact that we still speak about this 1984 technology today speak volume.

NFS: the early years

Posted Jun 21, 2022 16:29 UTC (Tue) by Sesse (subscriber, #53779) [Link] (1 responses)

My workstation used to be rootless, with NFS root. Once, I took it to a site 8 ms away (on gigabit Ethernet all the way, effectively zero congestion). I remember it took more than half an hour to boot…

NFS: the early years

Posted Jun 22, 2022 8:23 UTC (Wed) by geert (subscriber, #98403) [Link]

I ran the second stage of debootstrap on a dev board in JP using nfsroot, connected to my NFS server (in EU) using TAP/TUN over SSH. Took all night to complete, but it did work.
Subsequent boot ups indeed took ca. 30 minutes, so it looks like 8 ms or 300 ms don't seem to make much of a difference?

NFS: the early years

Posted Jun 21, 2022 2:14 UTC (Tue) by lathiat (subscriber, #18567) [Link] (8 responses)

My colleague Jay only recently told me about the dangerous fragmentation issue alluded to at the end which I found totally fascinating which I'd not heard of despite using NFS since before 2007. I look forward to that part of the story.

NFS: the early years

Posted Jun 21, 2022 5:54 UTC (Tue) by pwfxq (subscriber, #84695) [Link] (5 responses)

Talk of NFS and fragmentation brings back bad memories of mixed FDDI/Ethernet networks and having to manually set the MTU everywhere.

NFS: the early years

Posted Jun 21, 2022 20:00 UTC (Tue) by dublin (guest, #114125) [Link] (4 responses)

I was one of the architects of Chevron's IP network in the early 90s. I was puzzled when one of our customer groups requested an SGI box for their NFS server, since I was pretty familiar with the workings of NFS and knew that Sun's implementation *should* be superior (I'd compared them in the past).

At first I thought this group was falling for a bunch of SGI sales BS, but they still had the demo box in place on our network, and sure enough, it flat left our best Sun and DEC Ultrix NFS servers in the dust - both of which were were notably faster than the IBMs and HPs of the day. But how?

This set a group of us protocol performance jocks on a quest to get to the bottom of why, on the same networks, SGI could deliver such dramatically better NFS performance. (Only FDDI was 100 Mbps then: All our Ethernet was 10 Mbps, and though we were playing with the first Kalpana Etherswitch, that wasn't in play here, and we were testing the servers and clients on the same segment.) One thing we noticed after a day of poring over network analyzer data was that performance was a bit burstier than normal, and throughput was great, but that the overall network utilization was actually a bit *less*. SGI was proud of their newfound performance, but it was pretty clear that their SEs had no clue *why* it got so much better all of a sudden. Curiouser and curiouser...

We finally noticed that it wasn't just bursty - all SGI responses were back-to-back: SGI was cheating - and it turns out, quite elegantly. They were managing to deliver complete NFS blocks all at once, years before jumbo frames were even a thing. (NFSv2 blocks were 8K, vs the 1500 byte MTU of all Ethernet at the time.) What made the SGI such an NFS screamer was that they brilliantly violated the Ethernet standard, with no real significant downside: They simply sent all six frames of an NFS block out one after another, with NO chance for anyone to interrupt - they had implemented stateful semantics into their Ethernet driver for NFS serving! When sending one of the up to six frames in an NFS block, the last byte of one frame was *immediately* followed by the preamble for the next one, with no silence or backoff as would normally be required by the spec. This eliminated the potential for Ethernet collisions between these frames, since all other nodes would always lose and have to back off! Since no other node ever had the chance to interrupt, the effect was that even prior to jumbo frames, the server was delivering an entire NFS block as a train, and the impact on the network from collisions and retransmissions was greatly reduced overall, not to mention that NFS performance to the clients was much faster.

All in all, it was one of the cleverest network protocol performance hacks I've ever seen, and certainly worked extremely well in optimizing NFS performance over those old style Ethernet networks. Only a few years later, we had 100 Mbps Ethernet, Jumbo frames, and cut-through switching a la Kalpana was mainstream, so this killer hack was only known to a few, and vanished without a trace. (As far as I know, SGI never fessed up to this, I think because they feared the potential backlash of enterprise customers who couldn't stomach the idea of a vendor violating the sacrosanct 802.3 standard, even if doing so was a win-win in this case...)

NFS: the early years

Posted Jun 21, 2022 23:52 UTC (Tue) by jwarnica (subscriber, #27492) [Link] (2 responses)

This would only decrease collisions if everyone else was unnecessary-out-of-spec level good at the "carrier sense" part of the paradigm. Which, of course, is statistically unlikely at scale.

So I'm a tad confused....

NFS: the early years

Posted Jun 22, 2022 8:24 UTC (Wed) by ewen (subscriber, #4772) [Link] (1 responses)

If you transmit frames back to back with no gap, then other nodes doing carrier sense would still be hearing your earlier transmission (previous frame) as you start the next one. So ordinary carrier sense plus delay to transmit would cause other nodes not to talk over you, as they’d never observe a quiet period before the (second or subsequent) frame started being sent.

Obviously they could still talk over the first frame as normal, as they wouldn’t hear the start of the transmission for a while (hence minimum frame lengths, so the sender can learn overtalk happened).

But the result should be either you lose the first frame due to overtalk (and don’t send the rest), or you send all frames before anyone else gets a word in, without interruption/overtalk. On a NFS heavy, reasonably congested network it’s probably a net win for everyone over constantly losing one in N frames out of a larger NFS request/reply (and thus endless retries tying up the shared medium).

Seems like a very clever hack, if a bit “unfair”/“greedy” of a shared medium if the “fake jumbo frames” aren’t reasonably well separated from more “fake jumbo frames”.

Ewen

NFS: the early years

Posted Jun 22, 2022 21:48 UTC (Wed) by dublin (guest, #114125) [Link]

Yep, that's it. It *was* a bit unfair/greedy, but in a way that actually made throughput better for everyone, since NFS was a fair portion of the traffic (with X being a large part of the rest - it didn't impact that enough to be a real concern.

Also, we never had all that many SGIs serving NFS, so I have no idea if this might have fallen over at a larger scale...)

BTW, this technique eliminated NFS-related collisions and backoffs, but that doesn't impact things nearly as much as people think - CSMA/CD is really pretty efficient: If you assume that *every other packet* (50%, crazy high) collided, you *still* got 97% of Ethernet's throughput, even with that many backoffs. Do the math if you don't believe me...) I think it was mostly fast because it kept the NFS pipeline flowing on both client and server...

NFS: the early years

Posted Jun 22, 2022 8:18 UTC (Wed) by geert (subscriber, #98403) [Link]

I guess this made the situation even worse with some of the early PC Ethernet cards, which didn't have enough RAM to hold 8 KiB worth of packets in their receive buffer, and/or couldn't handle back-to-back packets?

NFS: dangerous fragmentation

Posted Jun 23, 2022 17:39 UTC (Thu) by giraffedata (guest, #1954) [Link] (1 responses)

My colleague Jay only recently told me about the dangerous fragmentation issue alluded to at the end

Neil appears to be talking about fragmentation of the protocol and the developer community, whereas you seem to be talking about something technical.

NFS: dangerous fragmentation

Posted Jun 23, 2022 21:57 UTC (Thu) by neilbrown (subscriber, #359) [Link]

> Neil appears to be talking about fragmentation of the protocol and the developer community, whereas you seem to be talking about something technical.

Indeed. The only technical issue that I know of which involves NFS and fragmentation relates to UDP packets being fragmented into IP packets, and not necessarily being assembled properly.
See "man 5 nfs" and the section titled "Using NFS over UDP on high-speed links".
You can find that man page at http://man.he.net/man5/nfs if you don't want to leave your browser just now.
Or maybe https://www.man7.org/linux/man-pages/man5/nfs.5.html but that site isn't responding for me just now.

NFS: the early years

Posted Jun 21, 2022 4:59 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

A couple of years ago I implemented a simple NFSv4 client in Go ( https://github.com/Cyberax/go-nfs-client ) for AWS EFS, and I was really surprised by how _easy_ it was to do. In fact, almost all of it was written within just 1 day, while I was anxiously waiting for the election results.

NFS is a really great example of how to create simple and robust distributed systems.

NFS: the early years

Posted Jun 21, 2022 14:01 UTC (Tue) by amcrae (guest, #25501) [Link] (1 responses)

One of the key aspects of NFS that set it apart from a variety of other contenders for a distributed filesystem at the time was the consistent naming that allowed it to integrate cleanly as part of a normal filesystem hierarchy.
Other protocols required special names (one used "..." as a special path to access the remote filesystem).
NFS allowed you to mount a remote filesystem anywhere in the hierarchy and let it appear as part of the normal directory structure with no special naming or other special characteristics. This meant that nearly all programs would work with no changes.
NFS as a protocol wasn't perfect, but is was a heck of a lot better at the time than most other alternatives.
I also remember PC-NFS, which was a package that allow simple-minded IBM-PCs running DOS to access NFS. A better option than IPX if you had a mostly Unix environment at the time.
I always think of NFS as a great example of something that wasn't perfect, but it solved 95% of the most critical part of the problem at the time.

NFS: the early years

Posted Jun 21, 2022 20:18 UTC (Tue) by dublin (guest, #114125) [Link]

Sun's TOPS protocol (which ran on Suns, PCs and Macs) was also quite good, but was perceived as a bit expensive and never really got much traction outside 1990-ish Sun shops... IIRC, it came not only with really nice cross platform file sharing, but even email that could be gatewayed to Internet mail - pretty serious stuff in the days of Novell everywhere.

FWIW, Novell's NCP is perhaps the best file-sharing protocol architecture I've ever encountered. Its protocol design was so latency-tolerant that I ran it well enough over 56K geosync INMARSAT satellite connections in 1993 to be quite usable - a feat impossible with NFS - and believe me, I tried! (The application was emergency spill reponse for the oil companies - the entire project brief was more or less two sentences: Within 15 minutes of hitting a spill site anywhere in the world, we must have voice, data, and filesharing connectivity back to Houston. Oh, and your budget is 1/10th of what the satellite equipment vendor's experts want, everything must be checkable as luggage, and are no skilled IT people will be there - so it just has to work. We nailed it.)

NFS: the early years

Posted Jun 21, 2022 23:44 UTC (Tue) by jwarnica (subscriber, #27492) [Link] (7 responses)

NFS is proof-of-failure #1 for the basic theory of RPC, that is, that without thinking about it you can put a C level (that is, asm level) subroutine call across the network without any further thought.

This put "the network is the computer" behind schedule by a decade, and triggered people to fix the idea by double downing on the idea of ignoring the network in the form of CORBA, which contributed another decade of schedule slip.

Except as examples of what not to do, good riddance.

I'm sorry for the sysadmins who had to deal with this. I'm even more sorry for those who let it escape the Valley.

Remote filesystems: a lost cause?

Posted Jul 9, 2022 21:56 UTC (Sat) by marcH (subscriber, #57642) [Link] (6 responses)

Indeed the https://en.wikipedia.org/wiki/Fallacies_of_distributed_co... come to mind.

Does _any_ network filesystem "suck less"? The entire concept seems like a lost cause. Sure, a samba share is marginally more convenient than rsync or wget stuff.tgz for some occasional recursive _download_ but that's still just glorified FTP. Is there any actual and successful use case where multiple clients are actively working on the same, shared tree at the same time?

Just for fun try to time "git status" over a network file system and compare with a local file system. The thing that relies a lot on "state" is caching and caching is what has made computers faster and faster. Is high-performance caching compatible with sharing access over the network? It does not look like it.

I'm more and more convinced that the future belongs to application-specific protocols. The poster child is of course git which is capable of synchronizing Gigabytes while transferring kilobytes. Of course there is a price to pay: the synchronization requires explicit user action, remote resources do not magically appear like their local. But that's just acknowledging Network Fallacies and reality.

PS: NUMA and CXL seem similarly... "ambitious"

Remote filesystems: a lost cause?

Posted Jul 9, 2022 22:57 UTC (Sat) by atnot (subscriber, #124910) [Link]

> Does _any_ network filesystem "suck less"? The entire concept seems like a lost cause.

I think it gets worse, because all filesystems are secretly network filesystems. It doesn't really matter that much whether the two computers speak to each other over ethernet, SCSI or PCIe, you just notice it less with lower latency. Basing persistent state storage on what amounts to a shared, globally read-writable single address space without any transaction or really, any well defined concurrency semantics is, I think, a fundamental dead end. See also the symlink discussion. So much effort is thrown into pretending files work like they did on a pdp11 at a very deep level, and I think it's really something that needs to be moved beyond. Git is actually a pretty good example I never thought of there, in the way it sort of emulates a traditional filesystem structure on top of a content-addressed key-value blob store.

Remote filesystems: a lost cause?

Posted Jul 10, 2022 6:04 UTC (Sun) by donald.buczek (subscriber, #112892) [Link] (4 responses)

NFS doesn't suck.

We have ~400 Linux systems ranging from workstations to really big compute servers using a global namespace supported by autofs and nfs for /home and all places where our scientific data goes (/project). We even have software packages installed and used over nfs ( /pkg ) because, to make our data processing reproducible, we keep every version of every library or framework and that would be too much to keep on every system. Only base Linux system, copies of the current/default version of some highly used packages (e.g. python) and some scratch space are on local disks.

We have ~500 active users and they usually don't complain about perfomance or responsiveness of their desktop.

We do this for decades and it used to be a pain, but with the progress of NFS and the steps we've taken here (e.g. replace NIS), everything runs really smoothly these days! If it doesn't, some user acted against the rules (e.g. hammering a fileserver from many distributed cluster jobs or trying to up- or download a few terrabytes without a speed limit) and we have a talk.

Oh, and the same namespace ( /home, /project ) is accessed over CIFS by our Windows and macOS workstation. Plus the files are accessed locally on the fileservers, too. For example, we typically would put daemons and cronjobs, which don't need much ram or cpu, on the fileserver where their projects are local, to reduce network dependency. Also users are allowed to log in to the HPC node where their job is executing to monitor or debug its behavior.

And (related to other discussions) we couldn't live without symlinks. And all these systems are multi-user, security border between users is basic requirement.

NFS, symlinks and multiuser are not dead, "Nowadays things are done like this and that" might be true in a statistical sense but should not be generalized.

Remote filesystems: a lost cause?

Posted Jul 10, 2022 7:40 UTC (Sun) by marcH (subscriber, #57642) [Link]

Thanks! So mostly reads, no read/write concurrency and rarely in the critical path? Then sure, as long as you have the tools to monitor and catch "deviant" users then why not. Not exactly plug and play / "consumer"-friendly though and quite far from the illusion of a local disk.

Remote filesystems: a lost cause?

Posted Jul 18, 2022 16:08 UTC (Mon) by ajmacleod (guest, #1729) [Link] (2 responses)

Out of interest, what did you replace NIS with? I likewise have found NFS indispensable but NIS has hung around a lot longer than it should have in some use cases because for them it just works and is very simple (yes, too simple.)

Remote filesystems: a lost cause?

Posted Jul 18, 2022 20:15 UTC (Mon) by donald.buczek (subscriber, #112892) [Link] (1 responses)

We use self-developed tools to broadcast /etc files like passwd, group, exports or autofs-maps to all systems [1]. For shadow, we replaced nis with a self-develooped nss service, which queries a central server via tls [2].

[1]: https://github.molgen.mpg.de/mariux64/mxtools/blob/master...

[2]: https://github.molgen.mpg.de/mariux64/mxshadow

Remote filesystems: a lost cause?

Posted Jul 22, 2022 15:48 UTC (Fri) by ajmacleod (guest, #1729) [Link]

That's very interesting, thank you for the reply.

NFS: the early years

Posted Jun 23, 2022 3:04 UTC (Thu) by droundy (subscriber, #4559) [Link]

> If an application opens the only file in some directory, unlinks the file, then tries to remove the directory, that last step will fail as the directory is not empty but contains an obscure .nfs-XX name — unless the client moves the obscure name into a parent or converts the RMDIR into another rename operation. In practice this sequence of operations is so rare that NFS clients don't bother to make it work.

Wow, that is an amazing level of sarcasm! If there was one thing that is memorable about NFS, it's the nuisance of perpetual failing `rm -rf` and then trying to track down the darn progress holding a file open.

NFS: the early years

Posted Jun 24, 2022 13:26 UTC (Fri) by bfields (subscriber, #19510) [Link] (1 responses)

A lot of the original papers are readable and interesting, by the way. Two off the top of my head:

http://www.cs.siue.edu/~icrk/514_resources/papers/sandber... also introduces the VFS: "In order to build the NFS into the UNIX 4.2 kernel in a user transparent way, we decided to add a new interface to the kernel which separates generic filesystem operations from specific filesystem implementations."

https://archive.org/details/1989-conference-proceedings-w... starting p. 53 (there has to be a better link) introduces the DRC; it's interesting that the correctness improvements are described as a "beneficial side effect", with the main purpose increased bandwidth.

NFS: the early years

Posted Jun 25, 2022 11:17 UTC (Sat) by neilbrown (subscriber, #359) [Link]

The article at that second link contains the gem:

> duplicate request processing can result in incorrect results (affectionately called ‘filesystem corruption’ by those not in a filesystem development group).

That's the sort of witticism that would be right at home here in lwn.net.
It goes on to describe a problem which I've heard of before described as "Nulls Frequently Substituted".