NFS: the new millennium

June 24, 2022

This article was contributed by Neil Brown

The network filesystem (NFS) protocol has been with us for nearly 40 years. While defined initially as a stateless protocol, NFS implementations have always had to manage state, and that need has been increasingly built into the protocol over successive revisions. The early days of NFS were discussed, with a focus on state management, in the first part of this series. This article completes the job with a look at the evolution of NFS since, approximately, the beginning of this millennium.

The early days of NFS were controlled by Sun Microsystems, the originator of the NFS protocol and author of both the specification and implementation. As the new millennium approached, interest in NFS increased and independent implementations appeared. Of particular relevance here are the implementations in the Linux kernel that drew my attention — particularly the server implementation — and the Filer appliance produced and sold by Network Appliance (NetApp). The community's interest in NFS extended as far as a desire to have more say in the further development of the protocol. I do not know what negotiations happened, but happen they did, and one clear outcome is documented for us in RFC 2339, wherein Sun Microsystems agreed to assign to The Internet Society certain rights concerning the development of version 4 (and beyond) of NFS, providing this development achieved "Proposed Standard" status within 24 months, meaning by early 2000. That particular deadline went wooshing past and was extended. We got a "Proposed Standard" in late 2000 with RFC 3010, which was revised for RFC 3530 in April 2003 and again for RFC 7530 in March 2015.

The IETF working group tasked with this development was mostly driven by Sun and NetApp; it had two co-chairs, one from each company, and most of the authors listed on the RFC are from these companies. My memory of these discussions is that there was quite a long list of perceived needs but no shared vision of a coherent whole. The impending (and changing) deadline drove a desire to get something out, even if it wasn't perfect. Consequently NFSv4 — particularly this first attempt which is now referred to as NFSv4.0 — felt to me like useful pieces that had been glued together, rather than carefully woven into a fabric; the elegance that I could see in NFSv2 was gone.

NFSv4 brought all of the various protocols that we saw before into one single protocol with one single specification. While the "tools" approach can be extremely powerful and is great for building prototypes, there usually comes a time when the strength provided by integration outweighs the agility provided by discrete components — for NFS, version 4 was that time. Support for access-control lists (ACLs), quota-tracking, security negotiation, namespace management, byte-range locking, and status management were all brought together, often in quite different forms than in their original separate incarnations. Of all these many changes, I want to focus on just two areas that have implications for the management of shared state.

The change attribute and delegations for cache consistency

As we have already seen, timestamps are not ideal for tracking when file content has changed on the server. Even if the client knows that timestamps are reported with some high precision, it cannot know how many write requests can be processed in that unit of time. So a timestamp is, at best, a useful hint. The designers of NFSv4 wanted something better, so they introduced the "change" attribute, sometimes called a "changeid". This is a 64-bit number that must be increased whenever the object (such as a file or directory) changes in any way.
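As a rough illustration of how a client might use the change attribute, the sketch below keeps the changeid alongside cached data and treats the cache as valid only while the server reports the same value. The names `CachedFile` and `cache_is_valid` are invented for this example and do not come from any real NFS client.

```python
from dataclasses import dataclass


@dataclass
class CachedFile:
    """Client-side cache entry (hypothetical structure for illustration)."""
    data: bytes
    changeid: int  # value of the "change" attribute when data was fetched


def cache_is_valid(cached: CachedFile, server_changeid: int) -> bool:
    # The attribute only promises to change when the object changes,
    # so the one safe test is equality with the remembered value.
    return cached.changeid == server_changeid


if __name__ == "__main__":
    entry = CachedFile(data=b"hello", changeid=41)
    print(cache_is_valid(entry, 41))
    print(cache_is_valid(entry, 42))
```

Unlike a timestamp comparison, this test does not depend on how many writes the server can process within one clock tick.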

This changeid is a mandatory aspect of the protocol so, for several years, the Linux NFS server was noncompliant since no Linux filesystem could provide a conforming changeid. This was fixed in Linux 2.6.31, but only for ext4, with XFS following in v3.11. For filesystems that don't maintain an i_version (the per-inode version counter that the Linux server reports as the changeid), the Linux NFS server lies and uses the inode's change time instead, which may not provide the same guarantees.

The wcc (weak cache consistency) attribute information that NFSv3 introduced is preserved in NFSv4, but only for directories, though it uses the changeid rather than the change time and, strangely, is not provided for the SETATTR operation. Wcc attributes are not provided for files. It is still possible to get "before" and "after" attributes for WRITE requests, as every NFSv4 request is a COMPOUND procedure comprising a sequence of basic operations, and this can contain the sequence GETATTR, WRITE, GETATTR. However, these are not guaranteed to be performed atomically, so some other client could also perform a WRITE between the two GETATTR calls. If the difference between the "before" and "after" changeids is precisely one, it should be safe to assume no intervening changes, but the protocol specification doesn't make this explicit. Instead, NFSv4 provides delegations.
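The GETATTR, WRITE, GETATTR sequence and its lack of atomicity can be modelled with a toy in-memory file. This is only a sketch: `ToyFile` and `compound_write` are hypothetical names, and the assumption that the server bumps the changeid by exactly one per change is the very thing the specification does not promise.

```python
class ToyFile:
    """In-memory stand-in for a file on the server (illustrative only)."""

    def __init__(self):
        self.data = bytearray()
        self.changeid = 0

    def getattr(self) -> int:
        return self.changeid

    def write(self, offset: int, payload: bytes) -> None:
        end = offset + len(payload)
        if len(self.data) < end:
            self.data.extend(bytes(end - len(self.data)))
        self.data[offset:end] = payload
        self.changeid += 1  # assumption: exactly +1 per change


def compound_write(f: ToyFile, offset: int, payload: bytes, interloper=None):
    """Model a COMPOUND of GETATTR, WRITE, GETATTR.

    `interloper`, if given, is another client's operation that slips in
    between the steps, since the sequence is not atomic.
    """
    before = f.getattr()
    if interloper is not None:
        interloper(f)
    f.write(offset, payload)
    after = f.getattr()
    return before, after


if __name__ == "__main__":
    f = ToyFile()
    before, after = compound_write(f, 0, b"abc")
    print("undisturbed delta:", after - before)
    before, after = compound_write(f, 3, b"def",
                                   interloper=lambda g: g.write(0, b"X"))
    print("disturbed delta:", after - before)
```

A delta of one suggests that the client's own WRITE was the only change; anything larger means that its cached view of the file can no longer be trusted.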

A delegation (or more specifically a "read delegation") is a promise made by the server to the client that no other client (or local application on the server) can write to the file. The server will proactively recall the delegation before any conflicting open request will be allowed to complete. While a client holds a delegation, it can be sure that all changes made on the server were made by itself, so it doesn't even need to check the changeid to ensure that its cache remains accurate. This provides a strong cache-coherency guarantee.

So, providing that the server offers a read delegation whenever a client opens a file that no one else is writing to, caching is easy. Exactly when the server should do that is not entirely clear; there is a cost in offering delegations since they need to be recalled when the file is opened for write access, and this can delay the open request.
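The grant-and-recall behavior for read delegations can be sketched for a single file as follows. `DelegationServer` is an invented name, and the `recalled` list merely stands in for the callback that a real server would send to the delegation holder.

```python
class DelegationServer:
    """Toy model of read delegations on one file (illustrative only)."""

    def __init__(self):
        self.readers = set()   # clients currently holding a read delegation
        self.writers = set()   # clients that have the file open for write
        self.recalled = []     # record of recalls, in the order issued

    def open_read(self, client: str) -> bool:
        """Open for read; return True if a read delegation was granted."""
        if self.writers - {client}:
            return False       # someone else is writing: no promise possible
        self.readers.add(client)
        return True

    def open_write(self, client: str) -> None:
        # Recall every conflicting delegation before the open may complete.
        # sorted() returns a list, so the set can be modified in the loop.
        for holder in sorted(self.readers - {client}):
            self.recalled.append(holder)
            self.readers.discard(holder)
        self.writers.add(client)


if __name__ == "__main__":
    server = DelegationServer()
    print(server.open_read("A"))
    server.open_write("B")
    print(server.recalled)
    print(server.open_read("C"))
```

The loop in `open_write()` is where the cost mentioned above shows up: the write open cannot finish until each holder has given its delegation back.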

Note that the server can also offer a "write delegation" if no other clients or applications have the file open for either read or write. It is not clear to me how useful this really is. The most obvious theoretical benefits are that writes do not need to be flushed before a file is closed, and that byte-range locks do not need to be notified to the server. Whether these benefits are practical is less obvious. The Linux kernel's NFS server never offers write delegations.

clientids, stateids, seqids, and client state management

As mentioned, NFSv4 integrates byte-range locking and thus needs the server to track the state of all locks held by each client; the server also needs to know when the client restarts. The design of this functionality is all new in NFSv4 (and somewhat improved in NFSv4.1).

The biggest difference in usability is that, rather than a client reporting that it was rebooted (as the STATMON protocol allows), the NFSv4 client needs to regularly report that it hasn't rebooted. If the server hasn't heard from the client for the "lease time" (typically 90 seconds in Linux) it is permitted to assume the client has disappeared, and must not prevent another client from performing an access that would be blocked by the state held by the first client. So clients that are not otherwise active need to at least send an "I'm still here" message (RENEW in NFSv4.0) often enough so that, even with possible network delays, the server will never go 90 seconds (or whatever is configured) without seeing something.

This all means that, if a machine crashes without rebooting, locked files do not remain locked indefinitely. Conversely, it means that, if a router failure or cabling problem causes network traffic to be interrupted for too long (known as a "network partition"), locks can be lost even while the client is still up and, when the network is fixed, the client will not be able to proceed. On Linux, the client application will receive an EIO error for any attempt to access a file descriptor on which it held a lock that has since been lost.
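The lease rule reduces to a small amount of bookkeeping on the server. The sketch below is an assumption-laden model rather than anything taken from the Linux server: `LeaseTable` is a made-up name, and time is passed in explicitly so that the behavior is easy to follow.

```python
LEASE_TIME = 90.0  # seconds; the typical Linux value mentioned in the text


class LeaseTable:
    """Tracks when each client was last heard from (illustrative only)."""

    def __init__(self, lease_time: float = LEASE_TIME):
        self.lease_time = lease_time
        self.last_seen = {}  # clientid -> time of the most recent request

    def renew(self, clientid: int, now: float) -> None:
        # Any request from the client, including an explicit RENEW,
        # counts as proof that it has not gone away.
        self.last_seen[clientid] = now

    def expired(self, clientid: int, now: float) -> bool:
        seen = self.last_seen.get(clientid)
        return seen is None or now - seen > self.lease_time


if __name__ == "__main__":
    leases = LeaseTable()
    leases.renew(1, now=0.0)
    print(leases.expired(1, now=60.0))
    print(leases.expired(1, now=120.0))
```

A crashed client and a partitioned one look identical from the server's side, since in both cases `renew()` simply stops being called.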

All of this could have been achieved relatively simply. For example, each request could contain a timestamp of when the client booted. The server would remember this against the IP address of the client and, if it ever changed, or if nothing were seen for a period of at least the lease time, the server could discard any state that the client previously owned. Correspondingly, each reply from the server could contain a timestamp so that server reboots could be detected by the client. However, this would have been too simplistic. The designers of NFSv4 had considerable experience with NFSv3 and with the Network Lock Manager to guide them, and they decided that there was sufficient justification for some more complexity.

Clientids and the client identifier

Depending on a client's IP address is not really a good idea. Partly, this is because running multiple clients in user space, or using network address translation (NAT), can result in several clients having the same IP address, and partly because mobile devices can change their IP address. The latter wasn't a big concern during NFSv4.0 development (though 4.1 handles it better), but user-space clients and the problems of NAT certainly were. Different clients could be identified by their port number, but if a client connecting through a NAT gateway lost its connection and had to re-establish it, the new connection could use a different port, thus appearing to be a different client. So NFSv4 requires each client to generate a universally unique client identifier (which can be up to 1KB in size), combine that with an instance identifier (like a boot timestamp), and submit both to the server via the SETCLIENTID request. The server responds with a 64-bit clientid number that can then be used in any request in which the client needs to identify itself.
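A heavily simplified model of the SETCLIENTID exchange might look like the following. It leaves out the confirmation step and all security checks of the real protocol; `ClientRegistry` and its fields are hypothetical names chosen for this sketch.

```python
import itertools


class ClientRegistry:
    """Maps (client identifier, instance verifier) to a short clientid."""

    MAX_IDENTIFIER = 1024  # the text says identifiers can be up to 1KB

    def __init__(self):
        self._next = itertools.count(1)
        self._known = {}      # identifier -> (verifier, clientid)
        self.discarded = []   # clientids whose state was thrown away

    def setclientid(self, identifier: bytes, verifier: bytes) -> int:
        if len(identifier) > self.MAX_IDENTIFIER:
            raise ValueError("client identifier too long")
        known = self._known.get(identifier)
        if known is not None and known[0] == verifier:
            # Same instance reconnecting, perhaps from a new port behind NAT.
            return known[1]
        if known is not None:
            # Same identifier, new verifier: the client restarted,
            # so the state owned by the old instance can be discarded.
            self.discarded.append(known[1])
        clientid = next(self._next)
        self._known[identifier] = (verifier, clientid)
        return clientid


if __name__ == "__main__":
    registry = ClientRegistry()
    first = registry.setclientid(b"host-a", b"boot-1")
    again = registry.setclientid(b"host-a", b"boot-1")
    rebooted = registry.setclientid(b"host-a", b"boot-2")
    print(first == again, first == rebooted, registry.discarded)
```

Nothing in this lookup depends on the source address or port, which is what makes the scheme robust against NAT and multiple user-space clients on one host.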

This client identifier is the one recently discussed at the Linux Storage, Filesystem, and Memory-management Summit (LSFMM). By default, Linux uses the host name as the main source of uniqueness. This works well enough on private networks when hosts are well configured. Problems arise, though, in containers that do not configure a unique host name but which do create a new network namespace and, as a result, get an independent NFS client instance. Problems may also arise in situations where clients in different administrative domains (and hence with possible host-name overlap) access a shared server.

stateids and seqid — per-file state

Another shortcoming with the simple approach is that it collects all of the state together without clear differentiation in either space or time.

Differentiation in space means that the state of each file can be managed separately. In particular, if the server hasn't heard from the client for the lease time, it must discard any state for which there is a conflicting request, but it isn't required to discard state which is uncontested. So, when the client regains contact, it might have lost access to some files but not others. This requires that it be possible to identify different elements of state so that the server can tell the client which have been lost and which are still valid. The Linux NFS server wasn't able to realize the full benefits of this until recently, when the "Courteous Server" functionality was merged.
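The idea of discarding only contested state can be sketched with one lock per file. This is not how the Linux "Courteous Server" is implemented; `CourteousLocks` is an invented toy that shows only the decision rule described above.

```python
class CourteousLocks:
    """Per-file lock state that outlives a lease unless contested."""

    def __init__(self, lease_time: float = 90.0):
        self.lease_time = lease_time
        self.last_seen = {}  # client -> time of last contact
        self.holder = {}     # file name -> client holding the lock

    def contact(self, client: str, now: float) -> None:
        self.last_seen[client] = now

    def lock(self, client: str, name: str, now: float) -> bool:
        self.contact(client, now)
        current = self.holder.get(name)
        if current is not None and current != client:
            if now - self.last_seen[current] <= self.lease_time:
                return False  # the holder's lease is still valid: conflict
            # Lease expired and this state is contested, so it alone is
            # discarded. Other state of the same client is left untouched.
        self.holder[name] = client
        return True


if __name__ == "__main__":
    locks = CourteousLocks()
    locks.lock("A", "f1", now=0.0)
    locks.lock("A", "f2", now=0.0)
    print(locks.lock("B", "f1", now=50.0))
    print(locks.lock("B", "f1", now=100.0))
    print(locks.holder)
```

After the partition in this example, client A has lost `f1` but still holds `f2`, which is why the server must be able to tell the client which individual pieces of state remain valid.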

This finer-grained state management is largely realized by the NFSv4 OPEN request. The very existence of OPEN is a departure from the approach of NFSv3 and is only possible because the server can track the state of the client and, in particular, which files it has open. An OPEN request can indicate which of READ or WRITE access is required, or both, and it can ask the server to DENY READ or WRITE access to all other clients. This denial is anathema for POSIX interfaces, but is needed for interoperability with other APIs. The Linux server supports such denial between NFSv4 clients, but doesn't allow local access on the server, or NFSv3 access, to be blocked; in addition, existing local opens do not cause a conflicting NFSv4 OPEN to fail.

The NFSv4 OPEN request also indicates an "open owner", which is formed from an active clientid combined with an arbitrary label. The Linux client generates a different label for each user so, if a single user opens a file multiple times concurrently, the server will just see the file opened once. The OPEN request returns a "stateid" which represents the open file and should be used in any READ/WRITE requests. Each such stateid is distinct and can be invalidated by the server (in exceptional circumstances) without affecting any other.

Subsequent OPEN or OPEN_DOWNGRADE requests can change both the access flags and the DENY flags associated with the file (for the relevant open-owner). Each of these yields a new stateid, though it is not entirely new. Each stateid has two parts: the "seqid", which increments on each change, and an "other" part, which doesn't. This allows the client to unambiguously determine whether a second OPEN request opened the same file as the first — as the "other" parts of the stateids will match. It also makes it clear in what order various changes were performed on the server, so the client can be sure it remains synchronized with the server. Thus the "seqid" gives a time dimension to the states.

This can be particularly relevant when a CLOSE request is sent at much the same time as an OPEN request for the same file. The client may not know it is operating on the same file, due to the presence of hard links, so this cannot be seen as incorrect behavior by the client. If the server performs the CLOSE first, the OPEN will then likely succeed and all will be well. If the server performs the OPEN first it will increment the seqid of the state for that file and, when it sees the CLOSE, it will reject it because the seqid is old. The file will stay open in either case, which is what the client wanted (since it opened the file twice but only closed it once).
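The race between CLOSE and OPEN can be walked through with a minimal model of a stateid. `StateId` and `OpenState` are hypothetical names; the real protocol tracks considerably more than a single flag per file and open owner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateId:
    seqid: int    # increments on every change to this state
    other: bytes  # identifies the state itself and never changes


class OpenState:
    """Server-side open state for one file and one open owner."""

    def __init__(self, other: bytes):
        self.current = StateId(0, other)
        self.is_open = False

    def open(self) -> StateId:
        self.current = StateId(self.current.seqid + 1, self.current.other)
        self.is_open = True
        return self.current

    def close(self, stateid: StateId) -> bool:
        if stateid != self.current:
            return False  # old seqid: the CLOSE is rejected
        self.is_open = False
        return True


if __name__ == "__main__":
    state = OpenState(other=b"file-42")
    first = state.open()
    # The client sends CLOSE(first) and a second OPEN at about the same
    # time; suppose the server happens to process the OPEN first.
    second = state.open()
    print(first.other == second.other)   # same underlying file
    print(state.close(first), state.is_open)
```

Matching "other" fields tell the client that both opens refer to the same file, and the rejected CLOSE leaves the file open as intended.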

There are similar stateids, including seqids, for LOCK and UNLOCK requests, and these have a corresponding "lock owner". The lock owner will correspond to a process for POSIX locks, or an open file descriptor for OFD locks.

NFSv4.1 — a step forward

The NFSv4 working group went to some trouble to allow for future versions and to describe the sorts of things that might change. This was not in vain and, in January 2010, NFSv4.1 was described in RFC 5661 (with an update in RFC 8881 about 10 years later).

V4.1 contains lots of little improvements based on several years of experience with what had been nearly an entirely new protocol. Many people are suspicious of "dot-zero" releases and, with NFSv4, there is some justification for this. NFSv4.0 does work, and works quite well, but 4.1 works better. Most of the little things don't rate a mention here, but the decision to exclude UDP as a supported transport is interesting because it is user-visible. UDP has no congestion control, so NFS doesn't work well over it in general. Of course, UDP can still be used as long as some other protocol that manages congestion, like QUIC, is layered in between. V4.1 also allows the server to tell the client that it is safe to unlink files that are still open so the clumsy renaming-on-remove can be avoided.

Possibly the biggest user-visible change in NFSv4.1 is the addition of "pNFS" — parallel NFS. This appears to be a marketing term, as it is easy to say but only loosely captures the important changes. With NFSv4.1, it becomes possible to offload I/O requests (READ and WRITE) to some other protocol, which could well communicate over a different medium to a different server. This allows a single NFS client to communicate with a cluster filesystem without having to channel all the requests through a single IP address. This is certainly an extra level of parallelism, but it is not fair to say that earlier NFS did not allow any parallelism. Even NFSv2 could have multiple outstanding requests that the server could be handling concurrently.

These offload protocols, and how they integrate, are described in separate RFCs. There is support for a block-access protocol (RFC 5663) or the OSD object storage protocol (RFC 5664), which might run over iSCSI, for example, or a "flexible file" based approach using NFSv3 or later (RFC 8435) for the data access. This is only of particular interest here because there is a new sort of state that needs to be managed — there are objects called "layouts".

A layout describes how to access part of a file using some other protocol. Each layout has a stateid which can be allocated and then relinquished, so the server always knows which layouts might still be in use. This is important if the server needs, for example, to migrate a file to a different storage location — it probably shouldn't do that while a client thinks it knows what block location it can use to access that file.

Sessions and a reliable DRC

From our perspective of managing state, the biggest change in NFSv4.1 is that the protocol finally allows for a completely reliable duplicate request cache. As described in part 1, this is needed for the rare case when a request or reply might have been lost and the client has to resend a request. In versions up to and including NFSv4.0, the server would just make a best-effort guess at which requests and replies might be worth remembering for a while. In NFSv4.1, it can know.

An NFSv4.1 session is a new element of state that is forked off from the global clientid by the CREATE_SESSION request. Given a clientid and a sequence number, a new sessionid is allocated that has a collection of different attributes associated with it, including a maximum number of concurrent requests. The server will allocate this many "slots" in its duplicate request cache, and the client will assign a slot number less than this number to each request. The client promises never to reuse a slot number until it has seen a reply to the previous request with that number, and the server promises to remember the replies to the most recent request in each slot, if the client asked it to.
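The slot mechanism can be sketched as a fixed-size table holding the last sequence number and reply for each slot. This is a simplified model under stated assumptions: `SessionDRC` is an invented name, per-slot sequence numbers are assumed to start at one, and errors are reported with a plain `ValueError` rather than protocol status codes.

```python
class SessionDRC:
    """Per-session duplicate request cache with a fixed number of slots."""

    def __init__(self, max_slots: int):
        # Each slot remembers (last sequence number seen, cached reply).
        self.slots = [(0, None)] * max_slots

    def handle(self, slot: int, seq: int, execute):
        if not 0 <= slot < len(self.slots):
            raise ValueError("slot number out of range")
        last_seq, cached = self.slots[slot]
        if seq == last_seq and last_seq > 0:
            return cached  # retransmission: replay without re-executing
        if seq != last_seq + 1:
            raise ValueError("misordered sequence number")
        reply = execute()
        self.slots[slot] = (seq, reply)
        return reply


if __name__ == "__main__":
    calls = []

    def rename():
        calls.append("RENAME")
        return "ok #%d" % len(calls)

    drc = SessionDRC(max_slots=2)
    print(drc.handle(0, 1, rename))
    print(drc.handle(0, 1, rename))  # resent request gets the cached reply
    print(len(calls))
```

Because the client never reuses a slot before seeing the reply, one cached entry per slot is all that the server needs to keep.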

The client can even ask the server to store the cached replies in stable storage so that they survive a reboot. If the server implements this functionality and agrees to provide it, then the result is as close to perfect exactly-once semantics as it is possible to get.

There is, superficially, an imbalance here. The requests that are most common (READ, WRITE, GETATTR, ACCESS, LOOKUP) are idempotent and do not need to be cached, yet the server must reserve cache space for each slot, thus either wasting cache space or unnecessarily limiting concurrency. This is not a problem in practice, since the protocol allows the client to create multiple sessions with different parameters. It could create one with a large slot count and a maximum cached size of zero, and use this for all idempotent requests. It could also create a session with a more modest slot count and much larger maximum cached size, and use this for requests that mustn't be repeated.
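The two-session arrangement might be expressed on the client side as below. The slot counts and cache sizes are made-up numbers, and `SessionParams`, `BULK`, and `CAREFUL` are hypothetical names used only to show the routing decision.

```python
from dataclasses import dataclass

# The operations that the text lists as common and idempotent.
IDEMPOTENT = {"READ", "WRITE", "GETATTR", "ACCESS", "LOOKUP"}


@dataclass(frozen=True)
class SessionParams:
    name: str
    slots: int             # maximum number of concurrent requests
    max_cached_reply: int  # bytes the server caches per slot


# Illustrative values only; real clients negotiate these with the server.
BULK = SessionParams("bulk", slots=256, max_cached_reply=0)
CAREFUL = SessionParams("careful", slots=16, max_cached_reply=4096)


def pick_session(op: str) -> SessionParams:
    """Route idempotent operations to the wide, cache-free session."""
    return BULK if op in IDEMPOTENT else CAREFUL


def reserved_cache_bytes(session: SessionParams) -> int:
    """Cache space the server must set aside for this session."""
    return session.slots * session.max_cached_reply


if __name__ == "__main__":
    for op in ("READ", "RENAME"):
        chosen = pick_session(op)
        print(op, chosen.name, reserved_cache_bytes(chosen))
```

With this split, high concurrency costs the server no cache space, while the requests that must not be repeated still get full replay protection.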

Directory delegations

A third new sort of state in NFSv4.1 — accompanying layouts and sessions — is directory delegations. In NFSv4.0, it is possible to open files, but all interactions with directories remained much as they were in NFSv3, where each operation was discrete and there was no ongoing state. In v4.1, we get something a bit like an OPEN of a directory with GET_DIR_DELEGATION. This request doesn't contain an explicit clientid; instead, it uses the clientid associated with the session that the request is part of. The delegation is essentially a standing request that the client be informed of any changes made to the directory by other clients. Depending on what specifics are negotiated, this might involve the server saying "something has changed, you no longer have the delegation", or it may provide more fine-grained details of what, specifically, has changed. This allows for strong file-name cache coherence and even allows client-side applications to receive notifications of changes. This functionality is not implemented by Linux, either for the server or the client.

NFSv4.2 - meeting customer needs

The latest version of NFS is v4.2, described in RFC 7862 (Nov 2016). That document describes the goal of this revision as being "to take common local file system features that have not been available through earlier versions of NFS and to offer them remotely."

In contrast to NFSv4.1, which primarily provided existing functionality in a more efficient or reliable manner, v4.2 provides genuinely new functionality — at least new to NFS. These include support for the SEEK_DATA and SEEK_HOLE functionality of lseek(), which allows sparse files to be managed efficiently, support of posix_fallocate() to explicitly allocate and deallocate space in a file, support for posix_fadvise(), so the client can tell the server what sort of I/O patterns to optimize for, and support for reflinks, which allow content to be shared between files without copying. None of these add any new state to the protocol, so they aren't directly in our area of focus for this discussion.

One new element of functionality that does involve a new form of state is server-side copy. This functionality can support copy_file_range() and can copy between two files on one server or — if the servers cooperate — between two files on different servers. Closely related functionality (using the WRITE_SAME operation) can initialize a file using a given pattern repeated as necessary.

When the client sends a COPY request, the server has the option of either performing the full copy before replying, or scheduling the operation asynchronously and returning immediately. In the latter case, a stateid is returned to the client which represents the ongoing action on the server. The client can use this stateid to query status (for the all-important progress bar) or to cancel the copy. The server can use the stateid to notify the client of completion or failure.
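The synchronous and asynchronous paths of COPY can be sketched as follows. `CopyServer` is a toy with invented names; the size threshold and the hand-driven `step()` method are assumptions standing in for whatever policy and background work a real server would use.

```python
import itertools


class CopyServer:
    """Toy model of server-side copy with an optional asynchronous path."""

    def __init__(self, sync_limit: int = 4):
        self.sync_limit = sync_limit   # hypothetical policy threshold
        self._ids = itertools.count(1)
        self.jobs = {}                 # copy stateid -> [done, total]

    def copy(self, nbytes: int):
        """Return (bytes_copied, stateid); stateid is None if synchronous."""
        if nbytes <= self.sync_limit:
            return nbytes, None        # small copy: finish before replying
        stateid = next(self._ids)
        self.jobs[stateid] = [0, nbytes]
        return 0, stateid              # reply at once, copy continues later

    def step(self, stateid: int, chunk: int) -> None:
        """Advance a background copy; called by hand in this sketch."""
        job = self.jobs[stateid]
        job[0] = min(job[1], job[0] + chunk)

    def status(self, stateid: int):
        """What a client would poll to draw its progress bar."""
        return tuple(self.jobs[stateid])

    def cancel(self, stateid: int) -> None:
        self.jobs.pop(stateid, None)


if __name__ == "__main__":
    server = CopyServer()
    print(server.copy(3))
    done, sid = server.copy(100)
    server.step(sid, 40)
    print(server.status(sid))
    server.cancel(sid)
```

The stateid returned for the asynchronous case is the only new state involved, and it is queried and released in the same way as the other stateids discussed earlier.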

The future for NFS

As yet, there are no hints of an NFSv4.3 in the foreseeable future, and it could be that no such version will ever be described. The v4.2 specification differs from its predecessors in that it is not a complete specification, but instead references the v4.1 specification and adds some extensions. The model for future extension, which is itself extended in RFC 8178 (July 2017), allows for further incremental changes to be added without requiring that the minor version number be changed. This has already been put to good use with RFC 8275 which allows the POSIX "umask" to be sent to the server during file creation, and RFC 8276 which adds support for extended attributes.

There are, of course, still ongoing development efforts around NFS. One of the more interesting areas involves describing how the NFS protocol can be usefully transported on something other than TCP or RDMA, which are the main two protocols in use today.

The Internet-Draft draft-cel-nfsv4-rpc-tls-pseudoflavors-02 looks at using ONC-RPC (the underlying RPC layer used by NFS) over TLS and, particularly, explores how the authentication provided by TLS can interact with the authentication requirements of NFS. Then, draft-cel-nfsv4-rpc-over-quicv1-00 builds on this to explore how NFS can be used over the QUIC protocol. The "cel" in the names of the drafts refers to Chuck Lever, the current maintainer of the Linux NFS server.

Other drafts and all the RFCs can be found on the web page for the IETF NFSv4 working group.

While NFS, much like Linux, does not seem to be finished yet, it does appear to have come to terms with being a stateful protocol, with all the state fitting into one coherent model. This is highlighted by the way that a totally new form of state — asynchronous copying on the server — was fit into the model in NFSv4.2 with no fuss. So it is likely the future improvements will focus elsewhere, perhaps following the recent moves toward improved security and support for new filesystem functionality. Who knows, maybe one day it will even make peace with Btrfs.

Index entries for this article
Kernel: Filesystems/NFS
GuestArticles: Brown, Neil



NFS: the new millennium

Posted Jun 25, 2022 0:39 UTC (Sat) by gerdesj (subscriber, #5446) [Link] (12 responses)

Great write up about something that I generally take for granted but by 'eck, NFS has shifted some bytes from A to B for me alone. It's nice to get some real insights into the background of these things from someone who knows what they are on about.

NFS for me has a killer feature when compared to SMB and I was only made aware of it by Veeam. This may or may not still be true: SMB will tell you it has received a lump of data whereas NFS will tell you when it has been committed to storage. Is this still true? When you are doing backups to NAS, which involves some pretty monstrous files, this is rather important.

Your 5TB backup is pretty useless with a hole in the middle of it due to a transient error that allowed a block or two to wander off and have a smoke behind the bikesheds and then sidling off for a night out in town instead of resting on disc like good data. OK, a decent backup app will have lots of failsafes available such as reading backups back against the source but ideally data should flow from A to B safely out of the box.

Given the sheer complexity of file access these days - pick a protocol, pick a medium, wedge in a VPN or two and shake it up and see what happens! It's a wonder that files seem to turn up as requested, in one piece.

NFS: the new millennium

Posted Jun 25, 2022 11:42 UTC (Sat) by willy (subscriber, #9762) [Link] (10 responses)

> SMB will tell you it has received a lump of data whereas NFS will tell you when it has been committed to storage.

NFS v2 only allowed for the WRITE command to be acknowledged when the data was on stable storage. NFS v3 turned it into a two-phase commit, allowing the server to tell the client both "I have received it" and "It is now stable".

Two phase commit isn't particularly useful to clients with a competent local cache. The latency of a write is almost irrelevant; you need to know when the write is durable, not just when it's visible to others. It was of great interest to non-Unix clients, though.

NFS: the new millennium

Posted Jun 25, 2022 16:14 UTC (Sat) by pbonzini (subscriber, #60935) [Link] (9 responses)

The write is only durable after fsync, write is not enough. So even on POSIX systems it's useful to know the moment when the write has been received.

NFS: the new millennium

Posted Jun 25, 2022 21:55 UTC (Sat) by willy (subscriber, #9762) [Link] (6 responses)

On a system with a competent local cache, the kernel returns success to the application after the write is copied to the local cache. NFSv3 offers no improvement here because we weren't even calling WRITE in this path. WRITE was called on writeback and on fsync, and the NFSv2 semantics were just fine for this usage.

NFS: the new millennium

Posted Jun 25, 2022 22:29 UTC (Sat) by neilbrown (subscriber, #359) [Link] (5 responses)

> the NFSv2 semantics were just fine for this usage.

Not really. The NFSv2 semantics require the server to sync each request individually, though maybe it could merge concurrent requests by delaying the sync for the first.
The NFSv3 semantics make it easy for the server to gather lots of writes in its cache and sync them all together.

NFS: the new millennium

Posted Jun 25, 2022 22:57 UTC (Sat) by willy (subscriber, #9762) [Link] (4 responses)

I suppose by "a competent local cache", I include the ability for the client to merge writes, which Linux will do.

NFS: the new millennium

Posted Jun 25, 2022 23:10 UTC (Sat) by neilbrown (subscriber, #359) [Link] (3 responses)

NFS limits the size of a single write request in that the server specifies a maximum that the client must honour.
Typically 1MB on Linux. There are costs in making it bigger, so just setting it to 1GB wouldn't work. When NFSv3 was first described, UDP was still common and 64KB is a hard limit there.
Is there no value in merging multiple 1MB writes on the server?

NFS: the new millennium

Posted Jun 26, 2022 12:15 UTC (Sun) by willy (subscriber, #9762) [Link] (2 responses)

I guess that's going to depend on the backend storage device. You hit diminishing returns pretty quickly above 1MB on any storage device I've ever worked on. Even a 5-disc RAID-5 with 256kB stripe width would handle 1MB writes with aplomb.

NFS: the new millennium

Posted Jun 26, 2022 21:05 UTC (Sun) by janfrode (guest, #244) [Link]

I've seen large throughput improvements by increasing rsize from 1 MB to 16 MB with Oracle dNFS client, towards NFS-Ganesha on top of GPFS (ESS). The ESS does 8+2p erasure coding, with 1 MB strip size, so 16 MB IOs is the optimal size. I don't think wsize was as important, since it could buffer up and do full block size writes on server side -- but this large rsize was needed to ensure full block (16MB) reads.

I don't think other NFS clients support larger than 1 MB rsize/wsize, which seems unfortunate for this kind of storage backend.

NFS: the new millennium

Posted Jun 27, 2022 23:31 UTC (Mon) by neilbrown (subscriber, #359) [Link]

> You hit diminishing returns pretty quickly above 1MB on any storage device I've ever worked on.

I suspect you are correct.
How about a database update that needs to perform lots of random writes before a sync is needed? Any single NFS write must be contiguous, so there must be several. Without the v3 COMMIT, every one of those random writes would need to be committed before the write could return.

Or what about writing lots of small files? The NFS client doesn't need to COMMIT until the writeback timer fires, or memory reclaim wants the cache back, or a sync() is requested. Would it not be more efficient to commit when there are lots of dirty files, rather than once for each file?

But the real point is that EVERY other layer in the storage stack has the two phases: write then sync/flush/commit. Why are you so sure that NFS doesn't benefit from also having the same two phases?

NFS: the new millennium

Posted Jul 10, 2022 8:58 UTC (Sun) by ssmith32 (subscriber, #72404) [Link] (1 responses)

Wasn't fsync caught "lying" a while ago? And disks as well, to inflate performance numbers?

Or have I not kept up? I guess SSDs can probably hit their numbers & still guarantee data is truly flushed from every volatile cache...

NFS: the new millennium

Posted Jul 10, 2022 12:04 UTC (Sun) by Wol (subscriber, #4433) [Link]

I don't think fsync was lying. It just assumed disks were telling the truth ...

Cheers,
Wol

NFS: the new millennium

Posted Jun 27, 2022 19:03 UTC (Mon) by jra (subscriber, #55261) [Link]

> SMB will tell you it has received a lump of data whereas NFS will tell you when it has been committed to storage. Is this still true?

No. SMB2 write has a flag that tells it not to return success until the write has hit stable storage.

NFS: the new millennium

Posted Jun 25, 2022 10:12 UTC (Sat) by wtarreau (subscriber, #51152) [Link]

Thanks for this detailed explanation and for the numerous links, Neil!

NFS: the new millennium

Posted Jun 25, 2022 11:36 UTC (Sat) by willy (subscriber, #9762) [Link]

A protocol which came between NFSv3 and v4 that certainly informed the design of v4 was WebNFS. It also integrated several protocols into one and discarded the use of Portmap.

NFS: the new millennium

Posted Jun 25, 2022 23:40 UTC (Sat) by zerolagtime (guest, #102835) [Link] (3 responses)

Thanks for a great look at the internals.
I am very surprised that you skipped over a still-standing limit on the number of gids that will be compared on a complex network.
16 gids is the limit (https://www.xkyle.com/solving-the-nfs-16-group-limit-prob...) and deferring permissions to an external server with the sec mount option seems to be the only way. Is this a protocol or implementation limit?
I’d love more discussion on the evolution of various security functionality, like xattrs, encryption algorithm, and the impact felt by stateful firewalls trying to accommodate port ranges.

NFS: the new millennium

Posted Jun 26, 2022 1:51 UTC (Sun) by neilbrown (subscriber, #359) [Link] (2 responses)

> I am very surprised that you skipped over .....

I skipped over a lot of things. If it didn't relate to state management, then I felt free to skip it unless it helped tell the story.

> (https://www.xkyle.com/solving-the-nfs-16-group-limit-prob...)

That link contains a section "The Best Solution Ever!: A New Option for the NFS Server". I'm glad my contribution there was appreciated :-)

> deferring permissions to an external server with the sec mount option seems to be the only way.

Yes. At least you do need to have the NFS server determine the list of gids. You don't need to use the sec= mount option. Using the default sec=sys works fine with the Linux NFS server if you use --manage-gids. The NetApp server has similar functionality. Others might too.

> Is this a protocol or implementation limit?

It is a limit in the RPC protocol.
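
For illustration, the limit comes from the AUTH_SYS credential layout in the ONC RPC specification (RFC 5531), which declares the supplementary group list as `gids<16>`. The sketch below encodes such a credential body in XDR by hand; the function names are hypothetical and this is not taken from any real NFS client.

```python
import struct

MAX_AUTH_SYS_GIDS = 16   # gids<16> in the AUTH_SYS definition
MAX_MACHINE_NAME = 255   # machinename<255> in the AUTH_SYS definition


def _xdr_opaque(data: bytes) -> bytes:
    """XDR variable-length data: 4-byte length, bytes, zero pad to 4-byte boundary."""
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad


def encode_auth_sys(stamp: int, machine: str, uid: int, gid: int, gids: list[int]) -> bytes:
    """Encode an AUTH_SYS credential body (hypothetical helper, for illustration)."""
    name = machine.encode("ascii")
    if len(name) > MAX_MACHINE_NAME:
        raise ValueError("machine name longer than 255 bytes")
    if len(gids) > MAX_AUTH_SYS_GIDS:
        # The wire format has no way to carry a 17th supplementary group.
        raise ValueError(f"AUTH_SYS carries at most {MAX_AUTH_SYS_GIDS} gids")
    body = struct.pack(">I", stamp) + _xdr_opaque(name)
    body += struct.pack(">II", uid, gid)
    body += struct.pack(">I", len(gids))
    body += b"".join(struct.pack(">I", g) for g in gids)
    return body
```

Because the client cannot send more than 16 groups, the fix has to be on the server side: the server ignores the list in the credential and looks up the user's groups itself, which is what --manage-gids does.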

> I’d love more discussion on the evolution of various security functionality

Not really my area of expertise, sorry.

NFS: the new millennium

Posted Aug 10, 2022 0:05 UTC (Wed) by seamus (guest, #159731) [Link] (1 responses)

Broken/incomplete(?) link

NFS: the new millennium

Posted Aug 12, 2022 2:44 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

See the GP for the full thing; LWN will elide links and the link quoted is indeed the LWN truncation.

NFS: the new millennium

Posted Jun 27, 2022 3:55 UTC (Mon) by lathiat (subscriber, #18567) [Link] (4 responses)

Neil, seems you forgot about the dangerous fragmentation alluded to in the last installment!

Great review though, loved it, thanks! Long time NFS user from web hosting environments since circa 2006.

NFS: the new millennium

Posted Jun 27, 2022 23:24 UTC (Mon) by neilbrown (subscriber, #359) [Link] (3 responses)

> Neil, seems you forgot about the dangerous fragmentation alluded to in the last installment!

I didn't forget - but there wasn't really anything to say. The fact that fragmentation (of the community, or of the protocol) might happen when separate people have separate needs should be obvious. A possible example of this is JMAP which risks fragmenting the IMAP community. Beyond the fact that I didn't like JMAP when I reviewed it 6 years ago, I don't know any details of this or whether it is a real fragmentation.

But NFS didn't fragment (to my knowledge). The community had regular Connectathon gatherings to ensure interoperability and to discuss issues. And importantly the IETF process was started that allowed anyone to contribute to NFSv4. So the IETF process, which I did mention, likely headed off any possible risk of protocol fragmentation.

NFS: the new millennium

Posted Jun 28, 2022 8:47 UTC (Tue) by Sesse (subscriber, #53779) [Link] (2 responses)

I assumed you were talking about UDP fragmentation!

NFS: the new millennium

Posted Jun 28, 2022 23:12 UTC (Tue) by neilbrown (subscriber, #359) [Link] (1 responses)

> I assumed you were talking about UDP fragmentation!

You aren't the only one (I think). See https://lwn.net/Articles/898842/

NFS: the new millennium

Posted Jun 30, 2022 4:13 UTC (Thu) by lathiat (subscriber, #18567) [Link]

Yep, that's what I thought too. It was quite a famous issue where it would re-assemble fragments incorrectly but then pass the checksum and corrupt the UDP data in flight. Partly to do with the NFS/RPC packet size versus MTU.
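
To put rough numbers on the packet size versus MTU point, here is a back-of-the-envelope sketch. It assumes IPv4 with a 20-byte header and no options, and the function name `fragment_count` is made up for illustration. The receiver has to match all of these fragments up using the 16-bit IPv4 Identification field.

```python
import math

IPV4_HEADER = 20  # bytes, assuming no IP options
UDP_HEADER = 8    # bytes


def fragment_count(udp_payload: int, mtu: int = 1500) -> int:
    """Number of IPv4 fragments needed to carry one UDP datagram."""
    ip_payload = udp_payload + UDP_HEADER
    # Fragment offsets are counted in 8-byte units, so every fragment
    # except the last carries a multiple of 8 bytes.
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    return math.ceil(ip_payload / per_fragment)


if __name__ == "__main__":
    for size in (1472, 8192, 32768):
        print(size, fragment_count(size))
```

With a 1500-byte MTU, an 8192-byte payload needs 6 fragments and a 32768-byte payload needs 23, before counting RPC and NFS headers.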

All good.. just a misunderstanding :D I just happened to discuss the issue with a colleague the day the first post came out :)


Copyright © 2022, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds