
Case-insensitive ext4

By Jake Edge
March 27, 2019

Handling file names in a case-insensitive way for Linux filesystems has been an ongoing discussion topic for many years. It is a (dubious) feature of filesystems for other operating systems (e.g. Android, Windows, macOS), but Linux has limited support for it. Over the last year or more, Gabriel Krisman Bertazi has been working on the problem for ext4, but it is a messy one to solve. He recently posted his latest patch set, which reflects some changes made at the behest of Linus Torvalds.

At the 2018 Linux Plumbers Conference (LPC), Krisman presented his plan for allowing ext4 filesystems to be case-insensitive. That plan would have enhanced the kernel's Native Language Support (NLS) subsystem to better support multi-byte encodings and expand the case-folding to handle UTF-8. NLS exists to handle filesystems, such as FAT, that support file names with different encodings, which are specified at mount time. Krisman posted his patch set to make those changes in December shortly after LPC, but Torvalds objected to the whole idea:

Why do people want to do this? We know it's a crazy and stupid thing to do. And we know that, exactly because people have done it, and it has always been a mistake.

He went on to list a number of different problems that can arise with case-insensitivity—many of which have occurred along the way. He asked for use cases: "I really want to know what is driving this insanity, and what the actual use-case is." But he made it pretty clear that he was—at a minimum—skeptical.

The old DOS/Mac people thought case insensitivity was a "helpful" idea, and that was understandable - but wrong - even back in the 80's. They are still living with the end result of that horrendously bad decision decades later. They've _tried_ to fix their bad decisions, and have never been able to (except, apparently, in iOS where somebody finally had a glimmer of a clue).

Theodore Y. Ts'o, who has been working with Krisman on this effort, had apparently brought the patch set to Torvalds's attention in a private email that Torvalds quotes. Another reply also didn't make it into the thread, but in that message (which Torvalds also quotes) Ts'o noted that there was no plan to support encodings other than UTF-8 (and ASCII), which would be set on a per-filesystem basis. Case-insensitivity would be set on a per-directory basis. Given that, Torvalds was adamant that the NLS code was the wrong place to make these changes:

Either you have a horrible fundamental design mistake that has different per-filesystem locales, or you don't.

If you don't, you shouldn't be touching any of the nls code.

Whatever unicode tables you use for case folding shouldn't be in the nls code.

Ts'o suggested moving the Unicode handling code to fs/unicode rather than changing the NLS code. He also described the current state of play with regard to case-sensitivity in filesystems for macOS and Windows, as well as for network filesystems like Samba and NFS. Over time, Ts'o said, the inconsistencies in handling file names between different filesystems have mostly been eliminated. In January, Krisman posted version 5 of his patch set, which reflects the switch to the fs/unicode directory.

The patch set also makes a more substantial change in that it switches normalization methods. There are multiple ways to create the "same" string in Unicode, which is known as "equivalence". Two different sets of code points that appear the same to a user, but not to the filesystem, would be confusing, so there are normalization mechanisms to allow comparisons that take equivalence into account. Ts'o described the confusion that can result:

In the bad old-days, MacOS X's HFS+ was not normalization-preserving. So it would force filenames to NFD form --- so if the user tried to create a file named Å, and passed in the Unicode string U+212B to creat(2), HFS+ would store it as U+0041,U+030A and that is what readdir(2) would return. Apple has effectively admitted this was a mistake, and their new APFS doesn't do this any more.

Now, both file systems basically say, "we don't care whether you pass in U+212B or U+0041,U+030A; on the screen it looks identical, Å, so we will treat it as the same filename; but readdir(2) will return what you gave us."
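The equivalence Ts'o describes can be checked from user space. What follows is a minimal Python sketch, not the kernel code: it uses the standard unicodedata module to show that U+212B and U+0041,U+030A differ as code-point sequences but compare equal once both are put into NFD form. The helper name nfd() is only illustrative.

```python
import unicodedata

angstrom_sign = "\u212b"       # ANGSTROM SIGN, the single code point from the quote
a_with_ring = "\u0041\u030a"   # LATIN CAPITAL LETTER A followed by COMBINING RING ABOVE


def nfd(name: str) -> str:
    """Return the canonical decomposition (NFD) of a file name."""
    return unicodedata.normalize("NFD", name)


# The raw strings differ, so a byte-for-byte comparison sees two different names.
raw_equal = angstrom_sign == a_with_ring

# After NFD both are U+0041,U+030A, so a normalizing comparison sees one name.
normalized_equal = nfd(angstrom_sign) == nfd(a_with_ring)
```

A normalization-preserving filesystem would still hand back whichever of the two forms was passed to creat(2); only the comparison uses the decomposed form.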

The new patch set switched from NFKD to NFD, which in normalization lingo means a switch from "compatibility" to "canonical" decomposition:

The main change presented here is a proposal to migrate the normalization method from NFKD to NFD. After our discussions, and reviewing other operating systems and languages aspects, I am more convinced that canonical decomposition is more viable solution than compatibility decomposition, because it doesn't ignore eliminate any semantic meaning, like the definitive case of superscript numbers. NFD is also the documented method used by HFS+ and APFS, so there is precedent. Notice however, that as far as my research goes, APFS doesn't completely [follow] NFD, and in some cases, like <compat> flags, it actually does NFKD, but not in others (<fraction>), where it applies the canonical form. We take a more consistent approach and always do plain NFD.
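The superscript case Krisman mentions is easy to reproduce. The sketch below, again using Python's unicodedata module rather than anything from the patch set, compares the two decompositions for a hypothetical file name containing SUPERSCRIPT TWO.

```python
import unicodedata

name = "x\u00b2.txt"  # "x².txt", an illustrative name containing SUPERSCRIPT TWO

# Canonical decomposition keeps the superscript, since it only has a
# compatibility mapping.
nfd_form = unicodedata.normalize("NFD", name)

# Compatibility decomposition replaces the superscript with a plain "2".
nfkd_form = unicodedata.normalize("NFKD", name)

# Under NFKD the name becomes indistinguishable from "x2.txt"; under NFD it does not.
collides_under_nfkd = nfkd_form == unicodedata.normalize("NFKD", "x2.txt")
collides_under_nfd = nfd_form == unicodedata.normalize("NFD", "x2.txt")
```

With compatibility decomposition the two names would refer to the same file, which is the loss of semantic meaning that motivated the switch.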

As those quotes indicate, normalization is a messy business. In fact, the whole problem of case handling is a horrific mess, as Torvalds (and others) noted. But there are use cases, mostly involving interoperability with other operating systems. In addition, user-space implementations, with a variety of shortcomings, exist for both Android (to support /sdcard) and Samba—those could perhaps be replaced with an in-kernel solution.

That posting did not generate all that many comments, though there was a question from Pali Rohár about the normalization change. He was concerned that NFD would be incompatible with various other Linux user-space tools. But Krisman explained that the patch set implements name-preserving semantics and that NFD is only used internally for comparison.
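Name-preserving semantics can be modeled with a toy directory that keeps the name exactly as it was created and uses a normalized, case-folded key only for lookups. This is a sketch under stated assumptions: the class and function names are invented for illustration, and Python's str.casefold() stands in for the kernel's own case-folding tables.

```python
import unicodedata


def lookup_key(name: str) -> str:
    """Comparison key: decompose, case-fold, then decompose again."""
    folded = unicodedata.normalize("NFD", name).casefold()
    return unicodedata.normalize("NFD", folded)


class CaseFoldingDirectory:
    """Toy model of a case-insensitive, name-preserving directory."""

    def __init__(self) -> None:
        self._entries = {}  # lookup key -> name exactly as given at creation

    def create(self, name: str) -> None:
        key = lookup_key(name)
        if key in self._entries:
            raise FileExistsError(name)
        self._entries[key] = name

    def lookup(self, name: str):
        """Return the stored name matching `name`, or None if there is none."""
        return self._entries.get(lookup_key(name))

    def readdir(self):
        """Return names as they were created, never the normalized keys."""
        return list(self._entries.values())
```

User-space tools that read the directory only ever see the original names, which is why the choice of NFD for the internal key should not affect them.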

Handling invalid UTF-8 byte sequences also came up. There are effectively two possible ways to handle the problem, Krisman said. Either the filesystem can reject any file name that is invalid UTF-8 (and fix any that are found on the disk) or it can simply treat an invalid UTF-8 file name as it would be today, so there would be no case-folding or normalization. Both are implemented and a given filesystem's behavior can be configured with a feature flag; the default is to treat them as an opaque byte sequence as they are currently.
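The two policies can be sketched as a single key function with a strict switch. This is an illustration only; the function name and the strict parameter are assumptions, not part of the patch set.

```python
import unicodedata


def comparison_key(raw: bytes, strict: bool = False) -> bytes:
    """Return the bytes used to compare a file name.

    strict=True models the "reject invalid UTF-8" policy; strict=False
    models the default of treating such names as opaque byte sequences.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        if strict:
            raise ValueError("file name is not valid UTF-8")
        return raw  # opaque: no normalization, no case-folding
    folded = unicodedata.normalize("NFD", text).casefold()
    return unicodedata.normalize("NFD", folded).encode("utf-8")
```

In the non-strict mode a name containing a stray 0xff byte only matches itself, byte for byte, while valid names still match regardless of case.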

On March 18, Krisman posted version 6, with few changes from the previous version. He is trying to flush out any opposition to the normalization change (or anything else in the patch set), presumably in the hopes of getting it upstream soon. So far, there has only been a question from Randy Dunlap about the impact on ext3 filesystems, which are handled by the ext4 code. Ts'o noted that "strictly speaking, there is no such thing as an 'ext3 file system'" these days. Filesystems handled by the ext4 code are defined by the feature bits they have set; if you create a filesystem using "-t ext3" and do not override any of the options, though, it will not have any of the new features enabled, thus it will be unaffected by them.

In order to use the feature, the filesystem will need to be created with encoding-awareness information stored in the superblock. On an encoding-aware ext4 filesystem, case-insensitivity can be enabled on an empty directory (and its children) by setting an inode attribute. That can be done by setting the EXT4_CASEFOLD_FL flag with an ioctl() call, though eventually the chattr command would presumably be updated to add support for the case-folding flag. It should be noted that case-folding and ext4 encryption cannot be used concurrently for the same directory, though Krisman is planning to change that restriction down the road.
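Setting the attribute from user space might look like the following Python sketch. It rests on several assumptions that are not taken from the article: that the flag is changed through the generic FS_IOC_GETFLAGS and FS_IOC_SETFLAGS inode-flag ioctls, that the request numbers are those of common 64-bit Linux architectures, and that the flag's value is 0x40000000 as in the kernel headers. The helper names are invented for illustration.

```python
import fcntl
import os
import struct

# Assumed values for common 64-bit Linux architectures; verify them against
# linux/fs.h on the target system before relying on them.
FS_IOC_GETFLAGS = 0x80086601
FS_IOC_SETFLAGS = 0x40086602
EXT4_CASEFOLD_FL = 0x40000000


def with_casefold(flags: int) -> int:
    """Return the inode flags with the case-folding bit added."""
    return flags | EXT4_CASEFOLD_FL


def enable_casefold(directory: str) -> None:
    """Set the case-folding flag on an (empty) directory."""
    fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
    try:
        buf = bytearray(8)  # room for a C long; the flags occupy the first 4 bytes
        fcntl.ioctl(fd, FS_IOC_GETFLAGS, buf)  # fills buf in place
        (flags,) = struct.unpack_from("I", buf)
        struct.pack_into("I", buf, 0, with_casefold(flags))
        fcntl.ioctl(fd, FS_IOC_SETFLAGS, buf)
    finally:
        os.close(fd)
```

Calling enable_casefold() on a directory that is not empty, or on a filesystem without the encoding information in its superblock, would be expected to fail with an OSError.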

Both encoding-awareness and case-insensitivity are fairly large changes to the traditional handling of file names. Unix file names have always been sequences of any byte values (except NUL and "/") without being interpreted in any way. If these changes are adopted, some ext4 filesystems will now be substantially changing the semantics of various filesystem operations. File creation and renaming will no longer operate the way they do today, for example.

However, case-insensitivity is a feature that has been a long time coming and we may see it in the mainline before long. At this point, though, it has only run the gauntlet of the filesystem mailing lists; when it gets posted to linux-kernel, there may be others with opinions—or outright objections. If not, though, Linux 5.3 or 5.4 might just have a feature that has been on some people's wish lists for a decade or two.

Index entries for this article
Kernel: Filesystems/Case-independent lookups
Kernel: Filesystems/ext4
Kernel: UTF-8 encoding



Case-insensitive ext4

Posted Mar 27, 2019 18:12 UTC (Wed) by clugstj (subscriber, #4020) [Link] (36 responses)

I've yet to see a legitimate use case for putting this brain damage in the kernel. Does anyone actually have one?

Case-insensitive ext4

Posted Mar 27, 2019 19:08 UTC (Wed) by marcH (subscriber, #57642) [Link] (1 responses)

> I've yet to see a legitimate use case for putting this brain damage in the kernel

Without overly complicated code security researchers wouldn't have any work to do!

> Case-insensitivity would be set on a per-directory basis

Insanity has no limit. I was using the (otherwise pretty cool) Windows Subsystem for Linux. This is what happened:
https://github.com/vector-of-bool/vscode-cmake-tools/issu...

Because I was using the same project sometimes from WSL and sometimes from Windows, some directories *in the same project* were created case-sensitive and others not. Hilarity ensued.

Case-insensitive ext4

Posted Mar 27, 2019 19:47 UTC (Wed) by mathstuf (subscriber, #69389) [Link]

Well, CMake has "is case sensitive" logic baked in at compile time. Apple and Windows are "always case insensitive" and everything else is always case sensitive. I don't know what kinds of changes would be required in build tools to do these case-insensitive comparisons. For example, this just doesn't work with ninja on Windows (I assume make has similar issues with the analogous ruleset):

rule copy
  command = cp $in $out
build foo: copy in
build bar: copy FOO

saying that no rule makes FOO even though technically it will exist if you build foo. Basically, build tools that exist today need cases to match everywhere. And yes, ninja could figure this out right now, but if `dir/foo` and `dir/FOO` are used and `dir` is made by some rule during the build, its case sensitive flag can't be known at the start.

Case insensitivity in filesystems is broken. Conditional case sensitivity at a per-filesystem level means even ninja needs to add ioctl queries to figure that out, but `--one-file-system` is something that is at least enforceable. Per-directory flags which require magical "what will the flag on this directory be in the future" is even more broken.

I'd be surprised if "doesn't work in case insensitive ext4 directories" (nevermind an environment with a mix of case sensitive and insensitive directories) issues don't get closed as WONTFIX in many tools.

Case-insensitive ext4

Posted Mar 27, 2019 19:10 UTC (Wed) by Karellen (subscriber, #67644) [Link] (20 responses)

Yeah - I wonder what is the specific use case that would not be solved better by having case-insensitive globbing and autocompletion in the shell? In what other situation does a program know that it needs to open a file, and knows the name of that file, but doesn't know the precise capitalisation/normalisation of the name?

Case-insensitive ext4

Posted Mar 27, 2019 19:25 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (19 responses)

If you don't add case-insensitive version of open() and friends then every open() call will have to scan the whole directory first. This adds up quickly for Samba and other file-server use-cases.

Case-insensitive ext4

Posted Mar 27, 2019 20:02 UTC (Wed) by rweikusat2 (subscriber, #117920) [Link] (15 responses)

It could as well maintain a userspace dictionary mapping normalized/lowercased names to their actual names (which could be maintained incrementally based on filesystem change notifications).

Case-insensitive ext4

Posted Mar 27, 2019 20:05 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (11 responses)

Linux has no filesystem notification mechanisms that have required consistency and performance for a fileserver use-case.

Case-insensitive ext4

Posted Mar 27, 2019 21:17 UTC (Wed) by rweikusat2 (subscriber, #117920) [Link] (8 responses)

There is no such thing as "a fileserver use case". Samba exists and has existed (and been used) for a while, hence, there are obviously "file server use cases" where the existing mechanisms perform well enough.

Case-insensitive ext4

Posted Mar 27, 2019 21:26 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

Samba exists, but it does case insensitivity in a very expensive way. There are other use-cases as well, like mounting FAT filesystems.

I've seen this firsthand - I'm using a Linux server for TimeMachine backups for Mac OS X. TimeMachine is braindead - it creates hundreds of thousands of files in the same directory. With the default settings Samba slowed down to a crawl.

Fortunately, TimeMachine doesn't care about file name cases. So by following steps from here: https://wiki.samba.org/index.php/Performance_Tuning I was able to speed up backups by something like 10x. This is not insignificant and it would be nice for Linux to handle similar use-cases natively.

Case-insensitive ext4

Posted Mar 28, 2019 0:50 UTC (Thu) by rahulsundaram (subscriber, #21946) [Link] (6 responses)

> there are obviously "file server use cases" where the existing mechanisms perform well enough.

Have you talked to Samba developers and asked them if they are happy with the current performance or would like to see better support from the kernel? If you haven't I would encourage you to do that or talk to enterprises supporting Samba or even large customers. I think you will find those perspectives useful to add to your opinions.

Case-insensitive ext4

Posted Mar 28, 2019 16:10 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (5 responses)

Someone claimed that Samba opening a file residing on a case-sensitive filesystem would require a pre-open, linear directory traversal. As I pointed out, this isn't true, at least not on Linux: It would be possible to use an incrementally maintained, userspace translation cache instead, however, unless I again have to use Samba for something in a resource-constrained environment, I'm not going to implement that and in the unlikely case that this would happen, I'd certainly not go through the rather pointless hassle of trying to contribute a non-trivial change to an open sausage project, as I have neither the time to do this nor the social skills and pedigree to do so successfully.

Case-insensitive ext4

Posted Mar 28, 2019 18:36 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> As I pointed out, this isn't true, at least not on Linux: It would be possible to use an incrementally maintained
Nope. There is no way to maintain this cache with any sort of consistency guarantees. Linux filesystem change notifications are not up to it.

Case-insensitive ext4

Posted Mar 29, 2019 3:03 UTC (Fri) by pabs (subscriber, #43278) [Link] (2 responses)

What are they missing now that recent Linux versions offer rename notifications and other directory change notifications?

Case-insensitive ext4

Posted Mar 29, 2019 4:21 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

inotify is not recursive. It's also best-effort and its notifications are asynchronous.

fanotify is better, but it also can drop events from time to time under high load.

Case-insensitive ext4

Posted Oct 4, 2023 18:51 UTC (Wed) by calumapplepie (guest, #143655) [Link]

You get an event when the queue overflows. If you clear the cache on receiving such an event, you can provide consistency guarantees, at the cost of bad performance while the cache is rebuilt. Since the queue is pretty big, overflows shouldn't happen too much.
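The overflow handling described here can be sketched without touching the real inotify API. In the Python below, only the event-mask values are taken from the inotify headers; the NameCache class, its scan callback, and the event-feeding method are hypothetical, and events are fed in by hand rather than read from a descriptor. It illustrates only the "drop everything on overflow" rule and does nothing about the asynchronous delivery mentioned above.

```python
IN_MOVED_FROM = 0x40
IN_MOVED_TO = 0x80
IN_CREATE = 0x100
IN_DELETE = 0x200
IN_Q_OVERFLOW = 0x4000


class NameCache:
    """Folded-name cache for one directory, discarded when events are lost."""

    def __init__(self, scan):
        self._scan = scan    # callable returning the directory's current names
        self._names = None   # None means "not loaded, or invalidated"
        self.rescans = 0

    def handle_event(self, mask: int, name: str = "") -> None:
        if mask & IN_Q_OVERFLOW:
            self._names = None  # events were dropped: nothing cached can be trusted
            return
        if self._names is None:
            return              # nothing loaded; the next lookup rescans anyway
        if mask & (IN_CREATE | IN_MOVED_TO):
            self._names[name.casefold()] = name
        elif mask & (IN_DELETE | IN_MOVED_FROM):
            self._names.pop(name.casefold(), None)

    def lookup(self, name: str):
        if self._names is None:
            self._names = {n.casefold(): n for n in self._scan()}
            self.rescans += 1
        return self._names.get(name.casefold())
```

The cost shows up in the rescans counter: every overflow forces a full directory scan on the next lookup.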

Case-insensitive ext4

Posted Mar 29, 2019 22:55 UTC (Fri) by jra (subscriber, #55261) [Link]

We already have an incrementally maintained, userspace translation cache in Samba. It catches the simple cases where we've seen a filename before - we cache it.

Unfortunately it isn't enough. Cache misses are the problem. If the SMB client sends a filename "foo" and it isn't in the directory, we don't know if it doesn't exist, or exists under another case (e.g. as "Foo"). In that case we need to scan the directory. This gets really expensive, really quickly.

We don't negatively cache as we're often used to export filesystems that local processes are also modifying.

I've been wanting a case-insensitive filesystem lookup option in Linux for a long time (I think ZFS and XFS already have it, however flawed).

Case-insensitive ext4

Posted Mar 28, 2019 7:28 UTC (Thu) by patrakov (subscriber, #97174) [Link] (1 responses)

Wouldn't it simplify things if SAMBA stopped any attempts to export an existing directory tree? I.e. mandate that the only way to make a new file exported is to copy it in via the SMB protocol, quite possibly from localhost. Keep filenames opaque, keep files in a clearly-private area, teach users not to mess with them (like they don't directly mess with MySQL files). Keep whatever attributes Windows needs in some sort of a database.

Case-insensitive ext4

Posted Mar 28, 2019 7:35 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

This is doable and is fairly easy, given that Samba has a well-defined pluggable VFS layer.

But this will break a ton of other software that wants to directly modify the disk files. It will also mean that Linux's VFS is inadequate for a fairly common use-case.

Case-insensitive ext4

Posted Mar 27, 2019 20:17 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (2 responses)

Ideally this would be some kind of LRU cache which would have some flag to say "this is a one-off open, don't cache" to avoid the inotify (or whatever) mechanism. Plus, I'm sure folks would love having the C library run a thread in the background to listen for its notifications taking locks on this cache whenever something happens. Yeah, I don't see any race conditions, unpredictable latency issues, or TOCTOU/cache coherency issues here at all.

Sorry for the snark, it's not in response to your comment in particular, but my mind coming up with all the Pandora's boxes this is threatening to open.

Case-insensitive ext4

Posted Mar 27, 2019 21:11 UTC (Wed) by rweikusat2 (subscriber, #117920) [Link] (1 responses)

There's no point in special-casing "one-off opens" unless this demonstrably solves a problem. As the kernel open has to scan the directory, anyway, you'll end up with the exact same kind of TOCTOU races. This is a problem which can't really be solved. As to your other objections: This is a generic list of programming errors, some of them attributable to the idea of "a background thread".

It's possible to implement case-insensitive open in user space without doing a second linear search through a directory for every open.

Case-insensitive ext4

Posted Mar 27, 2019 21:19 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

As I understand, the kernel will keep track of canonicalized names in file cache, so it won't have to do a search.

There's also the problem of making sure that no duplicate files exist.

Case-insensitive ext4

Posted Mar 29, 2019 9:16 UTC (Fri) by Karellen (subscriber, #67644) [Link] (2 responses)

Why?

How is a call to open() getting the filename to open? Either it's going to come from an existing directory scan, in which case the capitalisation/normal form should already be correct, or it's going to be because a user has selected a file - in which case the shell/picker/whatever should be able to do that work already?

Where would calls to open() be getting these correctly named but incorrectly capitalised/normalised filenames from?

Case-insensitive ext4

Posted Mar 29, 2019 9:20 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

> How is a call to open() getting the filename to open? Either it's going to come from an existing directory scan, in which case the capitalisation/normal form should already be correct, or it's going to be because a user has selected a file - in which case the shell/picker/whatever should be able to do that work already?
You have an SMB request to open a file, with a file name. There's nothing else.

You can try a happy case and just attempt an open() with the provided name. If it fails, you need to scan the directory to find a matching file with a different case.

And you can't really cache the negative result, patterns like "if !exists(fname) {creat(fname);}" are exceedingly common.
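The "happy case, then scan" pattern described above can be sketched in a few lines of Python. This is not Samba's code; the function name is invented, and str.casefold() is a stand-in for whatever matching rules the server really applies.

```python
import os


def open_case_insensitive(directory: str, name: str) -> int:
    """Open `name` in `directory`, ignoring case, and return a file descriptor."""
    # Happy case: the client happened to send the exact on-disk name.
    try:
        return os.open(os.path.join(directory, name), os.O_RDONLY)
    except FileNotFoundError:
        pass
    # Slow path: a linear scan of the directory, paid on every miss.
    wanted = name.casefold()
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.name.casefold() == wanted:
                return os.open(entry.path, os.O_RDONLY)
    raise FileNotFoundError(name)
```

A request for a file that does not exist at all always takes the slow path, which is why directories with hundreds of thousands of entries hurt so much.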

Case-insensitive ext4

Posted Apr 4, 2019 17:09 UTC (Thu) by Wol (subscriber, #4433) [Link]

> Where would calls to open() be getting these correctly named but incorrectly capitalised/normalised filenames from?

The user, maybe?

What about the use case where I type in a name in a picker, and it displays a bunch of matches?

Or what about the case where I typed in the name on the command line? Some of us still use a command line, you know ...

Cheers,
Wol

Case-insensitive ext4

Posted Mar 28, 2019 2:05 UTC (Thu) by dw (subscriber, #12017) [Link] (1 responses)

As a recent MacOS refugee, frankly I find dicking around with case intensely annoying on a desktop after discovering things don't have to be that way. Sure it's supposed to be in userspace -- good luck with that. Darwin has an approach that works for users, and I'd be very happy to enable this flag the moment it becomes available.

Was disgusted just last night to discover a Gtk chooser dialog's autocomplete was case sensitive. In a GUI. Total disconnect between Linux and what the real world has been doing successfully for decades now..

Case-insensitive ext4

Posted Mar 28, 2019 10:44 UTC (Thu) by mpr22 (subscriber, #60784) [Link]

Case-insensitivity on ext4 volumes has to be done in userspace, because doing it in the kernel breaks userspace.

I agree that the Gtk file chooser having case-sensitive autocomplete is daft, but... I don't actually care, because I hate the Gtk file chooser anyway for other, more fundamental design decisions.

Case-insensitive ext4

Posted Mar 28, 2019 2:10 UTC (Thu) by dw (subscriber, #12017) [Link] (5 responses)

As a recent MacOS refugee (after a long previous history on desktop Linux), frankly I find dicking around with case on a desktop extremely annoying, after discovering things don't have to be that way. Sure it's supposed to be in userspace -- good luck retrofitting all that, and really it's just passing the blame. Darwin has an in-kernel approach that works for users, and I'd be very happy to enable this flag the instant it becomes available.

I was disgusted just last night to discover a Gtk chooser dialog's autocomplete was case sensitive. In a GUI. In 2019. Total disconnect between Linux and what the real world has been doing successfully for decades, and what actual users expect. No doubt someone will pop up to say 'but I prefer it that way', well, you're free to patch whatever brainwrong you like into your desktop, but most people cannot and do not want that -- it's why contemporary developers are walking around with MacBooks rather than Linux boxes.

Case-insensitive ext4

Posted Mar 28, 2019 3:56 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (4 responses)

What do you expect of tools that have to deal with non existent files in a case insensitive world? Mainly, I'm thinking of build tools here, but there are other categories too where this crops up. Should make expect that there is a dependency here?

foo:
	touch foo
bar: Foo
	cp Foo bar

Because if so, this means that tools now need to make a syscall just to do path manipulation to be accurate (something like canonpath() that would give a path which is the same for all equivalent input paths maybe by doing tolower() and normalization). And it has to work for paths that don't exist yet. And I don't think that can even be correct because that path might end up having a bind mount in there at some point which changes behavior (yeah, low chance, but kernels don't always have that luxury).
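To make the cost concrete, here is a rough Python sketch of what a build tool would have to do to match the Foo dependency to the foo rule. Both canon_key() and find_rule() are hypothetical names, and the case_insensitive argument is exactly the piece of information the tool cannot know for a directory that does not exist yet.

```python
import unicodedata


def canon_key(path: str) -> str:
    """Fold each path component so equivalent spellings compare equal."""
    return "/".join(
        unicodedata.normalize("NFD", part).casefold() for part in path.split("/")
    )


def find_rule(rules: dict, dependency: str, case_insensitive: bool):
    """Return the target whose rule produces `dependency`, or None."""
    if dependency in rules:
        return dependency
    if not case_insensitive:
        return None
    wanted = canon_key(dependency)
    for target in rules:
        if canon_key(target) == wanted:
            return target
    return None
```

With rules for foo and bar, looking up Foo finds nothing in case-sensitive mode and finds foo in case-insensitive mode, so the same build description yields a different dependency graph depending on a per-directory flag.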

Yeah, case insensitivity might be useful at the UI level, but even there you still have to deal with paths using binary data or invalid utf8 because a file that the GUI can't delete is a wonderful thing to diagnose and resolve. Personally, I don't find it that useful (but I encourage you to file an issue against GTK for the completion thing).

Case-insensitive ext4

Posted Mar 28, 2019 19:11 UTC (Thu) by jccleaver (subscriber, #127418) [Link] (3 responses)

Ultimately, this really brings to mind how important initial design is for things.

Classic Mac OS was designed with case-insensitivity in mind, had no manual tools that needed to be imported with minimal effort rather than a complete rewrite, and had no shell mechanics to emulate.

Case Insensitivity #JustWorks when people expect it and are going through translation layers (and aren't in the business of writing drivers), and doesn't when people assume low level access.

Case-insensitive ext4

Posted Mar 28, 2019 20:14 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (2 responses)

I still wonder how this would have worked even on Classic Mac OS. Do you just assume that *all* paths can be normalized regardless of location or host filesystem? If so, do you just not support filesystems with alternate paths? Though I suppose the Windows solution of mangling unsupported names on the render side works too[1]. However, this means that any path manipulation has to do a syscall to get some canonical representation of a path or each program has to have a "pathcmp" function to determine that "foo" and "FOO" are really the same thing.

Would you have expected the shown Makefile snippet to work on Classic Mac OS or would an error that "no rule to make FOO" be acceptable?

[1]Making a path appear in Explorer via a network share with the name "CON1" renders as some mangled name. Creating a file with that mangled name then shows two files with the same name appear. Deleting either one via the UI deletes the one with the real mangled name first (I assume given a HANDLE, they can be differentiated).

Case-insensitive ext4

Posted Mar 28, 2019 20:35 UTC (Thu) by k8to (guest, #15413) [Link]

Yeah, mounting case sensitive filesystem on classic MacOS would have been messy. I'm sure I did this at some point with e.g. Basilisk II mounting the Linux filesystem underneath it, but that had a hefty translation layer to support the other oddities of classic MacOS like forks etc.

I think that was the approach taken by other people too, probably one of Apple Single or Apple Double representations which probably had some solution for NFS which was still in vogue in the 90s.

It wasn't that nice an experience for the Mac users or the non-mac users. I never programmed against it to experience the extra sharp edges, though.

Case-insensitive ext4

Posted Mar 28, 2019 21:01 UTC (Thu) by jccleaver (subscriber, #127418) [Link]

> Do you just assume that *all* paths can be normalized regardless of location or host filesystem?

I think by System 7.5 (or 7.1 Pro) you did, because if I recall correctly that's how File Exchange/PC Exchange did its work.

Remember, in classic Mac OS the colon ':' was the directory separator in paths, and you could use '/'s to your heart's content. Actually, you could use pretty much anything to your heart's content, including spaces, punctuation (since no one in the Mac side cared about extensions) and even weird graphs like the f-hook or florin https://en.wikipedia.org/wiki/%C6%91#Appearance_in_comput... , which I still find myself occasionally doing on OS X 20 years later.

Anyway, with /. \. and : being used in different locations, there was definitely path-mangling going on below the interface. But general users didn't have to care, and most Mac programs didn't deal with constructed path names, and *never* had to worry about shell-quoting for spaces and whatnot.

Between this freeform text attitude, the resource and data fork dichotomy, and the use of Type and Creator codes, I definitely feel like we've lost some good capabilities on the Mac side in the quest for broader interoperability.

Case-insensitive ext4

Posted Mar 28, 2019 8:11 UTC (Thu) by daniels (subscriber, #16193) [Link]

> I've yet to see a legitimate use case for putting this brain damage in the kernel. Does anyone actually have one?

No, everyone involved is just doing this for absolutely no reason at all. Weird.

Case-insensitive ext4

Posted Mar 28, 2019 8:39 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (2 responses)

Any shared filesystem (network filesystem, removable media) will soon become completely unusable if different systems write on it with different default encodings. Treating filenames as opaque bunches of bytes does not work because you need to convey filenames to humans at some point. Humans do not understand raw bytes they understand decoded bytes, and that requires knowledge of the encoding used in filenames.

So any shared filesystem will need to export to userspace the encoding used for each part of its tree (either a single encoding for everything, or separate encodings per subtree).

Casing is something else but once you get past the encoding point casing becomes less hard to tackle.

Case-insensitive ext4

Posted Mar 28, 2019 15:58 UTC (Thu) by nybble41 (subscriber, #55106) [Link] (1 responses)

> Casing is something else but once you get past the encoding point casing becomes less hard to tackle.

Not much less. Casing rules depend not just on encoding but also locale, and while it may be practical to enforce a single universal encoding and normalization scheme you're definitely not going to get away with enforcing a single universal locale.

The logical way to handle normalization is to simply disallow non-normalized filenames. The kernel doesn't change the encoding or compare different normal forms, it just verifies that the names of new files are in a particular normal form and returns an error if they aren't. Since all names are already in the same normal form comparisons reduce to exact binary matches. The equivalent for case would be to disallow either lowercase or uppercase characters in filenames (assuming you could even clearly define what is "uppercase" or "lowercase"—it depends on the locale). People put up with that in the DOS era but I don't think it would be considered acceptable today.
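The "verify, don't convert" rule amounts to a one-line check at creation time. The Python below is a sketch with an invented function name, and the choice of NFC as the required form is an assumption; the argument works the same way for any single normal form.

```python
import unicodedata


def check_new_name(name: str, form: str = "NFC") -> str:
    """Accept a new file name only if it is already in the required normal form."""
    if unicodedata.normalize(form, name) != name:
        raise ValueError("file name is not in %s form" % form)
    return name  # stored unchanged; later comparisons are exact matches
```

A precomposed "hëllo.txt" passes, while the same name spelled with a combining diaeresis is refused rather than silently rewritten.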

The odds that encoding or normalization would be permitted to vary per-filesystem or per-subtree are negligible. Applications aren't prepared to deal with that, nor should they be expected to do so. Any conversions needed for shared filesystems should be handled at the lowest layers of the filesystem, between the storage or network and the kernel.

Case-insensitive ext4

Posted Mar 29, 2019 10:52 UTC (Fri) by nim-nim (subscriber, #34454) [Link]

Assuming the lowest layers of the stack could handle conversions transparently (which I'm doubtful of, that would require low-level knowledge about every possible encoding variation on earth), you still need to know the encoding(s) you start with. Meaning, you have to put at least one pivot encoding definition inside your filesystem.

That's the part people object to, because they are used to the simplicity of pushing encoding problems somewhere else, with "filenames are streams of bytes". Which was not true even for original UNIX. Actual original Unix filename bytes were 7bit ASCII bytes and nothing else.

But 7bit ASCII is useless in a modern i18n world. So you need to record other pivot encoding(s) in filesystems¹.

¹ Record, not reproduce the mistake of original UNIX, which assumed there was a single encoding that would never evolve so there was no need to make it explicit; easy mistake to make in the simpler computer age they lived in; inexcusable mistake to make today.

Case-insensitive ext4

Posted Mar 28, 2019 14:28 UTC (Thu) by smurf (subscriber, #17840) [Link]

Yes. Besides case insensitivity there's also the issue of differently-normalized file names. I would like to have one well-defined on-disk normal form. Otherwise I save "hëllo.txt" (e, combining diaeresis) and then fail to open "hëllo.txt" (e-with-diaeresis). This problem affects my desktop interface as well as web servers with "interesting" URLs.
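To make the mismatch concrete, here is a small sketch in Python. The two spellings are written with explicit escapes so they cannot be confused, and `on_disk_name` is a hypothetical helper that assumes NFC as the single on-disk form (the comment does not pick one).

```python
import unicodedata

decomposed = "he\u0308llo.txt"   # 'e' + combining diaeresis (U+0308)
precomposed = "h\u00ebllo.txt"   # e-with-diaeresis (U+00EB)

# The two spellings render identically but are different byte strings,
# so a byte-wise directory lookup treats them as different names.
raw_match = decomposed.encode("utf-8") == precomposed.encode("utf-8")


def on_disk_name(name: str) -> bytes:
    """Map a name to one normal form (NFC assumed) before storing or looking up."""
    return unicodedata.normalize("NFC", name).encode("utf-8")


normalized_match = on_disk_name(decomposed) == on_disk_name(precomposed)
print(raw_match, normalized_match)
```

With a single normal form applied on both the save and the open path, the two spellings resolve to the same entry.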

Case-insensitive ext4

Posted Mar 27, 2019 18:38 UTC (Wed) by hkario (subscriber, #94864) [Link] (11 responses)

the problem is not the encoding of the file names on disk; as long as all the users of the file system (as in, the operating systems, thinking of a USB stick) know what the encoding is, the exact one is an implementation detail - though the names need to be normalized before being written to disk (or read and compared with others)

what IS important is what LOCALE the file system, or rather the user, is working in

lower case "I" (India) in Turkish locale is a letter "ı" (dot-less i). And no, the "I" in Turkish is not any different than the "I" in English, German or Polish, it's the same Unicode codepoint.

also, there's "İ" that is down-cased to "i" in Turkish locale, and again, the "i" is not special

combine this with two users that work in different locales on the same file system and "fun ensues"
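A short Python illustration of the two views of the same name. Python's `str.lower()` ignores the process locale and applies Unicode's language-neutral mappings, so the Turkish behaviour is emulated here with a tiny hand-written table; `turkish_lower` is a hypothetical helper, not a library function, and covers only the two letter pairs mentioned above.

```python
# Unicode's default (language-neutral) lowercase mappings:
default_lower_I = "I".lower()            # 'i'
default_lower_dotted = "\u0130".lower()  # 'i' + U+0307 (combining dot above)

# Assumption: a minimal stand-in for real Turkish tailoring.
_TURKISH_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})


def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkish I/ı and İ/i pairs (illustrative only)."""
    return text.translate(_TURKISH_LOWER).lower()


name = "FILE.TXT"
as_seen_in_english = name.lower()         # 'file.txt'
as_seen_in_turkish = turkish_lower(name)  # 'f\u0131le.txt', with a dotless i
```

Two users folding the same stored name under these two rules end up with different lookup keys, which is the "fun" described above.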

Case-insensitive ext4

Posted Mar 27, 2019 19:04 UTC (Wed) by k8to (guest, #15413) [Link]

You can even have this with two processes for the same user, although it's less likely. It's not that unlikely.

It's a common defensive pattern to set the locale to something like "C" when you programmatically fire off utilities like 'ps', e.g. for portable process-id validation, and although this can't produce as much confusion as Turkish vs en_US.utf8, it can produce similar problems.
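That defensive pattern looks roughly like the following in Python. `run_in_c_locale` is a hypothetical wrapper name; the example runs a child Python process so that it works anywhere, where a real caller might pass a `ps` command line instead.

```python
import os
import subprocess
import sys


def run_in_c_locale(cmd):
    """Run cmd with LC_ALL=C so its behaviour does not depend on the caller's locale."""
    env = dict(os.environ, LC_ALL="C")
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


# The child simply reports the locale setting it was started with.
child = run_in_c_locale(
    [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]
)
child_locale = child.stdout.strip()
```

The parent and the child now run under different locales, so any locale-dependent name comparison could give the two processes different answers for the same user.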

Case-insensitive ext4

Posted Mar 27, 2019 19:11 UTC (Wed) by marcH (subscriber, #57642) [Link]

Table of Contents of https://www.b-list.org/weblog/2018/nov/26/case/
"Truths programmers should know about case"
> - There are more than two cases
> - There’s more than one way to determine case
> - You can’t tell a character’s case from looking at it (or from its name)
> - Some characters have no case
> - Some characters may appear to have multiple cases
> - Case is context-sensitive
> - Case is locale-sensitive
> - Case-insensitive comparison requires case folding
> - Enough for now "still not exhaustive on its topic"

Case-insensitive ext4

Posted Mar 27, 2019 19:15 UTC (Wed) by juliank (guest, #45896) [Link] (3 responses)

Meh. Can just treat I = ı = İ = i

Case-insensitive ext4

Posted Mar 27, 2019 19:39 UTC (Wed) by mpr22 (subscriber, #60784) [Link] (1 responses)

No, you can't.

Because people get murdered when you do that.

Case-insensitive ext4

Posted Mar 27, 2019 20:46 UTC (Wed) by dvdeug (subscriber, #10998) [Link]

Top of any programmer's things to know about names list is that they're not unique. Someone who takes substantive action based on whether or not a name matches is wholly responsible for their own screwups.

Case-insensitive ext4

Posted Mar 28, 2019 10:42 UTC (Thu) by hkario (subscriber, #94864) [Link]

you can treat them as the same to the same degree as you can consider "y" to be equivalent to "i", or "oo" to "u"

Just because you grew up with an alphabet that has 26 letters doesn't mean that it's the only alphabet in use.

Case-insensitive ext4

Posted Mar 27, 2019 19:30 UTC (Wed) by marcH (subscriber, #57642) [Link] (4 responses)

> lower case "I" (India) in Turkish locale is a letter "ı" (dot-less i). And no, the "I" in Turkish is not any different than the "I" in English, German or Polish, it's the same Unicode codepoint.

For some strange reason many (most?) French people think the upper case of:
- é, à, ü, î...
are without accents like:
- E, A, U, I,...
but if you look at any half-professional book or [online] newspaper you'll find:
- É, À, Ü, Î,...

Windows keyboard for France (!= for French) makes it incredibly hard to enter the correct ones.

The spell checker in Microsoft Word has a setting letting you decide which one you think is correct:
https://www.pcastuces.com/pratique/astuces/1718.htm

Case-insensitive ext4

Posted Mar 28, 2019 7:32 UTC (Thu) by andrewsh (subscriber, #71043) [Link] (1 responses)

That’s why upper-case in documents should be a text style, not ACTUAL capitals in the text.

Case-insensitive ext4

Posted Mar 28, 2019 11:05 UTC (Thu) by hkario (subscriber, #94864) [Link]

Good luck writing about the German ß, that can be upper-cased to either SS (the old convention) or ẞ (the new convention).
But when you down-case SS, you do it to ss.
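Python's built-in string methods follow Unicode's default mappings and show the asymmetry directly. This is only a demonstration of those defaults, not a statement about what any filesystem does.

```python
sharp_s = "\u00df"          # ß
capital_sharp_s = "\u1e9e"  # ẞ

upper_of_sharp_s = sharp_s.upper()          # 'SS' under the default full mapping
round_trip = upper_of_sharp_s.lower()       # 'ss', not 'ß'
lower_of_capital = capital_sharp_s.lower()  # 'ß'

# casefold() is the operation meant for caseless matching; it maps all
# three spellings of the same word to one key.
keys = {s.casefold() for s in ("stra\u00dfe", "STRASSE", "STRA\u1e9eE")}
```

Upper-casing and then lower-casing does not return the original string, which is why comparisons use a one-way fold rather than a case conversion.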

Case-insensitive ext4

Posted Mar 28, 2019 8:47 UTC (Thu) by nim-nim (subscriber, #34454) [Link]

> For some strange reason many (most?) French people think the upper case are without accents

People are just used to broken systems (broken keyboards on typewriters and computers, broken apps, in legacy typesetting "someone broke the small bit of lead corresponding to the diacritic on the capitalized letter"). Human habits are a huge source of inertia. When Microsoft finally got around to fixing Office for French, some Microsoft clients actually complained that it was now correcting words to the correct spelling.

Proper typesetting shops take care to type correct French (typesetting apps correct the Windows user breakage), and Linux diverged long ago from the "official" AZERTY layout to make uppercase with diacritics easy to type (French Canadians were smarter: they fixed their official layout to put caps with diacritics prominently on them. Pity you can't buy Canadian French keyboards easily in France).

Case-insensitive ext4

Posted Mar 28, 2019 9:01 UTC (Thu) by nilsmeyer (guest, #122604) [Link]

Maybe that's just a matter of convenience? I don't know about keyboards for France, when I want to write É I have to hit three keys in a specific order.

Case-insensitive ext4

Posted Mar 27, 2019 20:30 UTC (Wed) by flussence (guest, #85566) [Link] (3 responses)

> Either the filesystem can reject any file name that is invalid UTF-8 (and fix any that are found on the disk) or to simply treat an invalid UTF-8 file name as it would be today, so there would be no case-folding or normalization.

Maybe having it reject by default, if only for a while, will prompt people to fix the tools generating invalid UTF-8 filenames in the first place. /usr/bin/zip is notorious for this; I've started using 7zip to extract .zip files because it gets it right.
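The two policies in the quoted passage can be sketched in a few lines of Python. `is_valid_utf8` and `lookup_key` are hypothetical names, and the fold used here (case-fold, then NFC) is a simplification of whatever a filesystem would actually implement.

```python
import unicodedata


def is_valid_utf8(raw_name: bytes) -> bool:
    """Return True if raw_name decodes as strict UTF-8."""
    try:
        raw_name.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


def lookup_key(raw_name: bytes) -> bytes:
    """Second policy: fold valid UTF-8 names, compare invalid ones byte-for-byte."""
    if not is_valid_utf8(raw_name):
        return raw_name
    folded = unicodedata.normalize("NFC", raw_name.decode("utf-8").casefold())
    return folded.encode("utf-8")


utf8_name = "caf\u00e9.txt".encode("utf-8")      # b'caf\xc3\xa9.txt'
latin1_name = "caf\u00e9.txt".encode("latin-1")  # b'caf\xe9.txt', not valid UTF-8
```

Under the first policy, a create call would simply fail whenever `is_valid_utf8` returns False, which is the behaviour that would surface the broken tools.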

Case-insensitive ext4

Posted Apr 5, 2019 20:01 UTC (Fri) by Wol (subscriber, #4433) [Link] (2 responses)

I'd be inclined to copy the Windows trick, actually, where they have two file names.

Seeing as it seems it can only be activated on an empty directory, the i-nodes would have to have space for two entries - the canonical form and the user form. Any file-system access converts to canonical and searches on both, but a collision on the canonical name will take the appropriate behaviour.

That might also provide a way for switching a used directory to case-insensitive - the canonical field would start out empty, but writing to the directory would do a canonical search and return an error if there was a collision. Be a bit messy though possibly.

Cheers,
Wol
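As a rough model of the two-name idea, here is a toy in-memory directory in Python. Everything in it is an assumption for illustration: the `Directory` class, the choice of NFC plus case folding as the canonical form, and the simplification of searching only on the canonical key rather than on both forms.

```python
import unicodedata


def canonical(name: str) -> str:
    """Assumed canonical form: NFC-normalized, then case-folded."""
    return unicodedata.normalize("NFC", name).casefold()


class Directory:
    """Toy directory that keeps a canonical key alongside the user's spelling."""

    def __init__(self):
        self._entries = {}  # canonical form -> user form

    def create(self, name: str) -> None:
        key = canonical(name)
        if key in self._entries:
            raise FileExistsError(f"{name!r} collides with {self._entries[key]!r}")
        self._entries[key] = name

    def lookup(self, name: str) -> str:
        """Return the user form stored for name; raises KeyError if absent."""
        return self._entries[canonical(name)]


d = Directory()
d.create("ReadMe.txt")
found = d.lookup("README.TXT")  # the user's original spelling is preserved

try:
    d.create("readme.TXT")  # same canonical name, so this is a collision
    collided = False
except FileExistsError:
    collided = True
```

Switching an already-populated directory over would amount to filling in the canonical keys lazily and reporting any collision found along the way, as the comment suggests.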

Case-insensitive ext4

Posted Apr 8, 2019 6:57 UTC (Mon) by lkundrak (subscriber, #43452) [Link]

> I'd be inclined to copy the Windows trick, actually, where they have two file names.

Wasn't their patent on a file having two names what the TomTom lawsuit was about?

Case-insensitive ext4

Posted Apr 8, 2019 19:38 UTC (Mon) by nix (subscriber, #2304) [Link]

inodes have no name, and we already have a mechanism for one file having as many names as you like: hard links. I'm sure this can be repurposed somehow...

Case-insensitive ext4

Posted Mar 28, 2019 11:20 UTC (Thu) by mjthayer (guest, #39183) [Link] (1 responses)

To my mind part of the problem is the expectation that there is a clean solution to the problem. Human writing has been in existence and evolution for thousands of years, and not as a single consistent entity at that. Even Latin script is split into many many sub-scripts (the Turkish "i" is not an electronic invention). Computer encodings, locales and filesystems have not been around that long, but have also picked up more than enough legacy which we can't just wish away. And we want file names to solve a range of problems, some of which are more relevant to users and some to developers.

Having said that, having file systems encode file names as byte strings and having mechanisms to query those uninterpreted or case-insensitive (or whatever) as processes require seems to me a reasonable squaring of the circle.

Case-insensitive ext4

Posted Mar 28, 2019 20:40 UTC (Thu) by k8to (guest, #15413) [Link]

I agree with most of that.

However, I'm not sure that allowing people to create both Foo and foo and then having applications use an interface that "asks for foo" in an insensitive fashion is going to produce a lot of happiness.

Case-insensitive ext4

Posted Mar 29, 2019 22:41 UTC (Fri) by mirabilos (subscriber, #84359) [Link]

The whole idea is crazy.

Additionally, did someone ask the Turkish people? (PHP fucked them up, because the word “function” contains a dotted i, and PHP insists on being case-insensitive…)

There, I ≠ i because they have I ↔ ı and i ↔ İ so you need locales.

This is the ultimate proof that case-insensitivity cannot (and therefore MUST NOT) be done on the filesystem level.

Case-insensitive ext4

Posted Mar 30, 2019 22:28 UTC (Sat) by jthill (subscriber, #56558) [Link] (1 responses)

Seriously, I think the only even halfway-respectable choice here is to encode the specific locale and case sensitivity with each path component. If that means having a filesystem (probably also system) catalog of encountered locales and a locale index on each directory entry, so be it. Case folding differs by locale, and locale can be either an intrinsic text attribute or a matter of user interpretation. Punting it to system administration is just begging for botchery, it leaves you with third-party implementations of arbitrary choices affecting the results of fundamental operations. If I think my `README` is case-insensitive and my `Makefile` is case-sensitive but `*.exe` should be case-insensitive, well, sadly enough that's a reasonably widely understood situation these days.
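One way to picture a directory entry that carries its own locale and case sensitivity is the Python sketch below. The `DirEntry` type, the plain locale tag standing in for an index into a locale catalog, and the two-letter Turkish table are all hypothetical and serve only to illustrate the proposal.

```python
from dataclasses import dataclass

# Assumption: a minimal stand-in for Turkish case tailoring.
_TURKISH = str.maketrans({"I": "\u0131", "\u0130": "i"})


def fold(name: str, locale: str) -> str:
    """Case-fold name according to the (very small) set of locales modelled here."""
    if locale == "tr":
        name = name.translate(_TURKISH)
    return name.casefold()


@dataclass(frozen=True)
class DirEntry:
    name: str
    locale: str           # would be an index into a locale catalog in the proposal
    case_sensitive: bool

    def matches(self, query: str) -> bool:
        if self.case_sensitive:
            return self.name == query
        return fold(self.name, self.locale) == fold(query, self.locale)


readme = DirEntry("README", "en", case_sensitive=False)
makefile = DirEntry("Makefile", "en", case_sensitive=True)
turkish_exe = DirEntry("KI\u015e.exe", "tr", case_sensitive=False)
```

Here `readme.matches("readme")` succeeds, `makefile.matches("makefile")` does not, and the Turkish entry matches the dotless spelling `"k\u0131\u015f.exe"` but not the dotted `"ki\u015f.exe"`.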

Case-insensitive ext4

Posted Apr 4, 2019 13:13 UTC (Thu) by bosyber (guest, #84963) [Link]

That might work for some cases, but I wonder how I'd like that.

I am Dutch, currently living in Germany, and often conversing in English too (like here; plus sometimes receiving Czech documents). None of those are really difficult cases, I think (SS/ss and ß notwithstanding), but they do differ in how characters need to be interpreted. While my usage is for a large part context/file dependent, there are overlaps in what I use when: sometimes in a single context, like in online chats, and sometimes stored in a single directory, due to where, by whom, and for what a document was created.

Not sure how what you propose can work correctly when generalized. Imagine I for some reason add Turkish in the mix too, or use a more complex, different set of languages in daily use.


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds