Case-insensitive ext4

Posted Mar 28, 2019 8:39 UTC (Thu) by nim-nim (subscriber, #34454)
In reply to: Case-insensitive ext4 by clugstj
Parent article: Case-insensitive ext4

Any shared filesystem (network filesystem, removable media) will soon become completely unusable if different systems write on it with different default encodings. Treating filenames as opaque bunches of bytes does not work because you need to convey filenames to humans at some point. Humans do not understand raw bytes they understand decoded bytes, and that requires knowledge of the encoding used in filenames.

So any shared filesystem will need to export to userspace the encoding used for each part of its tree (either a single encoding for everything, or separate encodings per subtree).

Casing is something else but once you get past the encoding point casing becomes a less harder to tackle.

Case-insensitive ext4

Posted Mar 28, 2019 15:58 UTC (Thu) by nybble41 (subscriber, #55106) [Link] (1 responses)

> Casing is something else but once you get past the encoding point casing becomes a less harder to tackle.

Not much less. Casing rules depend not just on encoding but also locale, and while it may be practical to enforce a single universal encoding and normalization scheme you're definitely not going to get away with enforcing a single universal locale.

The logical way to handle normalization is to simply disallow non-normalized filenames. The kernel doesn't change the encoding or compare different normal forms, it just verifies that the names of new files are in a particular normal form and returns an error if they aren't. Since all names are already in the same normal form comparisons reduce to exact binary matches. The equivalent for case would be to disallow either lowercase or uppercase characters in filenames (assuming you could even clearly define what is "uppercase" or "lowercase"—it depends on the locale). People put up with that in the DOS era but I don't think it would be considered acceptable today.

The odds that encoding or normalization would be permitted to vary per-filesystem or per-subtree are negligible. Applications aren't prepared to deal with that, nor should they be expected to do so. Any conversions needed for shared filesystems should be handled at the lowest layers of the filesystem, between the storage or network and the kernel.

Case-insensitive ext4

Posted Mar 29, 2019 10:52 UTC (Fri) by nim-nim (subscriber, #34454) [Link]

Assuming the lowest layers of the stack could handle conversions transparently (which I'm doubtful of, that would require low-level knowledge about every possible encoding variation on earth), you still need to know the encoding(s) you start with. Meaning, you have to put at least one pivot encoding definition inside your filesystem.

That's the part people object to, because they are used to the simplicity of pushing encoding problems somewhere else, with "filenames are streams of bytes". Which was not true even for original UNIX. Actual original Unix filename bytes were 7bit ASCII bytes and nothing else.

But 7bit ASCII is useless in a modern i18n world. So you need to record other pivot encoding(s) in filesystems¹.

¹ Record, not reproduce the mistake of original UNIX, that assumed there was a single encoding that would never evolve so there was no need to make it explicit; easy mistake to made in the simpler computer age they lived in; inexcusable mistake to make today.