
Report from the Python Language Summit

By Jake Edge
April 14, 2015

PyCon 2015

The Python Language Summit is an annual event that is held in conjunction with the North American edition of PyCon. Its mission is to bring together core developers of various Python implementations to discuss topics of interest within that group. The 2015 meeting was held April 8 in Montréal, Canada. I was happy to be invited to attend the summit so that I could bring readers a report on the discussions there.

[Larry Hastings & Barry Warsaw]

The summit was deemed the "Barry and Larry show" by some, since it was co-chaired by Barry Warsaw and Larry Hastings (seen at right in their stylish fezzes). Somewhere around 50 developers sat in on the talks, which focused on a number of interesting topics, including atomicity guarantees for Python operations, possible plans to make Python 3 more attractive to developers, infrastructure changes for development, better measurement for Python 3 adoption, the /usr/bin/python symbolic link, type hints, and more.

[Group photo]

Index entries for this article:
Conference: PyCon/2015
Python: Python Language Summit



Report from the Python Language Summit

Posted Apr 15, 2015 0:19 UTC (Wed) by smoogen (subscriber, #97) [Link] (53 responses)

After reading through the articles, it would seem to me that maybe Python should look at doing a 4.0 tree where it works on getting rid of the GIL and reworking garbage collection, since those changes would require a lot of BREAK-THE-WORLD alterations to the core Python model that the other Pythons would need to work on without breaking the current Python-2.x and Python-3.x worlds.

And yes this is probably an incredibly stupid and foolish thing to suggest. I just don't see an easier path due to the complexity and deep fundamental changes required.

Report from the Python Language Summit

Posted Apr 15, 2015 0:23 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (52 responses)

And since pretty much everybody either ignores Python3 or writes code that is compatible with both versions, the total disruption would be minimal.

Report from the Python Language Summit

Posted Apr 15, 2015 12:56 UTC (Wed) by nix (subscriber, #2304) [Link] (51 responses)

I'm actually doing Python 3 stuff for the first time and am finding the bytes / char distinction *really useful*. Everyone else seems to hate it, but given that I'm emitting output to devices that don't know Unicode but *do* know a variety of weird old codecs like CP437, I have to be careful with encodings anyway. Python 3 doesn't let me get it wrong, which has caught numerous bugs already.
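
A minimal sketch of that boundary discipline: keep text as str inside the program and encode only at the point where it leaves for the device. The sample string, device and fallback policy here are illustrative assumptions, not anything from the comment above.

    # Internal representation stays str (Unicode); encode at the output boundary.
    text = "Grüße ░▒▓"                                 # example text, including CP437 box glyphs
    payload = text.encode("cp437", errors="replace")   # bytes for the legacy display (fallback policy is an assumption)
    # Writing the str itself to a binary device interface would typically raise
    # a TypeError -- exactly the class of bug Python 3 refuses to let through silently.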

Report from the Python Language Summit

Posted Apr 15, 2015 14:22 UTC (Wed) by jezuch (subscriber, #52988) [Link] (12 responses)

> am finding the bytes / char distinction *really useful*.

Frankly, this usefulness is rather obvious to anyone living in non-ASCII world. When you are using ą, ę, ź, ć, etc. daily (and it's still just a variant of the Latin script), you learn pretty quickly about character encodings. The hard way, usually ;)

Report from the Python Language Summit

Posted Apr 15, 2015 15:01 UTC (Wed) by cesarb (subscriber, #6266) [Link]

> Frankly, this usefulness is rather obvious to anyone living in non-ASCII world. When you are using ą, ę, ź, ć, etc. daily (and it's still just a variant of the Latin script), you learn pretty quickly about character encodings.

It's even more useful when diacritics are rare. If you use diacritics all the time, encode/decode bugs are found quickly. If you don't, encode/decode bugs will remain hidden, until one day they blow up as one of the dreaded Unicode encode/decode exceptions (or worse, show up as mojibake).
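
A tiny illustration of how Python 3 surfaces the mix-up immediately rather than letting it hide until the first non-ASCII byte arrives (the strings are arbitrary examples):

    greeting = "café"             # str (text)
    data = b" au lait"            # bytes (raw data)
    combined = greeting + data    # raises TypeError in Python 3
    # Python 2 would instead decode the bytes implicitly as ASCII, which works
    # until non-ASCII data shows up -- the hidden-bug scenario described above.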

Report from the Python Language Summit

Posted Apr 15, 2015 20:25 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (10 responses)

I live in a non-ASCII world. My keyboard has four layouts and I routinely use languages with non-Latin (or Latin-with-diacritics) scripts.

Yet I find the whole char/byte distinction to be extremely moronic.

Report from the Python Language Summit

Posted Apr 16, 2015 11:34 UTC (Thu) by HelloWorld (guest, #56129) [Link] (5 responses)

> Yet I find the whole char/byte distinction to be extremely moronic.
Yes, you have stated that many times, and the response is still the same: you're confused. Bytes aren't characters and characters aren't bytes, period. It's as simple as that.

Report from the Python Language Summit

Posted Apr 16, 2015 18:11 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

So how many characters are in composite symbols?

The only sane and modern way to do Unicode is UTF-8.

Report from the Python Language Summit

Posted Apr 16, 2015 19:27 UTC (Thu) by HelloWorld (guest, #56129) [Link] (1 responses)

> So how many characters are in composite symbols?
What does that have to do with the fact that bytes are not text/strings/characters?

> The only sane and modern way to do Unicode is UTF-8.
Regardless of whether this is true or not, there is a lot of data in all kinds of encodings, and developers had better think about which one they are going to use when reading that data.

Report from the Python Language Summit

Posted Apr 16, 2015 22:40 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

> What does that have to do with the fact that bytes are not text/strings/characters?
The fact that sequences of UCS-4 codepoints are also not text/strings/characters, just as sequences of raw bytes aren't.

> Regardless of whether this is true or not, there is a lot of data in all kinds of encodings, and developers had better think about which one they are going to use when reading that data.
Python3 practically forces one to transcode data from one format to another all the time for no specific reason.

Report from the Python Language Summit

Posted Apr 20, 2015 10:38 UTC (Mon) by niner (subscriber, #26151) [Link]

"So how many characters are in composite symbols?"

Characters? I'd say one. I can definitely say (as far as I understand this anyway) that it's one grapheme and one or more code points.

Perl 6 will deal with strings as sequences of graphemes, using Normal Form Grapheme (NFG). There's a very interesting blog post about what this means:

https://6guts.wordpress.com/2015/04/12/this-week-unicode-...

I guess the only two sane ways of handling Unicode are:
* be completely agnostic and treat strings as opaque sequences of bytes, or
* go all in and work with graphemes whenever possible.
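
As a small standard-library-only sketch of the grapheme/code-point distinction (the example character is arbitrary):

    import unicodedata

    s = "g\u0308"                            # 'g' + COMBINING DIAERESIS: one grapheme, two code points
    print(len(s))                            # 2 -- len() counts code points, not graphemes
    print(unicodedata.combining("\u0308"))   # 230 -- a combining mark, not a standalone character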

Report from the Python Language Summit

Posted Apr 21, 2015 22:26 UTC (Tue) by nix (subscriber, #2304) [Link]

I'll just tell all the existing systems out there to use UTF-8, even if they don't. I'm sure I can find a way to jam all of Unicode onto the Adafruit-based display board my Python code is talking to: it has a whole 64K of flash and 2K of RAM! I'm sure I can fit glyphs for all of Unicode in there and still have space for everything else it has to do!

No, not everything can use UTF-8, even in an ideal world. And such systems will *always* use different encodings, so to talk to them Python's enforced conversion is extremely valuable. And even when you're not, and the system you are talking to uses UTF-8 or some other Unicode variant, the enforced conversion is *still* valuable, because it forces you to think about what encoding is in use -- and amazingly often it's not straight UTF-8, or it's UTF-8 with extra requirements such as needing to be canonicalized or decanonicalized in a particular way, or "oops, we didn't say, but experimentation makes it clear that $strange_canonicalization is the only way to go". (I have seen all of these on real systems, along with people claiming UTF-8 but meaning UTF-16 because they didn't know there was a difference, and vice versa -- and, in the latter cases, cursed them.)

Report from the Python Language Summit

Posted Apr 16, 2015 21:28 UTC (Thu) by flussence (guest, #85566) [Link] (3 responses)

>Yet I find the whole char/byte distinction to be extremely moronic.

It is, but I'd say because nobody has a clue how to define "char". It can mean all sorts of things depending on where it's used:

* 1 byte in a legacy encoding (or C)
* If you're using a half-baked library, it's 2 or 4 bytes of UTF-16. (Qt4 falls firmly into this category, as it will let you backspace over half an emoji character.)
* If you're lucky, someone actually implemented Unicode correctly and a "character" is a variable-length sequence of bytes encoding a full ISO-10646 code point. Such as U+00C7, which is the character "Ç". Or U+0327, which is a squiggly line and definitely not a character.
* The only sensible and correct definition: a character is the thing you would write by hand ("Ḉ".length == "Ḉ".length == "Ḉ".length, where the three strings are the same handwritten character composed from different combinations of code points); almost no software uses this definition.
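
Since rendering tends to hide the difference, here is a sketch of what that comparison is getting at, with the compositions spelled out explicitly (reconstructed for illustration; the exact forms in the comment above may have been normalized in transit):

    import unicodedata

    # The same handwritten character, assembled from one, two, or three code points.
    one   = "\u1E08"               # precomposed C with cedilla and acute
    two   = "\u00C7\u0301"         # Ç + combining acute
    three = "C\u0327\u0301"        # C + combining cedilla + combining acute

    print(len(one), len(two), len(three))    # 1 2 3 -- code-point counts disagree
    print(unicodedata.normalize("NFC", one) ==
          unicodedata.normalize("NFC", two) ==
          unicodedata.normalize("NFC", three))   # True -- canonically the same character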

Report from the Python Language Summit

Posted Apr 16, 2015 22:38 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> It is, but I'd say because nobody has a clue how to define "char". It can mean all sorts of things depending on where it's used
Correct. And complex scripts or complex characters make it even more complicated.

That's why I violently oppose the definition: "UCS-4 codepoint is a character or GTFO", which Python3 tries to enforce.

Report from the Python Language Summit

Posted Apr 21, 2015 22:31 UTC (Tue) by nix (subscriber, #2304) [Link] (1 responses)

Python says that its internal encoding is Unicode, but the nice thing about the enforced mapping to bytes to get it out anywhere else is that good code need not rely on this at all. As long as you always transcode to/from bytes when leaving the Python world, you can *completely ignore* what internal encoding the thing is using (or, rather, use the Unicode stuff like properties etc. as needed inside your code, in the happy knowledge that this will not affect transfer to external systems at all, no matter what encoding they use, as long as a codec exists -- and writing codecs is *really* not hard, at least not if you don't care much about performance, e.g. for one-off mappings where no codec exists in Python yet :) ).
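
To back up the "writing codecs is not hard" claim, here is a minimal, performance-naive sketch of registering one; the mapping table and the codec name "mydevice" are invented for the example and do not correspond to any real charset:

    import codecs

    # Hypothetical one-off mapping: ASCII plus two extra glyphs.
    DECODE_TABLE = {0x80: "ą", 0x81: "ę"}
    ENCODE_TABLE = {v: k for k, v in DECODE_TABLE.items()}

    def _encode(text, errors="strict"):
        out = bytearray()
        for i, ch in enumerate(text):
            if ord(ch) < 0x80:
                out.append(ord(ch))
            elif ch in ENCODE_TABLE:
                out.append(ENCODE_TABLE[ch])
            else:
                raise UnicodeEncodeError("mydevice", text, i, i + 1, "unmapped character")
        return bytes(out), len(text)

    def _decode(data, errors="strict"):
        chars = [chr(b) if b < 0x80 else DECODE_TABLE.get(b, "\ufffd") for b in bytes(data)]
        return "".join(chars), len(data)

    def _search(name):
        return codecs.CodecInfo(_encode, _decode, name="mydevice") if name == "mydevice" else None

    codecs.register(_search)

    print("hąndle".encode("mydevice"))       # b'h\x80ndle'
    print(b"h\x80ndle".decode("mydevice"))   # 'hąndle'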

Report from the Python Language Summit

Posted Apr 21, 2015 23:06 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

A couple of times I had to resort to hacks like putting raw bytes into the LSB of UCS-4 codepoints.

I still think that treating strings as sequences of UTF-8 characters and/or bytes is the best possible way. Enforced UCS-4 rarely helps.

bytes vs. characters

Posted Apr 15, 2015 15:21 UTC (Wed) by david.a.wheeler (subscriber, #72896) [Link] (37 responses)

When you can be certain that all your input is perfectly formatted, the Python 3 string model is a good one. But the world isn't perfect. In particular, data sources routinely lie about their encoding, and Python 3 interferes with handling the real world instead of helping with it. For example, often there is no single encoding; many sources are a mishmash of UTF-8 and Windows-1252 and maybe some other stuff in a single file. What, exactly, is the encoding format of stdin? The answer is: there isn't one. What's the encoding format of filenames on Linux and Unix? There isn't one (they hacked around filenames, but failed to hack around ALL data sources, even though they all have this problem).

The "Unicode dammit" library helps. A little. But I find myself unable to find a reason to use Python 3, and I can find a long list of reasons to use Python 2 or some other language instead. I think I am not alone.

bytes vs. characters

Posted Apr 15, 2015 18:22 UTC (Wed) by njs (guest, #40338) [Link]

> In particular, data sources routinely lie about their encoding, and Python 3 interferes with handling the real world instead of helping with it.

I'm curious if you could elaborate on what interference you're thinking of? I don't have a dog in the fight or anything, but my experience with py3 has been pretty pleasant so far, and I don't see off the top of my head how py3 could do worse than py2 in this case. It seems like at worst you would end up writing the same code in both cases to treat the data as bytes, try different encodings, or whatever you want, with the main difference that in py3 at least you don't have to deal with random functions deciding to help out by spontaneously encoding/decoding with some random codec. Or, depending on what you're doing, surrogate-escape could be pretty useful too, and that's a py3 feature.
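
A hedged sketch of the "treat it as bytes, try encodings, fall back to surrogate-escape" approach described above; the helper name and the list of candidate encodings are assumptions for illustration:

    def decode_best_effort(data: bytes) -> str:
        # Try the encodings we expect, in order (the list is an assumption);
        # fall back to a lossless surrogateescape decode so no bytes are lost.
        for enc in ("utf-8", "windows-1252"):
            try:
                return data.decode(enc)
            except UnicodeDecodeError:
                continue
        return data.decode("utf-8", "surrogateescape")

    # The surrogateescape fallback round-trips exactly:
    raw = b"caf\xe9 \xff"
    text = raw.decode("utf-8", "surrogateescape")
    assert text.encode("utf-8", "surrogateescape") == raw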

bytes vs. characters

Posted Apr 15, 2015 18:35 UTC (Wed) by HelloWorld (guest, #56129) [Link] (2 responses)

> But the world isn't perfect. In particular, data sources routinely lie about their encoding, and Python 3 interferes with handling the real world instead of helping with it.
The world is a messy place because we make it one. It's the idiotic “be liberal in what you accept” doctrine that led us here, and the only way out is to not create more crap that tries to cope with bad input in “helpful” ways rather than simply rejecting bad input.

bytes vs. characters

Posted Apr 15, 2015 21:28 UTC (Wed) by tpo (subscriber, #25713) [Link] (1 responses)

> The world is a messy place because we make it one. It's the idiotic “be liberal in what you accept” doctrine that led us here, and the only way out is to not create more crap that tries to cope with bad input in “helpful” ways rather than simply rejecting bad input.

I think "Truth" and "right" are attributes of the powerful. Whereas "be liberal in what you accept" is an expression of humility, of the wish to serve.

I can see the point of standing up for a cause. But the cause probably must not be self-serving if it is to legitimize the use of the force of refusal.

Maybe.

Remember what consequences "being right" had during the browser wars? Or how "being right" is constructing walled gardens today?

bytes vs. characters

Posted Apr 16, 2015 13:22 UTC (Thu) by smitty_one_each (subscriber, #28989) [Link]

> "be liberal in what you accept" is an expression of humility, of the wish to serve.

Excess rigidity is the key to maintaining a negligible user base.

bytes vs. characters

Posted Apr 16, 2015 21:33 UTC (Thu) by zyga (subscriber, #81533) [Link] (12 responses)

If you have mixed encoding just frelling use BYTES as that's what you are reading anyway. Bytes. Use bytes and be happy.

Don't say python3 is not practical for the real world. It's like complaining that python has an int type and a string type and you must use the confusing concept of picking the right one at the right time, while perl is so much better because it doesn't put this confusing non-real-life problem in front of you. That is totally missing the point.

People that stay stuck in python2 due to migration complexity are not the problem. They will eventually move. People that think the python2 model of binary soup is somehow superior, or that you cannot achieve the same thing in python3, need to get a clue.

bytes vs. characters

Posted Apr 16, 2015 22:36 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (11 responses)

> If you have mixed encoding just frelling use BYTES as that's what you are reading anyway. Bytes. Use bytes and be happy.
Except that you can't do it.

For example, the JSON decoder in Python3 _insists_ on decoding strings as strings, even if they contain invalid UTF-8 data. It's bad, but such services do exist out there and sometimes you have to work with them.

Ditto for HTTP headers.

bytes vs. characters

Posted Apr 16, 2015 23:15 UTC (Thu) by dbaker (guest, #89236) [Link] (1 responses)

> For example, JSON decoder in Python3 _insists_ on decoding strings as strings. Even if they have invalid UTF-8 data. It's bad, but such services do exist out there and sometimes you have to work with them.

Because the JSON spec requires that strings must be unicode?

Can't you just write a custom object_hook function to pass to the decoder to solve your problem?

bytes vs. characters

Posted Apr 16, 2015 23:19 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

> Because the JSON spec requires that strings must be unicode?
Yes, but the reality outside is a little bit different.

> Can't you just write a custom object_hook function to pass to the decoder to solve your problem?
No, the 'encoding' parameter is ignored by json.loads, and everything else already gets decoded strings.

We simply switched to a third-party library instead.

bytes vs. characters

Posted Apr 17, 2015 7:15 UTC (Fri) by zyga (subscriber, #81533) [Link] (1 responses)

You can decode("UTF-8", "ignore") or something else to "coerce" it to some form of text though I really do value the sanity of that. Just fix your data sources. Even if you use some 3rd party library it's not going to make any of that "json"-like thing work with other libraries (I assume that other customers/APIs need to read it).

HTTP headers are a perfect example of binary data. Handling them as unicode text is broken IMHO. You can just use byte processing for everything there, and Python 3.4, AFAIR, fixed some of the last gripes about the lack of formatting support for edge cases like that.
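
A small sketch of what byte processing of a header looks like in practice; the header line is made up and this is not meant as a real HTTP parser:

    # Header lines can be split and matched as bytes; nothing needs decoding
    # until a human actually has to read a value.
    line = b"Content-Disposition: attachment; filename=\xe9t\xe9.txt\r\n"
    name, _, value = line.rstrip(b"\r\n").partition(b": ")
    if name.lower() == b"content-disposition":
        print(value)    # b'attachment; filename=\xe9t\xe9.txt' -- still bytes, nothing lost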

bytes vs. characters

Posted Apr 17, 2015 7:22 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

> You can decode("UTF-8", "ignore") or something else to "coerce" it to some form of text though I really do value the sanity of that.
I think it will still be broken. There's a workaround that simply stores binary bytes in the lower byte of UCS-4 codepoints and it sorta works.

I'd love to fix these data sources, but they're out of my control. The vendor knows about it and plans to base64-encode binary data in the future, but for now I have to work with what I have.

> You can just use byte processing for everything there and Python 3.4
Not exactly. Most of the standard library can be used with byte sequences, but third-party libraries are often too careless.

I've fixed tons of code like this:

    def blah(p):
        if fail_to_do_something(p):
            raise SomeException(u"Failed to frobnicate %s!" % p)

It mostly works as is, but occasionally it doesn't.

bytes vs. characters

Posted Apr 17, 2015 14:35 UTC (Fri) by intgr (subscriber, #39733) [Link] (6 responses)

How is it a problem with Python that it insists that data conform to a specification?

How do you think it should behave? Always return bytes? No, JSON specifies Unicode. Try Unicode first but fall back to bytes? No, that seems like it would be very surprising behavior and cause more bugs than it prevents. AFAICT the alternatives are far worse.

The situation you're in is *caused* by implementations being too lenient and accepting junk as input, which is why the vendor has not noticed this issue before.

bytes vs. characters

Posted Apr 18, 2015 11:03 UTC (Sat) by mathstuf (subscriber, #69389) [Link] (5 responses)

So how do you put raw bytes into JSON then? How do you reasonably deal with people putting Latin-1 into comments on a website accessible via an API using JSON? Do you use an array of codepoints? Bytes? Personally, I like Python 2's json module, which gives you a unicode object for UTF-8 and str for anything else (or pure ASCII). The problems there, however, come from other places, such as subprocess.Popen.communicate choking on non-ASCII (so the unicode object needs to be encoded, but Latin-1 needs to be cast to bytes… which is the same as str, but still required for some reason), unicode strings that can't be formatted with other unicode objects while str can be (WTF logic is that?), and other pitfalls (of course, hidden until runtime, lucky me). It would be nice to move to Python 3, which seems like it fixes these problems, but this makes it sound like they just moved the ball around in some kind of shell game.
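
For readers who have not hit it, a sketch of the kind of Python 2 runtime-only failure being described (the strings are invented; the behaviour shown is Python 2's implicit ASCII coercion):

    name = u"r\xe9sum\xe9"            # unicode object
    ok = "file: %s" % name            # ASCII-only str template: silently promoted to unicode
    bad = "fichier \xe9: %s" % name   # str template containing a Latin-1 byte:
                                      # UnicodeDecodeError ('ascii' codec can't decode byte 0xe9),
                                      # hidden until this line actually runs.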

bytes vs. characters

Posted Apr 20, 2015 7:59 UTC (Mon) by zyga (subscriber, #81533) [Link] (4 responses)

You don't. There's no specification for putting random bytes in json.

At this time you just generate a stream of bytes (not python bytes, just bytes) that has some meaning that is only sensible to you and whoever consumes your byte stream. It's not json.

bytes vs. characters

Posted Apr 20, 2015 11:43 UTC (Mon) by Jonno (subscriber, #49613) [Link] (3 responses)

> You don't. There's no specification for putting random bytes in json.
Actually there is: it is called an "array of numbers", i.e. [ 97, 114, 114, 97, 121, 32, 111, 102, 32, 110, 117, 109, 98, 101, 114, 115, 0 ]
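
The round trip for that representation is short in Python 3; a minimal sketch (the payload is arbitrary):

    import json

    blob = b"\x00\xff\xfe raw bytes"          # arbitrary binary payload
    wire = json.dumps(list(blob))             # '[0, 255, 254, 32, ...]' -- plain JSON numbers
    assert bytes(json.loads(wire)) == blob    # bytes() rebuilds the payload from the ints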

bytes vs. characters

Posted Apr 20, 2015 12:21 UTC (Mon) by bcopeland (subscriber, #51750) [Link]

For those, like me, who have a defect where they cannot resist converting ascii-valued hex or decimal numbers into their chr() equivalent, it all checks out.

bytes vs. characters

Posted Apr 20, 2015 13:19 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (1 responses)

The problem is that "no" implementations do this :(. Ruby's libraries (or at least whatever GitLab uses) will happily put non-UTF-8 strings in the JSON they export, and Python2 will happily import them as str objects. There is *lots* of code that would need changing for stricter parsers/generators.

bytes vs. characters

Posted Apr 21, 2015 4:12 UTC (Tue) by lsl (subscriber, #86508) [Link]

Go's JSON package does use those number arrays when appropriate.

bytes vs. characters

Posted Apr 17, 2015 18:27 UTC (Fri) by marcH (subscriber, #57642) [Link] (19 responses)

> For example, often there is no single encoding; many sources are a mishmash of UTF-8 and Windows-1252 and maybe some other stuff in a single file.

"Many"... how many? Sure, it happens every time I throw the random cr*p stored on my hard drive at a quick and dirty script I just hacked together. I think it's a small price to pay for type safety whenever you and I write proper, reliable software.

If you have a source with an hopelessly entangled mix of UTF-8 and Windows-1252, and had the freedom to re-design Python3 (or whatever else), what sensible could you possibly do with it *anyway*? Genuine question.

bytes vs. characters

Posted Apr 23, 2015 15:27 UTC (Thu) by lopgok (guest, #43164) [Link] (18 responses)

I would like a simple way in python 3 to read the names of all the files in a directory. In python 3, it skips over some files, which I suspect have names that are not valid in the current codeset. In python 2, it just reads the names of all of the files.

I understand it is problematic to do string processing on oddly constructed strings, but it is mission-critical for me to be able to see all the files in a directory. If an exception were raised it would really suck, but it would suck less than silently skipping file names that it didn't understand.

That is the reason I have not migrated all of my development to python 3.

bytes vs. characters

Posted Apr 23, 2015 17:00 UTC (Thu) by cesarb (subscriber, #6266) [Link] (4 responses)

> I would like a simple way in python 3 to be able to read the names of all the files in a directory. In python 3, it skips over some files which I suspect are not in the current codespace. In python 2, it just reads the names of all of the files.

I just tested here, and the python3 in this machine returns all filenames in os.listdir('.'), even the one I created with an invalid UTF-8 encoding.

Skipping over some files was true in Python 3.0 (https://docs.python.org/3/whatsnew/3.0.html#text-vs-data-...):

"Note that when os.listdir() returns a list of strings, filenames that cannot be decoded properly are omitted rather than raising UnicodeError."

(The same paragraph mentions that you could still use os.listdir(b'.') to get all filenames as bytes, so even with Python 3.0 you already had a way to read the name of all the files.)

But that was probably changed in Python 3.1, when PEP 383 (https://www.python.org/dev/peps/pep-0383/) was implemented, since with it there are no "filenames that cannot be decoded properly".
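
A short sketch of the two interfaces described above, assuming a Unix-like filesystem (the directory is arbitrary):

    import os

    # bytes in, bytes out: every entry is returned, whatever its encoding.
    raw_names = os.listdir(b".")

    # str in, str out: since PEP 383, undecodable bytes come back as lone
    # surrogates rather than being dropped, and os.fsencode() recovers the
    # original bytes exactly.
    for name in os.listdir("."):
        assert os.fsencode(name) in raw_names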

bytes vs. characters

Posted Apr 23, 2015 22:09 UTC (Thu) by lopgok (guest, #43164) [Link] (3 responses)

It was still broken with python 3 when I tested it about 2 or 3 months ago; it was either python 3.3 or python 3.4.

I have a directory which is read just fine with python 2.7, but from which python 3 skips files.

bytes vs. characters

Posted Apr 24, 2015 11:44 UTC (Fri) by cesarb (subscriber, #6266) [Link] (2 responses)

Does it still skip files if you use the "bytes" interface (os.listdir(b'.'))?

I just took a quick look at the current Python source code for os.listdir (https://hg.python.org/cpython/file/151cab576cab/Modules/p...), and it only has code to skip the "." and ".." entries, as it's documented to do. In both the "str" and the "bytes" case, it adds every entry other than these two. For it to skip anything else on os.listdir, readdir() from glibc has to be skipping it, and it should affect more than just Python.

Or is the problem with something other than os.listdir?

bytes vs. characters

Posted Apr 24, 2015 14:18 UTC (Fri) by lopgok (guest, #43164) [Link] (1 responses)

It is os.listdir. I have not tried accessing it in binary yet.

I do find it odd that the OS can list the file and I can manipulate the file name on the command line, but because it has some odd characters in it, python silently skips over it.

bytes vs. characters

Posted May 9, 2015 21:34 UTC (Sat) by nix (subscriber, #2304) [Link]

Well, if you want to read something no matter what its encoding, you use bytes mode. That's what bytes mode is *for*. Python 3 is really very consistent here (unlike Python 2, for which you had to guess and hope.)

bytes vs. characters

Posted Apr 23, 2015 17:07 UTC (Thu) by marcH (subscriber, #57642) [Link] (12 responses)

> In python 3, it skips over some files which I suspect are not in the current codespace [...] but it is mission critical for me to be able to see all the files in a directory.

"Mission-critical" relying on filename garbage, mmm.... really? What kind of mission?

bytes vs. characters

Posted Apr 23, 2015 18:03 UTC (Thu) by zlynx (guest, #2285) [Link]

Silent failure is always a STUPID idea.

If Python3 really is silently ignoring invalid filenames then that should be marked as a critical flaw.

The real world is not a perfect place with perfectly encoded strings.

bytes vs. characters

Posted Apr 23, 2015 20:33 UTC (Thu) by lsl (subscriber, #86508) [Link] (6 responses)

Why would they be garbage?

bytes vs. characters

Posted Apr 23, 2015 20:47 UTC (Thu) by marcH (subscriber, #57642) [Link] (5 responses)

> Why would they be garbage?

How else is any invalid encoding displayed?

bytes vs. characters

Posted Apr 23, 2015 22:07 UTC (Thu) by dlang (guest, #313) [Link] (1 responses)

they are only invalid if you decide ahead of time that they are supposed to be UTF8 strings.

the spec allows them to be a string of bytes (excluding null and /), no encoding is required.

bytes vs. characters

Posted Apr 23, 2015 22:30 UTC (Thu) by marcH (subscriber, #57642) [Link]

> they are only invalid if you decide ahead of time that they are supposed to be UTF8 strings.

They are invalid if you decide that they are supposed to be in some encoding and some filename uses any *other* encoding. Then garbage gets displayed: for real.

https://en.wikipedia.org/wiki/Mojibake (search page for "garbage")

It's less rare with removable media or ID3
https://en.wikipedia.org/wiki/ID3 (search page for "mojibake")

> the spec allows them to be a string of bytes (excluding null and /), no encoding is required.

As far as filenames are concerned, you meant: *the lack of* a spec.

http://www.dwheeler.com/essays/fixing-unix-linux-filename... (search page for... "garbage")

> no encoding is required.

Which command do you typically use instead of "ls"? hexdump?

bytes vs. characters

Posted Apr 24, 2015 6:57 UTC (Fri) by mbunkus (subscriber, #87248) [Link] (2 responses)

It's not about the display of broken information. That's the easy part.

But a reliable tool, especially one running on filesystems where nearly anything goes in file names (including newlines, and no discernible encoding at all), should be able to handle such files, too. This goes double for tools where the developer doesn't control the input. Backup software is the prime example.

How often do such files turn up? You'd be surprised… Email clients are still broken and annotate file names with the wrong character set, resulting in broken file names when saving. ZIP files don't carry any encoding information at all, so unpacking one with a file name containing non-ASCII characters often results in ISO-encoded file names on my UTF-8 system. And so on.

Therefore treating file names as anything other than a sequence of bytes is, in general, a really bad idea. Only force an encoding in the places where you actually need one, displaying the file name being the prime example. If you store file names in a database, use binary column formats (or, if you must use text, use hex or some kind of escaping mechanism like URL-encoding a UTF-8 representation). UTF-8 representations have their own problems with file names; think of normalization forms and the fun you have with Macs and non-Macs.

Treating file names correctly is hard enough. Forcing them into any kind of encoding only makes it worse.
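
A hedged sketch of the storage advice above: keep the raw name losslessly, and decode only at the display boundary (the filename is made up):

    import os
    import urllib.parse

    raw = b"report \xe9t\xe9.txt"                 # raw filename bytes, encoding unknown

    # Lossless, ASCII-safe form for a text column:
    stored = urllib.parse.quote_from_bytes(raw)   # 'report%20%E9t%E9.txt'
    assert urllib.parse.unquote_to_bytes(stored) == raw

    # Decode only where a human needs to see it, and only best-effort:
    label = raw.decode("utf-8", "replace")        # or os.fsdecode(raw) to match the OS view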

bytes vs. characters

Posted Apr 24, 2015 17:22 UTC (Fri) by marcH (subscriber, #57642) [Link] (1 responses)

> Only force encodings in places where you need that encoding; displaying the file name being the prime example.

Thanks a lot, this clarifies.

So the core issue seems to be that the filename is the only file handle: lose the name and you lose the file. I agree it shouldn't be like this. For instance, you could have an iterator that returns some opaque FileObject that does not really care about the name. Does Python have this?

bytes vs. characters

Posted Apr 25, 2015 8:19 UTC (Sat) by peter-b (subscriber, #66996) [Link]

> So the core issue seems to be: the filename being the only file handle. Lose the name and you lose the file. I agree it shouldn't be like this. For instance you can have an iterator that returns some opaque FileObject that does not really care about the name. Does Python have this?

Yes. listdir(x) where x is bytes returns the raw filenames as bytes.

https://docs.python.org/3.4/library/os.html?highlight=lis...

bytes vs. characters

Posted Apr 23, 2015 22:11 UTC (Thu) by lopgok (guest, #43164) [Link] (3 responses)

Mission critical is not hard real-time. Mission critical means the mission fails when the program fails to read files in a directory.

Some failures are just annoying, but this one is not for me.

bytes vs. characters

Posted Apr 23, 2015 22:40 UTC (Thu) by marcH (subscriber, #57642) [Link] (2 responses)

> Mission critical means the mission fails when the program fails to read files in a directory.

And the question was: what kind of mission relies on filename garbage?

It's not 100% clear whether you actually care about the filenames themselves, or just about their content.

BTW I totally agree that silently skipping broken filenames is a massive bug.

bytes vs. characters

Posted Apr 24, 2015 0:44 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> And the question was: what kind of mission relies on filename garbage?
For example, a cloud storage client that is used to back up users' files.

bytes vs. characters

Posted May 9, 2015 21:36 UTC (Sat) by nix (subscriber, #2304) [Link]

That would be a backup program written by someone who doesn't understand the difference between Python's byte- and string-based interfaces. I wouldn't trust any backup program written by someone who didn't understand that! God only knows what it's doing to the file content...

Report from the Python Language Summit

Posted Apr 23, 2015 16:25 UTC (Thu) by littlevoice (guest, #102151) [Link] (1 responses)

Great group photo! Was the women's invitational event at a different venue?

Report from the Python Language Summit

Posted Apr 25, 2015 2:29 UTC (Sat) by njs (guest, #40338) [Link]

Yeah, they screwed this up. For whatever it's worth, the first thread on the langsummit list afterwards was a call-out of the decrease in gender diversity as compared to previous years and turned into a discussion about what they could do to improve matters in the future.


Copyright © 2015, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds