Report from the Python Language Summit
The Python Language Summit is an annual event that is held in conjunction with the North American edition of PyCon. Its mission is to bring together core developers of various Python implementations to discuss topics of interest within that group. The 2015 meeting was held April 8 in Montréal, Canada. I was happy to be invited to attend the summit so that I could bring readers a report on the discussions there.
The summit was deemed the "Barry and Larry show" by some, since it was co-chaired by Barry Warsaw and Larry Hastings (seen at right in their stylish fezzes). Somewhere around 50 developers sat in on the talks, which focused on a number of interesting topics, including atomicity guarantees for Python operations, possible plans to make Python 3 more attractive to developers, infrastructure changes for development, better measurement for Python 3 adoption, the /usr/bin/python symbolic link, type hints, and more.
- Atomicity: What operations are guaranteed to be atomic for Python and where/how will that be specified?
- Making Python 3 more attractive: Adding some big-ticket features might make developers more interested in switching to Python 3.
- PyParallel: An alternative Python focused on high performance through parallelism.
- Core development infrastructure: Making Python development workflows and infrastructure better for the future.
- Python 3 adoption: More ideas on what needs to happen to bring about more Python 3, but also to be able to measure that increase.
- The Python symbolic link: Should /usr/bin/python point to Python 2 or Python 3—or perhaps to something else entirely?
- Type hints: Guido van Rossum gives an introduction to the new optional type annotation feature for Python 3.5.
- Python on mobile systems: A video that described the current status and plans for writing mobile apps using Python.
- Adding Requests to the standard library: Should the Requests module be added to the standard library?
- PyMetabiosis: An experimental way to allow PyPy to use C extensions by embedding CPython in PyPy.
- Jython Native Interface (JyNI): A mechanism to allow Jython to use C extensions.
- Python installation options for Windows: A look at the Windows installer and some options for the types of installations it supports in the future.
- Python at Heroku: A talk about where Python fits in at cloud-application hosting provider Heroku.
Index entries for this article:
Conference: PyCon/2015
Python: Python Language Summit
Posted Apr 15, 2015 0:19 UTC (Wed)
by smoogen (subscriber, #97)
[Link] (53 responses)
And yes this is probably an incredibly stupid and foolish thing to suggest. I just don't see an easier path due to the complexity and deep fundamental changes required.
Posted Apr 15, 2015 0:23 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (52 responses)
Posted Apr 15, 2015 12:56 UTC (Wed)
by nix (subscriber, #2304)
[Link] (51 responses)
Posted Apr 15, 2015 14:22 UTC (Wed)
by jezuch (subscriber, #52988)
[Link] (12 responses)
Frankly, this usefulness is rather obvious to anyone living in the non-ASCII world. When you are using ą, ę, ź, ć, etc. daily (and that's still just a variant of the Latin script), you learn pretty quickly about character encodings. The hard way, usually ;)
Posted Apr 15, 2015 15:01 UTC (Wed)
by cesarb (subscriber, #6266)
[Link]
It's even more useful when diacritics are rare. If you use diacritics all the time, encode/decode bugs are found quickly. If you don't, encode/decode bugs will remain hidden, until one day they blow up as one of the dreaded Unicode encode/decode exceptions (or worse, show up as mojibake).
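The latent-bug pattern described here is easy to reproduce in a few lines. A minimal sketch (the helper name is made up): data is written as UTF-8 but read back as Latin-1, which is invisible for ASCII-only input and becomes mojibake on the first diacritic.

```python
# Sketch of the latent-bug pattern described above; the helper name is
# illustrative. Data is written as UTF-8 but read back as Latin-1.
def store_and_reload(text, read_encoding):
    raw = text.encode("utf-8")        # what actually lands on disk
    return raw.decode(read_encoding)  # what the buggy reader does

# ASCII-only input keeps the bug invisible: both encodings agree.
assert store_and_reload("hello", "latin-1") == "hello"

# The first diacritic turns it into mojibake.
assert store_and_reload("café", "latin-1") == "cafÃ©"
```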
Posted Apr 15, 2015 20:25 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (10 responses)
Yet I find the whole char/byte distinction to be extremely moronic.
Posted Apr 16, 2015 11:34 UTC (Thu)
by HelloWorld (guest, #56129)
[Link] (5 responses)
Posted Apr 16, 2015 18:11 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (4 responses)
The only sane and modern way to do Unicode is UTF-8.
Posted Apr 16, 2015 19:27 UTC (Thu)
by HelloWorld (guest, #56129)
[Link] (1 responses)
> The only sane and modern way to do Unicode is UTF-8.
Posted Apr 16, 2015 22:40 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
> Regardless of whether this is true or not, there is a lot of data in all kinds of encodings, and developers had better think about which one they are going to use when reading that data.
Posted Apr 20, 2015 10:38 UTC (Mon)
by niner (subscriber, #26151)
[Link]
Characters? I'd say one. I can definitely say (as far as I understand this anyway) that it's one grapheme and one or more code points.
Perl 6 will deal with strings as sequences of Normalized Form Graphemes (NFG). There's a very interesting blog post about what this means:
https://6guts.wordpress.com/2015/04/12/this-week-unicode-...
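The grapheme/code-point distinction is easy to demonstrate in plain Python 3, which works at the code-point level; a minimal illustration:

```python
import unicodedata

# One user-perceived character ("ç"), two different code point sequences.
composed = "\u00e7"        # precomposed LATIN SMALL LETTER C WITH CEDILLA
decomposed = "c\u0327"     # 'c' followed by COMBINING CEDILLA

assert len(composed) == 1    # one code point
assert len(decomposed) == 2  # two code points, but still one grapheme

# NFC normalization collapses the pair back into the precomposed form.
assert unicodedata.normalize("NFC", decomposed) == composed
```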
I guess the only two sane ways of handling Unicode are:
Posted Apr 21, 2015 22:26 UTC (Tue)
by nix (subscriber, #2304)
[Link]
No, not everything can use UTF-8, even in an ideal world. And such systems will *always* use different encodings, so to talk to such systems Python's enforced conversion is extremely valuable. And even when you're not, and the system you are talking to uses UTF-8 or some other Unicode variant, the enforced conversion is *still* valuable because it forces you to think about what encoding is in use, and amazingly often it's not straight UTF-8, or it's UTF-8 with extra requirements such as needing to be canonicalized or decanonicalized in a particular way or "oops we didn't say but experimentation makes it clear that $strange_canonicalization is the only way to go". (I have seen all of these on real systems, along with people claiming UTF-8 but meaning UCS-16 because they didn't know there was a difference, and vice versa -- and, in the latter cases, cursed them.)
Posted Apr 16, 2015 21:28 UTC (Thu)
by flussence (guest, #85566)
[Link] (3 responses)
It is, but I'd say that's because nobody has a clue how to define "char". It can mean all sorts of things depending on where it's used:
* 1 byte in a legacy encoding (or C)
Posted Apr 16, 2015 22:38 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
That's why I violently oppose the definition: "UCS-4 codepoint is a character or GTFO", which Python3 tries to enforce.
Posted Apr 21, 2015 22:31 UTC (Tue)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Apr 21, 2015 23:06 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link]
I still think that treating strings as sequences of UTF-8 characters and/or bytes is the best possible way. Enforced UCS-4 rarely helps.
Posted Apr 15, 2015 15:21 UTC (Wed)
by david.a.wheeler (subscriber, #72896)
[Link] (37 responses)
When you can be certain that all your input is perfectly formatted, the Python 3 string model is a good one. But the world isn't perfect. In particular, data sources routinely lie about their encoding, and Python 3 interferes with handling the real world instead of helping with it. For example, often there is no single encoding; many sources are a mishmash of UTF-8 and Windows-1252 and maybe some other stuff in a single file. What, exactly, is the encoding format of stdin? The answer is: there isn't one. What's the encoding format of filenames on Linux and Unix? There isn't one (they hacked around filenames, but failed to hack around ALL data sources, even though they all have this problem).
The "Unicode dammit" library helps. A little. But I find myself unable to find a reason to use Python 3, and I can find a long list of reasons to use Python 2 or some other language instead. I think I am not alone.
Posted Apr 15, 2015 18:22 UTC (Wed)
by njs (guest, #40338)
[Link]
I'm curious if you could elaborate on what interference you're thinking of? I don't have a dog in the fight or anything, but my experience with py3 has been pretty pleasant so far, and I don't see off the top of my head how py3 could do worse than py2 in this case. It seems like at worst you would end up writing the same code in both cases to treat the data as bytes, try different encodings or whatever you want, with the main difference that in py3 at least you don't have to deal with random functions deciding to help out by spontaneously encoding/decoding with some random codec. Or depending on what you're doing, surrogate-escape could be pretty useful too, and that's a py3 feature.
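The surrogate-escape mechanism mentioned here is PEP 383; a minimal sketch of its lossless round trip:

```python
# PEP 383 round trip: undecodable bytes survive as lone surrogates.
raw = b"valid \xff invalid"  # 0xff is not valid UTF-8

text = raw.decode("utf-8", errors="surrogateescape")
assert "\udcff" in text      # the bad byte became surrogate U+DCFF

# Encoding with the same handler restores the original bytes exactly.
assert text.encode("utf-8", errors="surrogateescape") == raw
```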
Posted Apr 15, 2015 18:35 UTC (Wed)
by HelloWorld (guest, #56129)
[Link] (2 responses)
Posted Apr 15, 2015 21:28 UTC (Wed)
by tpo (subscriber, #25713)
[Link] (1 responses)
I think "Truth" and "right" are attributes of the powerful. Whereas "be liberal in what you accept" is an expression of humility, of the wish to serve.
I can see the point of standing up for a cause. But probably the cause must not be self serving in order to legitimize the use of the force of refusal.
Maybe.
Remember what consequences "being right" had during the browser wars? Or how "being right" is constructing walled gardens today?
Posted Apr 16, 2015 13:22 UTC (Thu)
by smitty_one_each (subscriber, #28989)
[Link]
Excess rigidity is the key to maintaining a negligible user base.
Posted Apr 16, 2015 21:33 UTC (Thu)
by zyga (subscriber, #81533)
[Link] (12 responses)
Don't say python3 is not practical for the real world. It's like complaining that python has an int type and a string type and you must use this confusing concept of using the right one at the right time, while perl is so much better because it doesn't put this confusing non-real-life problem in front of you. That is totally missing the point.
People that stay stuck in python2 due to migration complexity are not the problem. They will eventually move. People that think python2 model of binary soup is somewhat superior or that you cannot achieve that in python3 need to get a clue.
Posted Apr 16, 2015 22:36 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (11 responses)
For example, the JSON decoder in Python 3 _insists_ on decoding strings as strings, even if they contain invalid UTF-8 data. That's bad, but such services do exist out there and sometimes you have to work with them.
Ditto for HTTP headers.
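The problem can be reproduced with a made-up payload: the strict decode fails before the JSON parser ever sees the data, and surrogateescape is one escape hatch (a sketch, not the thread's actual workaround):

```python
import json

# Hypothetical payload from a misbehaving service: a JSON string value
# carrying a raw byte (0xff) that is not valid UTF-8.
raw = b'{"name": "\xff"}'

# The strict decode fails before the JSON parser ever sees the data.
try:
    raw.decode("utf-8")
    assert False, "expected UnicodeDecodeError"
except UnicodeDecodeError:
    pass

# One escape hatch: surrogateescape maps the bad byte to a lone
# surrogate, which the parser passes through unchanged.
obj = json.loads(raw.decode("utf-8", errors="surrogateescape"))
assert obj == {"name": "\udcff"}
```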
Posted Apr 16, 2015 23:15 UTC (Thu)
by dbaker (guest, #89236)
[Link] (1 responses)
Because the JSON spec requires that strings must be unicode?
Can't you just write a custom object_hook function to pass to the decoder to solve your problem?
Posted Apr 16, 2015 23:19 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
> Can't you just write a custom object_hook function to pass to the decoder to solve your problem?
We simply switched to a third-party library instead.
Posted Apr 17, 2015 7:15 UTC (Fri)
by zyga (subscriber, #81533)
[Link] (1 responses)
HTTP headers are a perfect example of binary data. Handling them as unicode text is broken IMHO. You can just use byte processing for everything there and Python 3.4, AFAIR, fixed some last gripes about lack of formatting support for edge cases like that.
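For the record, the bytes-formatting gap alluded to here was closed by PEP 461, which landed in Python 3.5 rather than 3.4. A sketch of all-bytes header handling (the header name and value are invented):

```python
# HTTP headers handled purely as bytes; PEP 461 (Python 3.5) restored
# %-formatting for bytes objects.
name = b"X-Custom-Id"                # illustrative header name
value = b"\xde\xad\xbe\xef"          # opaque binary value, no encoding

header = b"%s: %s\r\n" % (name, value)
assert header == b"X-Custom-Id: \xde\xad\xbe\xef\r\n"

# Splitting and comparison also work bytes-to-bytes, no decode needed.
key, _, val = header.rstrip(b"\r\n").partition(b": ")
assert key == b"X-Custom-Id" and val == value
```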
Posted Apr 17, 2015 7:22 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
I'd love to fix these data sources, but they're out of my control. The vendor knows about it and they plan to base64 binary data in the future, but for now I have to work with what I have.
> You can just use byte processing for everything there and Python 3.4
I've fixed tons of code like this:
>def blah(p):
It mostly works as is, but occasionally it doesn't.
Posted Apr 17, 2015 14:35 UTC (Fri)
by intgr (subscriber, #39733)
[Link] (6 responses)
How do you think it should behave? Always return bytes? No, JSON specifies Unicode. Try Unicode first but fall back to bytes? No, that seems like it would be very surprising behavior and cause more bugs than it solves. AFAICT the alternatives are far worse.
The situation you're in is *caused* by implementations being too lenient and accepting junk as input, which is why the vendor has not noticed this issue before.
Posted Apr 18, 2015 11:03 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link] (5 responses)
Posted Apr 20, 2015 7:59 UTC (Mon)
by zyga (subscriber, #81533)
[Link] (4 responses)
At this time you just generate a stream of bytes (not python bytes, just bytes) that has some meaning that is only sensible to you and whoever consumes your byte stream. It's not json.
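The usual way to make such a payload valid JSON again is the base64 wrapping mentioned upthread; a minimal sketch:

```python
import base64
import json

# Sketch of the standard workaround: base64-wrap the raw bytes so the
# payload is plain ASCII and therefore valid JSON.
blob = b"\x00\xff\xfe arbitrary bytes"

wire = json.dumps({"blob": base64.b64encode(blob).decode("ascii")})

# The receiver reverses both steps losslessly.
decoded = base64.b64decode(json.loads(wire)["blob"])
assert decoded == blob
```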
Posted Apr 20, 2015 11:43 UTC (Mon)
by Jonno (subscriber, #49613)
[Link] (3 responses)
Posted Apr 20, 2015 12:21 UTC (Mon)
by bcopeland (subscriber, #51750)
[Link]
Posted Apr 20, 2015 13:19 UTC (Mon)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
Posted Apr 21, 2015 4:12 UTC (Tue)
by lsl (subscriber, #86508)
[Link]
Posted Apr 17, 2015 18:27 UTC (Fri)
by marcH (subscriber, #57642)
[Link] (19 responses)
"Many"... how many? Sure, it happens every time I throw the random cr*p stored on my hard drive at a quick and dirty script I just hacked together. I think it's a small price to pay for type safety whenever you and I write proper, reliable software.
If you have a source with a hopelessly entangled mix of UTF-8 and Windows-1252, and you had the freedom to re-design Python 3 (or whatever else), what sensible thing could you possibly do with it *anyway*? Genuine question.
Posted Apr 23, 2015 15:27 UTC (Thu)
by lopgok (guest, #43164)
[Link] (18 responses)
I understand it is problematic to do string processing on oddly constructed strings, but it is mission critical for me to be able to see all the files in a directory. If an exception was raised it would really suck, but it would suck less than silently skipping file names that it didn't understand.
That is the reason I have not migrated all of my development to python 3.
Posted Apr 23, 2015 17:00 UTC (Thu)
by cesarb (subscriber, #6266)
[Link] (4 responses)
I just tested here, and the python3 in this machine returns all filenames in os.listdir('.'), even the one I created with an invalid UTF-8 encoding.
Skipping over some files was true in Python 3.0 (https://docs.python.org/3/whatsnew/3.0.html#text-vs-data-...):
"Note that when os.listdir() returns a list of strings, filenames that cannot be decoded properly are omitted rather than raising UnicodeError."
(The same paragraph mentions that you could still use os.listdir(b'.') to get all filenames as bytes, so even with Python 3.0 you already had a way to read the name of all the files.)
But that was probably changed in Python 3.1, when PEP 383 (https://www.python.org/dev/peps/pep-0383/) was implemented, since with it there are no "filenames that cannot be decoded properly".
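Both escape hatches can be checked in a few lines on a Linux filesystem (the filename is deliberately invalid UTF-8; behavior on filesystems that reject such names, e.g. on macOS, will differ):

```python
import os
import tempfile

# Both escape hatches in action: a filename that is deliberately invalid
# UTF-8, listed as bytes and as str.
with tempfile.TemporaryDirectory() as d:
    bad = os.path.join(os.fsencode(d), b"bad-\xff-name")
    open(bad, "wb").close()

    # bytes in, bytes out: every entry is returned verbatim.
    assert b"bad-\xff-name" in os.listdir(os.fsencode(d))

    # str in, str out: nothing is skipped; under a UTF-8 locale the bad
    # byte shows up as the lone surrogate U+DCFF (PEP 383)...
    name = next(n for n in os.listdir(d) if n.startswith("bad-"))

    # ...and os.fsencode recovers the original bytes exactly.
    assert os.fsencode(name) == b"bad-\xff-name"
```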
Posted Apr 23, 2015 22:09 UTC (Thu)
by lopgok (guest, #43164)
[Link] (3 responses)
I have a directory which is read just fine with python 2.7, but skips files with python 3.
Posted Apr 24, 2015 11:44 UTC (Fri)
by cesarb (subscriber, #6266)
[Link] (2 responses)
I just took a quick look at the current Python source code for os.listdir (https://hg.python.org/cpython/file/151cab576cab/Modules/p...), and it only has code to skip the "." and ".." entries, as it's documented to do. In both the "str" and the "bytes" case, it adds every entry other than these two. For it to skip anything else on os.listdir, readdir() from glibc has to be skipping it, and it should affect more than just Python.
Or is the problem with something other than os.listdir?
Posted Apr 24, 2015 14:18 UTC (Fri)
by lopgok (guest, #43164)
[Link] (1 responses)
I do find it odd that the OS can list the file and I can manipulate the file name on the command line, but because it has some odd characters in it, python silently skips over it.
Posted May 9, 2015 21:34 UTC (Sat)
by nix (subscriber, #2304)
[Link]
Posted Apr 23, 2015 17:07 UTC (Thu)
by marcH (subscriber, #57642)
[Link] (12 responses)
"Mission-critical" relying on filename garbage, mmm.... really? What kind of mission?
Posted Apr 23, 2015 18:03 UTC (Thu)
by zlynx (guest, #2285)
[Link]
If Python3 really is silently ignoring invalid filenames then that should be marked as a critical flaw.
The real world is not a perfect place with perfectly encoded strings.
Posted Apr 23, 2015 20:33 UTC (Thu)
by lsl (subscriber, #86508)
[Link] (6 responses)
Posted Apr 23, 2015 20:47 UTC (Thu)
by marcH (subscriber, #57642)
[Link] (5 responses)
How else is any invalid encoding displayed?
Posted Apr 23, 2015 22:07 UTC (Thu)
by dlang (guest, #313)
[Link] (1 responses)
the spec allows them to be a string of bytes (excluding null and /), no encoding is required.
Posted Apr 23, 2015 22:30 UTC (Thu)
by marcH (subscriber, #57642)
[Link]
They are invalid if you decide that they are supposed to be in some encoding and some filename uses any *other* encoding. Then garbage gets displayed: for real.
https://en.wikipedia.org/wiki/Mojibake (search page for "garbage")
It's less rare with removable media or ID3
> the spec allows them to be a string of bytes (excluding null and /), no encoding is required.
As far as filenames are concerned, you meant: *the lack of* a spec.
http://www.dwheeler.com/essays/fixing-unix-linux-filename... (search page for... "garbage")
> no encoding is required.
Which command do you typically use instead of "ls"? hexdump?
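For the record, the garbage is trivially reproducible; decoding UTF-8 bytes as Windows-1252 is the classic case:

```python
# UTF-8 bytes misread as Windows-1252: the classic two-character garbage
# ('Ã' + '¼' appearing in place of 'ü').
raw = "Zürich".encode("utf-8")               # b'Z\xc3\xbcrich'
assert raw.decode("windows-1252") == "Z\u00c3\u00bcrich"

# The opposite misreading doesn't even decode: 0xfc is not legal UTF-8.
try:
    "Zürich".encode("windows-1252").decode("utf-8")
    assert False, "expected UnicodeDecodeError"
except UnicodeDecodeError:
    pass
```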
Posted Apr 24, 2015 6:57 UTC (Fri)
by mbunkus (subscriber, #87248)
[Link] (2 responses)
But a reliable tool, especially one running on filesystems where nearly everything goes (including newlines in them and having no discernible encoding at all), should be able to handle such files, too. This goes double for tools where the developer doesn't control the input. Backup software is the prime example.
How often do such files occur? You'd be surprised… Email clients are still broken and annotate file names with the wrong character set, resulting in broken file names when saving. ZIP files don't have any encoding information at all, so unpacking one with a file name containing non-ASCII characters often results in ISO-encoded file names on my UTF-8 system. And so on.
Therefore treating file names as anything else than a sequence of bytes is, in general, a really bad idea. Only force encodings in places where you need that encoding; displaying the file name being the prime example. If you store it in a database use binary column formats (or if you must use hex then use some kind of escaping mechanism like URL encoding an UTF-8 representation). UTF-8 representations have their own problems regarding file names, think of normalization forms and the fun you're having with Macs and non-Macs.
Treating file names correctly is hard enough. Forcing them into any kind of encoding only makes it worse.
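A small sketch of that rule (the filename is invented): keep the name as bytes internally, decode lossily only at the display boundary, and store something losslessly reversible:

```python
# Keep the name as bytes; decode only at the display boundary.
name = b"report-\xe9.txt"   # invented name: ISO-8859-1 'é', invalid UTF-8

# For display, a visible lossy fallback is fine.
shown = name.decode("utf-8", errors="replace")
assert shown == "report-\ufffd.txt"   # U+FFFD marks the bad byte

# For storage, keep something losslessly reversible, e.g. hex.
stored = name.hex()
assert bytes.fromhex(stored) == name
```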
Posted Apr 24, 2015 17:22 UTC (Fri)
by marcH (subscriber, #57642)
[Link] (1 responses)
Thanks a lot, this clarifies.
So the core issue seems to be: the filename being the only file handle. Lose the name and you lose the file. I agree it shouldn't be like this. For instance you can have an iterator that returns some opaque FileObject that does not really care about the name. Does Python have this?
Posted Apr 25, 2015 8:19 UTC (Sat)
by peter-b (subscriber, #66996)
[Link]
Yes. listdir(x) where x is bytes returns the raw filenames as bytes.
https://docs.python.org/3.4/library/os.html?highlight=lis...
Posted Apr 23, 2015 22:11 UTC (Thu)
by lopgok (guest, #43164)
[Link] (3 responses)
Some failures are just annoying, but this one is not for me.
Posted Apr 23, 2015 22:40 UTC (Thu)
by marcH (subscriber, #57642)
[Link] (2 responses)
And the question was: what kind of mission relies on filename garbage?
It's not 100% clear whether you actually care about the filenames themselves, or just about their content.
BTW I totally agree that silently skipping broken filenames is a massive bug.
Posted Apr 24, 2015 0:44 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Posted May 9, 2015 21:36 UTC (Sat)
by nix (subscriber, #2304)
[Link]
Posted Apr 23, 2015 16:25 UTC (Thu)
by littlevoice (guest, #102151)
[Link] (1 responses)
Posted Apr 25, 2015 2:29 UTC (Sat)
by njs (guest, #40338)
[Link]
Yes, you have stated that many times, and the response is still the same: You're confused. Bytes aren't Characters and Characters aren't bytes, period. It's as simple as that.
What does that have to do with the fact that bytes are not text/strings/characters?
Regardless of whether this is true or not, there is a lot of data in all kinds of encodings, and developers had better think about which one they are going to use when reading that data.
The fact that sequences of UCS-4 code points are also not text/strings/characters, just as sequences of raw bytes aren't.
Python3 practically forces one to transcode data from one format to another all the time for no specific reason.
* be completely agnostic and treat strings as opaque sequences of bytes, or
* go all in and work with graphemes whenever possible.
* If you're using a half-baked library, it's 2 or 4 bytes UTF-16. (Qt4 falls into the "completely baked" category as it'll let you backspace over half an emoji character.)
* If you're lucky, someone actually implemented Unicode correctly and a "character" is a variable length sequence of bytes encoding a full ISO-10646 codepoint. Such as U+00C7, which is the character "Ç". Or U+0327, which is a squiggly line and definitely not a character.
* The only sensible and correct definition: a character is the thing you would write by hand — "Ḉ".length == "Ḉ".length == "Ḉ".length — almost no software uses this definition.
Correct. And complex scripts or complex characters make it even more complicated.
bytes vs. characters
The world is a messy place because we make it one. It's the idiotic “be liberal in what you accept” doctrine that led us here, and the only way out is to not create more crap that tries to cope with bad input in “helpful” ways rather than simply rejecting bad input.
Except that you can't do it.
Yes, but the reality outside is a little bit different.
No, 'encoder' parameter is ignored in json.loads and everything else already gets decoded strings.
I think it will still be broken. There's a workaround that simply stores binary bytes in the lower byte of UCS-4 codepoints and it sorta works.
Not exactly. Most of the built-in library can be used with byte sequences, but third-party libraries are often too careless.
> if fail_to_do_something(p):
> raise SomeException(u"Failed to frobnicate %s!" % p)
Actually there is: it's called an "array of numbers", i.e. [ 97, 114, 114, 97, 121, 32, 111, 102, 32, 110, 117, 109, 98, 101, 114, 115, 0 ]
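Round-tripped through Python's json module, that representation looks like this:

```python
import json

# The "array of numbers" representation, round-tripped through JSON.
blob = bytes([97, 114, 114, 97, 121, 32, 111, 102, 32,
              110, 117, 109, 98, 101, 114, 115, 0])   # the list above

wire = json.dumps(list(blob))
assert bytes(json.loads(wire)) == blob
```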
https://en.wikipedia.org/wiki/ID3 (search page for "mojibake")
For example, a cloud storage client that is used to back up users' files.