Modern Python performance considerations
A lot of work is currently going into speeding up Python; Kevin Modzelewski gave a presentation at PyCon 2022 on some of it. Much of that work has implications for how Python programmers can best take advantage of the optimizations in their code. He gave an overview of some of the projects and the kinds of optimizations being worked on, and provided some benchmarks to give a general idea of how much faster various Python implementations are getting, and which operations are most affected.
Modzelewski works at Anaconda on the Pyston "optimized Python interpreter". He wanted to focus on "modern Python" in the talk; there are lots of tips about how to speed up Python code available, but many of those are "not quite as useful anymore". There are some new tips, however, that can be used with these up and coming optimized implementations, which he wanted to talk about.
Why Python is slow
The first topic he raised, "why Python is slow", is somewhat divisive, he said; everyone seems to have a different view on that, but he would be presenting his personal views on it. This is not the first time we have reported on a talk by Modzelewski on this subject; he spoke at the 2016 Python Language Summit on the question of why the language is slow, along with a bit about Pyston.
The most common reason given is that interpreted languages are slow and, since Python is interpreted, it is slow, but that is not what he has found. In his measurements of web servers, the overhead of interpretation is about 10%; that is "significant, and people don't want it, but it doesn't explain why Python can be ten to 100 times slower than C". In order to explain that, you need to look at the dynamic nature of Python.
"Python is a very dynamic language", in lots of different ways, but he wanted to focus on just a few of those in the talk. The first is perhaps the most obvious, the interpreter does not know the types of any of the variables. That means any operation on a variable needs to look up what type it is in order to figure out how to perform the operation on that type.
The second dynamic behavior is in variable lookups, though that is a less obvious one. For example, "print" is not a keyword in Python, so a simple print() call needs to look up the value for the "print" name in order to call it as a function. In particular, it has to determine whether the name has been overridden by anything in the scope and it has to do that every time print() is called. The same goes for things like len(), int(), and many more built-in functions; all of them require expensive lookups every time they are used.
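A minimal sketch of why that lookup cannot simply be skipped, using len() rather than print() so that nothing is written to the terminal. The function name count_items and the override that returns -1 are hypothetical, chosen only for illustration:

```python
import dis


def count_items(items):
    # "len" is an ordinary name here, resolved each time this line runs:
    # the module's globals are searched first, then the builtins.
    return len(items)


# The compiled function refers to "len" only by name, via a global load.
global_loads = {
    ins.argval
    for ins in dis.get_instructions(count_items)
    if ins.opname == "LOAD_GLOBAL"
}

before = count_items([1, 2, 3])        # finds the built-in len()

# Shadow the built-in in the function's global namespace; this has the same
# effect as assigning to "len" at module level. The override is hypothetical.
namespace = count_items.__globals__
namespace["len"] = lambda items: -1
shadowed = count_items([1, 2, 3])      # same function, different result

del namespace["len"]                   # remove the override
restored = count_items([1, 2, 3])      # the built-in is visible again
```

Because the result of count_items() changes without the function itself being touched, the interpreter has to either repeat the lookup on every call or have some cheap way to detect that nothing was rebound.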
The third aspect is Python's dynamic attribute lookup. In general, even in a single class, the interpreter does not know what attributes exist on an object of that class. That means it needs a dynamic representation for the object to track the attributes. Looking up the attributes in Python is pretty fast, but still much slower than it would be if the list of attributes were static.
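As a rough illustration of that dynamic representation (the Point and Lazy classes are made up for this sketch), attributes can be added or removed after an object is created, and a class can intercept lookups altogether:

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(1, 2)

# Instance attributes are exposed as a per-object mapping, not a fixed layout.
attrs_initial = dict(vars(p))

# Any code can add or remove attributes at run time.
p.label = "somewhere"
del p.y
attrs_after = dict(vars(p))


class Lazy:
    def __getattr__(self, name):
        # Called only when the normal lookup fails to find the attribute.
        return f"computed:{name}"


lazy_value = Lazy().anything
```

Here attrs_initial holds x and y, while attrs_after holds x and label; the Lazy instance answers for an attribute that was never stored anywhere. An interpreter has to allow for all of these possibilities on every attribute access.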
There are three separate projects that are currently being worked on to try to speed up Python in various ways; all of them are "coming out in various forms either last year or this year". There is his project, Pyston, which is a fork of CPython; the Faster CPython project, which is being worked on in the main CPython repository by a team at Microsoft; and Cinder, a fork of CPython that is being worked on at Instagram. All of them are available now in one form or another.
Benchmarks
The "controversial slide" of his talk did not look like much at the outset, as it was just an empty table. He would be filling it in with his benchmarks of various projects over the next part of the talk "with a lot of disclaimers". The first controversial piece of that is the choice of which benchmarks to use to analyze performance.
There is the well-established pyperformance benchmark, which is "nice in a lot of ways". It is a semi-standard that is used by a lot of people for reporting Python benchmark numbers. In his experience, though, it tends to overstate performance benefits, so he likes to look more at application code for benchmarks.
To that end, he wrote a benchmark using the Flask web framework. He chose Flask because it is one of the simpler Python web frameworks, so he thought he would be able to get more of the projects working with it. He would be showing results for both of those benchmarks for several different projects.
The "next controversial thing" is which CPython version to choose as a baseline to measure the others against. He chose Python 3.8 because that is what Pyston is based on; all of the comparison numbers he presented were in relation to that baseline. He used the Ubuntu version of Python 3.8 because it is one of the faster builds of that version of Python. He was surprised to find that different builds were significantly faster or slower than others; Ubuntu is fast, while the macOS and Windows builds are slow.
"I get to list my project first", he said with a grin. He measured Pyston 2.3.2 as 62% faster on pyperformance, but only 34% faster on the flask benchmark. Those numbers are quite different, obviously, and he was not claiming that one benchmark was more accurate than the other. It just shows that it is important to choose a benchmark that is more representative of the kinds of programs you will be running.
He moved on to Python 3.11a7 from early April, which includes most of the Faster CPython work. "They also show good improvements on both of these numbers." On pyperformance, it was 15% faster; 10% faster for flask. The Faster CPython folks are reporting a different number for pyperformance, 25%, but that is not what he measured; "I don't know exactly where the difference is".
Cinder does not have releases, so he just grabbed the code from GitHub and built it. He got strange numbers that showed a marked decrease in performance compared to Python 3.8 (-51% for pyperformance and -18% for flask). He put question marks next to those because he does not believe they are real numbers; Instagram is using it internally and he doubts they would be using something slower. He wondered if perhaps there were patches that were not yet released into the GitHub tree.
He also benchmarked two other projects, both of which use just-in-time (JIT) compilers: PyPy and Pyjion. PyPy is fairly well-known, while Pyjion is less so, but he was curious to see the measurements for them. PyPy 7.3.9 is not able to run pyperformance because it does not support all of the benchmark suite's dependencies; it was 36% slower on flask, which he believes reflects the different set of tradeoffs that PyPy has made. Pyjion was effectively a bust since he measured it at 1000 times slower on flask. That number got a double question mark because he does not think that reflects the numbers that the project is getting, but he "did not have time to sort all these things out before the talk, unfortunately".
What is being done
In a variety of ways, these projects are addressing the problems he identified that contribute to making Python slow. The interpretation overhead is being addressed in projects like Pyston and Cinder by way of JIT compilers. They convert the Python code as it is running into assembly instructions; "this sort of definitionally gets rid of interpretation overhead". While JIT compilation is interesting from a technical perspective, he would not be talking much about it because its gains come for free; the 10% overhead is nice to get back, but programmers cannot affect it much. Changing your code will not really make much of a difference one way or the other to the performance gains that JIT compilation brings, he said.
In what he called a "sweeping generalization", Modzelewski said that the three main projects he was focusing on were applying the same "bread and butter technique" in a variety of ways. Those techniques are based on the idea that most code "does not use the full dynamic power that it could at any given time" and that Python can quickly check to see if they are using the dynamic features. If those features are not being used, the language can do something fast instead of following the slower path needed to handle them.
That is the source of most of the speedups shown in the benchmarks. It sounds great, he said: "Python has dynamic features but you are not paying for them if you are not using them anymore." You can turn that statement around, however: you are paying for the dynamic features if you use them. Those features used to come for free, because you paid for them whether you used them or not.
But that situation has changed. You do not need to avoid dynamic features, and code will still get performance improvements if you continue to use them. But if you want to get the best performance possible, thinking about these things, and avoiding those features where possible, will make your code even faster, Modzelewski said.
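The "check quickly, then take the fast path" idea can be modeled in a few lines of pure Python. This is only a toy sketch under stated assumptions: the class names VersionedNamespace and GuardedLookup are hypothetical, and the real interpreters implement their guards in C or in JIT-generated code with different invalidation machinery.

```python
class VersionedNamespace:
    """Toy stand-in for a namespace that counts every change made to it."""

    def __init__(self, **names):
        self._names = dict(names)
        self.version = 0

    def set(self, name, value):
        self._names[name] = value
        self.version += 1          # any change invalidates earlier guesses

    def slow_lookup(self, name):
        return self._names[name]   # stands in for the fully general path


class GuardedLookup:
    """One call site's cache: the last result plus the version it was seen at."""

    def __init__(self, namespace, name):
        self.namespace = namespace
        self.name = name
        self.cached_version = None
        self.cached_value = None
        self.slow_paths = 0

    def get(self):
        # Guard: one cheap comparison decides whether the guess still holds.
        if self.cached_version == self.namespace.version:
            return self.cached_value
        # Guard failed: take the general path and re-arm the cache.
        self.slow_paths += 1
        self.cached_value = self.namespace.slow_lookup(self.name)
        self.cached_version = self.namespace.version
        return self.cached_value


ns = VersionedNamespace(greet=str.upper, counter=0)
site = GuardedLookup(ns, "greet")

for _ in range(1000):
    site.get()
slow_when_static = site.slow_paths                       # only the first call

for i in range(1000):
    ns.set("counter", i)                                 # an unrelated change
    site.get()
slow_when_dynamic = site.slow_paths - slow_when_static   # every call
```

In this model, code that leaves the namespace alone pays for the general lookup once, while code that keeps changing it pays every time, even though the name being looked up was never the one that changed.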
Examples
The penalty for looking up built-ins (and other global variables) that he described at the beginning of the talk is one of the areas that has been optimized. If the code is using lots of print() or len() calls, for example, these newer Pythons speculate that they have not been reassigned since the last time the lookup was done. It is easy in CPython to know whether any global variable has been reassigned since the last time the lookup was done; if not, the value of the lookup has to be the same as it was the last time. He showed two function definitions to demonstrate what he meant by reassignment:
    def f():
        global l
        l = []
        # slow:
        print()

    def f():
        global l
        l.append(1)
        # not slow:
        print()

In the first function, an explicitly global variable has been reassigned, which means that the slower path needs to be used to look up print(). In the second function, l has simply been mutated, which does not affect the speed since no global reassignment has been done.
He showed his measurements of a benchmark both with and without reassignments. For Python 3.8, the times were the same (12.3ns), which indicates that the price is paid in either case. For Pyston, there was a sizable reduction for the case with reassignments (9.5ns) and a huge boost for the case without them (1.7ns). Python 3.11a7 had a nearly two-fold increase in speed even with reassignments (6.4ns), and a less dramatic drop from there without them (5.9ns).
He cautioned that the numbers should not be taken too literally as he thinks they will evolve rapidly. He was a bit surprised by the measurements and suspects that the Faster CPython team will get some ideas from them as well. But the overall conclusion is that, in modern Python, not reassigning global variables will make the rest of the code run faster. He suggested that any needed global mutable state be stored in an object if faster performance is the goal.
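One way to follow that suggestion is sketched below; the names request_count, Stats, and the two handler functions are hypothetical. The first version rebinds a module-level name on every call, while the second binds a name once and only mutates the object it refers to:

```python
# Pattern to avoid in hot code: rebinding a module-level name.
request_count = 0


def handle_request_rebinding():
    global request_count
    request_count += 1            # reassigns the global on every call


# Alternative: bind the name once and mutate the object it refers to.
class Stats:
    def __init__(self):
        self.request_count = 0


stats = Stats()                   # bound once, never reassigned


def handle_request_mutating():
    stats.request_count += 1      # attribute store; no global is reassigned


for _ in range(3):
    handle_request_rebinding()
    handle_request_mutating()
```

Both versions count the same thing; the difference is only in whether the module's global variables keep being reassigned while the program runs.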
Attribute lookup is similar in some ways. In general, an object's attributes are stored in a dictionary, which has a fast hash table implementation in Python, but it is still slower than in C where a value can be retrieved using a pointer. An individual attribute lookup is not terribly slow, but Python programs do a lot of them so it adds up.
The technical details are rather complex, he said, but at a high level the same idea is being applied to attributes: speculating that if a lookup "looks the same" as the previous one, it can be executed the same way it was done before. He showed two ways that changing an object's "shape" will affect its run-time performance:
    # different shape
    class Cls:
        pass

    obj1 = Cls()
    obj2 = Cls()
    obj1.x = 1
    obj2.y = 2

    # type mutated
    class Cls:
        pass

    obj = Cls()
    obj.x = 1
    Cls.y = 2

In the first case, attribute lookup on the two objects will be slow for the rest of the program once those statements have been executed. There are a lot of ways that a class can intercept the lookup of its attributes, but they are not usually used; the interpreter can know that those interceptions have not been used before, but once the class itself is changed, that situation may have changed. In the second case, changing the class means that the current fast path for class attribute lookup has to be bypassed because the interpreter cannot know whether the change affects attribute lookups.
He made a benchmark to measure the two cases above and the "happy case" where neither of those was done. He reported those numbers, which showed that both Pyston and Faster CPython improved things considerably in nearly all of the cases, with the happy case showing roughly 6x speedup for Pyston and 3x for Python 3.11a7. The baseline Python 3.8 measurement showed that the cost was much the same for all three cases, which demonstrates that the price of doing these kinds of things is always being paid there.
Once again, those numbers are going to change over time, but the general idea is that avoiding those kinds of changes will improve the performance of programs. Changing the shape of the object is the worse of the two and the code where he has seen that being done looks to him like it was meant to be a memory-saving technique. But doing so forces the interpreter to use a less-efficient representation for the object, so those savings are illusory. In general, code should set attributes with the same names on all objects of the same class, and do so in the same order, if it can. In passing, he noted that using __slots__ is now the fastest way to handle attributes on classes in Python.
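A short sketch of what that advice might look like in practice; the three classes are hypothetical examples, not code from the talk. The first sets an attribute only when it is needed, the second always sets the same attributes in the same order, and the third declares them with __slots__:

```python
class SparseUser:
    """Sets attributes only when needed, so instances differ in shape."""

    def __init__(self, name, email=None):
        self.name = name
        if email is not None:
            self.email = email


class UniformUser:
    """Always sets the same attributes, in the same order."""

    def __init__(self, name, email=None):
        self.name = name
        self.email = email


class SlottedUser:
    """Declares its attributes up front; instances get no per-object __dict__."""

    __slots__ = ("name", "email")

    def __init__(self, name, email=None):
        self.name = name
        self.email = email


def shapes(cls):
    # Collect the distinct attribute-name sequences seen across two instances.
    users = (cls("a"), cls("b", "b@example.com"))
    return {tuple(vars(user)) for user in users}


sparse_shapes = shapes(SparseUser)      # two different shapes
uniform_shapes = shapes(UniformUser)    # a single shape
slotted_has_dict = hasattr(SlottedUser("a"), "__dict__")
```

The sparse class produces two shapes for the same type, while the uniform class produces one. With __slots__, assigning an attribute that was not declared raises AttributeError, so the shape cannot drift by accident.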
Method calls are a special case of attribute lookup where the attribute's value is immediately used to call a function. There is some old advice that if you are repeatedly doing a method call, say in a loop, retrieving the method once before the loop and caching it to use inside the loop is a way to get better performance. For Python 3.8, there is a noticeable improvement of about 66% when doing so, but the newer Pythons actually see a performance degradation.
The reason is that method calls are one of the areas where optimizations have been focused and, in general, the new Pythons "want to see more of your code at once to optimize more of it, especially in this particular case". But caching the method outside of the loop will mean that those optimizations no longer apply. That is most true for built-in types; for methods on Python classes, there is still an improvement for caching the method lookup, but it is much smaller for all three of the interpreters measured.
He also measured lookups for functions in modules. He chose math.sqrt() because it is effectively just a single instruction, so everything measured is the overhead of the lookup and call. There is an improvement, especially in Python 3.8 (86%), for caching that lookup, but it is fairly modest for the others (roughly 15% for both). Maybe that 15% is enough, he said, but it is quite a bit smaller than before. For more typical calls from modules that actually perform some work, the savings is even more modest in all cases.
As attribute lookups get faster, the benefits of this caching technique get smaller, Modzelewski said. His personal advice is to stop caching method and function lookups; it is not worth the mental overhead and readability hit. As these optimized Pythons get smarter, the savings will get even smaller. He did not think that kind of code should be removed from existing programs but thinks that particular piece of advice can be left by the wayside going forward.
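Since the results differ between interpreters, the simplest check is to time both styles on the interpreter that will actually be used. The following is a minimal sketch using the standard timeit module; the function names and loop sizes are arbitrary choices for illustration, and it is not the benchmark from the talk:

```python
import math
import timeit


def append_direct(n):
    out = []
    for i in range(n):
        out.append(i)          # method looked up on every iteration
    return out


def append_cached(n):
    out = []
    append = out.append        # older idiom: hoist the bound method out of the loop
    for i in range(n):
        append(i)
    return out


def sqrt_direct(n):
    total = 0.0
    for i in range(n):
        total += math.sqrt(i)  # module attribute looked up on every iteration
    return total


def sqrt_cached(n):
    sqrt = math.sqrt           # older idiom: cache the function in a local
    total = 0.0
    for i in range(n):
        total += sqrt(i)
    return total


def best_time(func, n=100_000, repeat=5):
    """Return the fastest of `repeat` single runs of func(n), in seconds."""
    return min(timeit.repeat(lambda: func(n), number=1, repeat=repeat))


if __name__ == "__main__":
    for func in (append_direct, append_cached, sqrt_direct, sqrt_cached):
        print(f"{func.__name__:15s} {best_time(func):.4f}s")
```

Running the same script under each interpreter of interest shows whether the cached variants still pay for their loss of readability there. Taking the minimum of several runs reduces noise from other activity on the machine.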
Other considerations
There are dynamic features in Python that are expensive now, but will likely get even more expensive over time. He did not go into any detail about them, but pointed out that using some of them may have wide-ranging effects because they inhibit some of the optimizations that are being added. So they are not just expensive where you are using them, but they might affect the performance of other parts of the code.
A problem that the community needs to address is that attaching a profiler to a Python program effectively disables almost all of the optimizations. At least that is true for Pyston; he is not sure if it is the case for Faster CPython or Cinder. It is, in general, a hard problem, he said; developers may be profiling a different version of the code than will actually be running. There may need to be a different profiling API or some other way to solve that problem.
The last thing Modzelewski wanted to talk about was C extensions, which are generally used either for bindings to another language or for providing better performance. The common wisdom is to use Cython or some other mechanism to convert performance-critical code to C, "but this situation is getting pretty murky now". All of the optimizations that he had been talking about currently only apply to Python code, so C extensions have a certain set of optimizations, while the interpreter has a different set. So which is going to improve performance depends on which set is going to help your code the most.
It is hard to give a good rubric for when to choose one over the other, but he converted his attribute lookup benchmark to a C extension using Cython to see. It showed a good improvement over Python 3.8, but was far worse for Pyston. He apparently was unable to measure Faster CPython but did not say why that was. He noted that Cython does not do any of the optimizations he had been talking about, but there is no barrier to doing so that he is aware of, so those could be adopted by Cython over time.
His feeling is that object-oriented code is going to be helped more by the new interpreters, while numeric code will continue to be improved using C extensions. It is something that developers will need to verify for themselves, however, as the situation is rather complicated right now. Unfortunately there is not a lot of help available to guide developers toward writing more performant Python; using these tips, doing experiments, and benchmarking is the way forward at this point.
The overall goal of the new optimizations is to not make Python code pay for dynamic features that it is not using. That is great, he concluded, but it adds new complexity to the decisions programmers will need to make when they are trying to squeeze the best performance out of their code. Avoiding unneeded dynamic features, and finding other ways to accomplish the same goals, is generally the new path to follow, though.
[I would like to thank LWN subscribers for supporting my trip to Salt Lake City for PyCon.]
Index entries for this article | |
---|---|
Conference | PyCon/2022 |
Python | Performance |
Posted May 4, 2022 22:53 UTC (Wed)
by bluss (subscriber, #47454)
Posted May 4, 2022 23:50 UTC (Wed)
by NYKevin (subscriber, #129325)
Just to clarify: This is referring to the *global* (module-level) scope. Python does not search local or enclosing scopes at runtime; that portion of name resolution is instead done at compile time and the compiler (usually) emits LOAD_FAST bytecode instructions if it determines that a variable is local (enclosing variables are more complicated because you have to set up a closure object). As a result, it is possible to dynamically create new global variables at runtime, but local/enclosing variables can only be created at compile time.
> It is hard to give a good rubric for when to choose one over the other, but he converted his attribute lookup benchmark to a C extension using Cython to see. It showed a good improvement over Python 3.8, but was far worse for Pyston. He apparently was unable to measure Faster CPython but did not say why that was. He noted that Cython does not do any of the optimizations he had been talking about, but there is no barrier to doing so that he is aware of, so those could be adopted by Cython over time.
This is not terribly surprising to me. Cython actually *can* perform these optimizations, but it normally does so by transpiling to C, which results in non-equivalent language semantics. So you have to add special (nonstandard) type annotations to tell Cython that it's OK to change the semantics in that way (or else it emits Python/C API calls, which are strictly equivalent to the interpreter's behavior, but slower). I can't speak for them, so I can't say whether they will have any interest in adding "automatic" optimizations for pure Python code, but anything's possible.
Posted May 5, 2022 4:26 UTC (Thu)
by nimisht (subscriber, #128741)
It can do other stuff like cross language optimization but its bread and butter is stuff like speculation and polymorphic inline caching.
I wonder how much faster python could be if it had the resources of v8 behind it
Posted May 7, 2022 3:26 UTC (Sat)
by Fowl (subscriber, #65667)
I guess their respective JITs aren't tuned for the dynamism of python and the advantage would be more around interoperability with the respective ecosystems - I'm sure it's easier (and more portable, safe) to drop into C# or java/kotlin for something performance sensitive than C. Compat with the existing C extensions would be an issue too I guess.
Posted May 7, 2022 18:20 UTC (Sat)
by NYKevin (subscriber, #129325)