Memory segfaults after V2 upgrade #7211
Comments
Thanks for reporting this. The most likely explanation is that pydantic v2 is using more memory than v1, and when you exceed the available memory, segfaults occur. Maybe try checking the memory usage right before the segfault, or running fewer workers and seeing whether the segfaults stop. I don't know of any other reason why pydantic v2 should segfault; if you can give us more detail, we'll investigate immediately. |
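A minimal sketch of that kind of memory check, using only the standard library (where exactly you hook the call into your worker code is up to you):

```python
import resource
import sys

def log_rss(tag: str) -> None:
    # On Linux ru_maxrss is the peak resident set size in kilobytes
    # (on macOS it is reported in bytes).
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"[{tag}] peak RSS: {peak} kB", file=sys.stderr, flush=True)

# e.g. call log_rss("before-validation") around the code paths that later crash
```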
Hey @samuelcolvin ! We ran one experiment on prod to collect more info about the segmentation faults. We captured 3 cases: in two of them the segfault occurred while the current thread was holding the GIL and running a garbage collection cycle, and in one it was just holding the GIL. We also got core dumps from the machine, but they don't help much; it's hard to read what is going on in the runtime:
Do you think it is something you can work with? |
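A sketch of one way to check whether the crashes really line up with collection cycles, using the standard-library `gc.callbacks` hook (where the log ends up is an assumption):

```python
import gc
import sys
import time

def gc_logger(phase: str, info: dict) -> None:
    # The interpreter calls this with phase "start"/"stop" around every collection;
    # info includes the generation being collected.
    print(f"{time.time():.3f} gc {phase} gen={info['generation']}",
          file=sys.stderr, flush=True)

gc.callbacks.append(gc_logger)
```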
Do you run very recursive models? It may not be heap memory but a stack overflow. Does it reproduce if you update to Python 3.11.4? I see no mention of segfault fixes in the 3.11.4 changelog though, so I would guess it won't help. The fact that the crash is different each time feels a little bit like memory corruption to me. We use very limited
Alternatively, is it possible for you to run with debug-instrumented Python and pydantic-core versions so the core dumps are more useful? I can potentially help with configuring a custom pydantic-core build to contain debug info; for Python it depends how your production is deployed. |
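If deep recursion is a suspect, one quick check (only meaningful where the recursion passes through Python frames; the numbers here are arbitrary) is to lower the Python recursion limit so deep validation fails with a RecursionError instead of overflowing the native stack, and to give worker threads a larger stack:

```python
import sys
import threading

# A deliberately low limit surfaces deep Python-level recursion as RecursionError
# instead of a hard crash; tune the value for your models.
sys.setrecursionlimit(500)

# If the crashes happen in worker threads, a bigger per-thread stack can also help.
# Must be called before the threads are started.
threading.stack_size(8 * 1024 * 1024)  # 8 MiB
```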
In pydantic/pydantic-core#922 I've run through the |
@davidhewitt Thanks for the reply! By running debug-instrumented Python, do you mean running my gunicorn using |
It would be a great start if you could download or build your CPython with debug symbols included, so the core dumps are much more readable. Potentially you could also build your own |
Hey @davidhewitt ! How can we instrument our Python with debug symbols? Like build it with extra CFLAGS:
Can you also help me with building
Thanks! |
@StasEvseev I just built my own 3.12 interpreter from source using just the "optimized" configure options here: https://devguide.python.org/getting-started/setup-building/#optimization

This contained debug information, so it looks like the debug info stripping is probably done by your distro packager.

```
~/dev/cpython$ ./python --version
Python 3.12.0
~/dev/cpython$ file ./python
./python: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=7487db6f0e6d73eda7cb2dbddb39706d3658e7b3, for GNU/Linux 3.2.0, with debug_info, not stripped
```

Which Linux distribution are you using? There may be an optional package to install Python debug info alongside the main executable, as an alternative to building from source. (That said, I could not see one for Ubuntu.)

As for pydantic-core:

```
CARGO_PROFILE_RELEASE_STRIP=false CARGO_PROFILE_RELEASE_DEBUG=limited make build-prod
# or if you want fully-optimized
CARGO_PROFILE_RELEASE_STRIP=false CARGO_PROFILE_RELEASE_DEBUG=limited make build-pgo
```

and then you can see that it contains debug info:

EDIT: added suggestion for |
Hey @davidhewitt ! What we are using is the official Python Docker image.
Link to the Dockerfile source: https://github.com/docker-library/python/blob/b7b91ef359a740a91caeabce414ce4ee70fd2b23/3.11/bookworm/Dockerfile#L44. I might try to build a custom Python with your suggested flags. |
If I had to guess, the stripping is done as a linker argument via
|
We also have the same or a similar problem.

```c
/* CAUTION: PyDict_SetItem() must guarantee that it won't resize the
* dictionary if it's merely replacing the value for an existing key.
* This means that it's safe to loop over a dictionary with PyDict_Next()
* and occasionally replace a value -- but you can't insert new keys or
* remove them.
*/
int
PyDict_SetItem(PyObject *op, PyObject *key, PyObject *value)
{
if (!PyDict_Check(op)) {
PyErr_BadInternalCall(); // <------ This line
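// PyErr_BadInternalCall() raises SystemError("bad argument to internal function"),
// i.e. this branch fires when op is not actually a dict object.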
return -1;
}
assert(key);
assert(value);
Py_INCREF(key);
Py_INCREF(value);
return _PyDict_SetItem_Take2((PyDictObject *)op, key, value);
}
```

(or just a segfault without any useful message or traceback)
I can reproduce it locally with a one-worker setup, but unfortunately I cannot figure out a minimal code example; it just happens from time to time. Is there any info that could help you? We have already started updating our project to v2 and now we are stuck with half of our models being v1 and the others v2. |
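If it crashes with no traceback at all, the standard-library faulthandler module will at least dump the Python-level stacks of all threads when a fatal signal arrives; a sketch (enable it as early as possible in the entrypoint):

```python
import faulthandler
import sys

# Dump the Python traceback of every thread to stderr on SIGSEGV, SIGBUS,
# SIGFPE, SIGILL and SIGABRT.
faulthandler.enable(file=sys.stderr, all_threads=True)
```

The same thing can be switched on without code changes via `python -X faulthandler ...` or the `PYTHONFAULTHANDLER=1` environment variable.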
@bogdandm does the error ever include the native stack trace? That would be extremely helpful to review where the problem is coming from. Alternatively, if you are able to get a core dump (e.g. try running with ulimit -c unlimited) and share relevant parts here, that would also greatly help 🙏
|
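If adjusting ulimit from the shell is awkward inside a container, the same soft limit can be raised from Python at process start; a sketch (whether the kernel's core_pattern then writes the core file somewhere readable in your container is a separate question):

```python
import resource

# Roughly `ulimit -c unlimited`: raise the soft core-file limit to the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
```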
@davidhewitt I haven't been able to figure out yet how to get more detailed logs or the usual "core dumped" error (until now I believed that was the default behavior, at least in our Docker environment). I already tried
But I'll probably try again a little later when I have more time to debug it. |
If you have a way to reproduce it locally, perhaps we can also discuss a way for me to help debug your code in a confidential environment. |
Okay, I can reproduce it within gdb, so here is the stack trace:
I can try to compile Python with more debug info if you need it, too. Lib versions:
P.S. This is not a gunicorn-related crash; I used the local Django runserver and enabled gevent on server startup ( |
Hmm, so it looks like the call to
That might also imply there is memory corruption earlier in the process. Are you willing to run under valgrind? (I can help figure out an invocation for this.) We should probably also add valgrind to the pydantic-core CI. |
Nothing specific. It is actually one super large model that describes a whole page on one site. I also suspect some memory corruption; at some point I had weird objects that produced totally random errors. When I started investigating them (obj.dict and other usual stuff) they had random properties from other objects. E.g. a simple lazy translation string (gettext_lazy from Django) had
I can try valgrind; in a local environment it is probably safe enough. You can contact me on LinkedIn (link in my GitHub profile). |
I was able to run valgrind on the

```
valgrind --leak-check=full --track-origins=yes --log-file=valgrind-output.txt python -m pytest
```

The contents of

If you're getting a lot of messages, you might want to check if you have |
@bogdandm Thanks for jumping on the issue and helping with the investigation! For me it is a little bit troublesome to reproduce in a local environment (due to a complex setup). |
I contacted @davidhewitt and gave him all the logs that I was able to collect from my project. So now all hope is that he will be able to figure it all out 🙏🏻 |
Yep, I'm looking into this at present and hope to have some progress within a few weeks. Will keep you posted here. |
Just ran into similar issues. M1, Python 3.12.0 and 3.12.1. Pydantic 2.5.2. It only happens with gevent monkey-patched. I also see that we are all using Flask. |
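As an illustration of that kind of setup (not a confirmed reproducer; the model and field names are made up), monkey-patching is applied first and then validation is hammered from many greenlets:

```python
from gevent import monkey

monkey.patch_all()  # must run before anything else imports threading/socket machinery

import gevent
from pydantic import BaseModel


class Item(BaseModel):  # hypothetical model, purely for illustration
    name: str
    value: int


def worker(i: int) -> None:
    for _ in range(10_000):
        Item.model_validate({"name": f"item-{i}", "value": i})


gevent.joinall([gevent.spawn(worker, i) for i in range(20)])
```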
So I am getting multiple errors; they seem to be pretty random, but it's mostly SIGSEGV/SIGBUS. I am also running into
I compiled a debug version of Python, and while those errors still happen, a new one started to appear:
It's a bit weird that faulthandler does not list pydantic-core in the extension modules. Also, I'm running this with the following env variables to reduce the number of C extensions used: |
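One way to double-check which compiled extension modules are actually loaded (and whether pydantic_core shows up among them) is to scan sys.modules; a sketch:

```python
import sys

compiled = sorted(
    name
    for name, mod in sys.modules.items()
    if (getattr(mod, "__file__", "") or "").endswith((".so", ".pyd"))
)
print("\n".join(compiled))  # pydantic_core._pydantic_core should appear if it is loaded
```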
I was able to reproduce this error with @rafales's example from #8392. Thanks so much @rafales, that's really helpful. @davidhewitt and I will do some further digging, specifically:
|
I think it's very likely this is related to gevent/gevent#1819. My dumb theory: gevent is switching threads when pydantic-core/pyo3 effectively calls |
Ok, some progress here: I can isolate the crash to just PyO3 +
I will work to figure out next steps from here. We have at least one pathway to a solution (in the new PyO3 API) but maybe there are mitigations we can get across the ecosystem faster. |
To follow up with the current state of things: in PyO3 we felt that mitigations are probably impractical from a performance standpoint, so we are busy getting the new PyO3 API to a point where projects can migrate to it. This might be a few weeks off still, depending on review speed. |
Any update on the state of the problem? |
We need to wait for the new pyo3 API / GIL pool work. That's getting pretty close; check the progress in the pyo3 repo. |
With the release now done in PyO3 0.21, and pydantic-core updated, I can no longer reproduce the crash on pydantic |
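For anyone wanting to confirm which versions they actually have installed, pydantic's own version_info() helper prints the pydantic and pydantic-core details in one go (which exact versions contain the fix is left open above):

```python
from pydantic.version import version_info

# Prints pydantic, pydantic-core and platform/build details in one block.
print(version_info())
```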
Initial Checks
Description
Thanks for the amazing project!
We have been using pydantic for a couple of years and it has become a standard building block for our codebase.
Everything seemed to work until we made the change to v2; since then there have been some problems with segfaults in our production environment.
We haven't figured out a way to reproduce it locally, so we can't provide more details than just the logs from our production environment.
Our setup:
And these are the segfaults we are facing in production:
Example Code
No response
Python, Pydantic & OS Version
Selected Assignee: @dmontagu