pickle is a security issue #52596
Comments
Related: #52181 ... One of my suggestions was supporting hdf5 for safe weights storage.
It already has a clear warning in https://pytorch.org/docs/master/generated/torch.load.html. Deprecating doesn't seem warranted; the Python pickle module itself (https://docs.python.org/3/library/pickle.html) has a similar warning. A good alternative does seem needed indeed. That part of the issue seems to be a duplicate of gh-52181.
torch.hub.load doesn't have such a big, clear notice. It may be good if it printed a UserWarning that's easier to notice (not everyone would check the docs for these "simple" methods). I also didn't see such big red warnings at model zoos like Hugging Face. A nightmare scenario: some "cool" recent model is uploaded to Hugging Face by malicious actors, publicized on Twitter, and then gets backdoor access at research labs all over the world... It must execute code; I propose to have some ...
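(For illustration only: a minimal sketch of the kind of warning proposed above, as a hypothetical wrapper around torch.hub.load. The wrapper name and message are made up, not an existing API.)

```python
import warnings

import torch

def hub_load_with_warning(repo, model, *args, **kwargs):
    # Hypothetical wrapper: surface the pickle risk at call time,
    # not only in the documentation.
    warnings.warn(
        "torch.hub.load may execute arbitrary code from the downloaded "
        "repository and its pickled checkpoints; only load models from "
        "sources you trust.",
        UserWarning,
        stacklevel=2,
    )
    return torch.hub.load(repo, model, *args, **kwargs)
```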
It doesn't prevent people from using it to load models, nor from storing models in this format.
Not enough. Phasing it out is the only feasible way in the long term. In the short term the solution is to defuse the pickle that is used and to make it clear, not only to the devs of the software but also to its users, that they should blame devs who still use pickle. Such dangerous temporary solutions, when introduced, should have a clear deadline. Also, IMHO it was a big mistake to introduce pickle into Python at all.
cc @ailzhang for hub -- what do you think? Folks are concerned about who can post to hub and whether downloaded models can contain viruses. cc @suo for package or deploy: I remember there was some discussion around unifying the serialization behavior of torch.save/load (and using zipfile serialization?). One of the things proposed in this issue is to change the serialization format for PyTorch. We probably don't want to go down the route of yet another serialization format here.
For zipfile - I hope it doesn't resort to good old pickle-based torch.load to handle loading of weights from inside the zipfile. In my opinion, all existing frameworks, including TF / tfjs, have screwed up weights serialization, so we already ended up with 10 different formats :( So adding another "good" one should not be a problem, if it is secure, performant and interoperable.
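(Side note, to make this concrete: the zipfile-based checkpoint format still carries a pickle stream for the object graph; only the raw tensor bytes live in separate entries. A quick way to check, assuming a current torch install — exact entry names may vary by version:)

```python
import zipfile

import torch

# Save a trivial state_dict using the zipfile-based serializer (the default).
torch.save({"w": torch.zeros(2, 2)}, "ckpt.pt")

with zipfile.ZipFile("ckpt.pt") as zf:
    print(zf.namelist())
    # Typically something like ['ckpt/data.pkl', 'ckpt/data/0', 'ckpt/version', ...]:
    # the object graph is still pickled in data.pkl; the 'data/*' entries hold
    # the raw tensor storage bytes.
```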
Also I wonder if Hugging Face does anything about verifying who submits models there. I think the Hugging Face hub is now getting more popular than the slow-moving torchhub one.
I don't think WHO sends the model is the main issue. If the models themselves cannot be Turing-complete, if the combination of a model plus the PyTorch machinery by itself cannot be Turing-complete, and if models don't allow privileged operations within the computation graph (such as calling any API whose effect is not simple computation), we are likely safe, provided operations on numbers are implemented correctly (they can be not: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3177 ). So at first we should concentrate not on WHO can upload models, but on WHAT models can do.

Authenticating models via digital signatures is clearly useful, but its effect will be very minor. The problem with models is that they are large, binary and hard for humans to audit. It won't help much to know that a model was published by huggingface if we still don't trust huggingface and have no real reason to trust them. Also, there are parties other than huggingface. Should we not use their models because we don't trust them? So what we need in the first place is not attribution, but a proof that no model can compromise the integrity of a system or the confidentiality of the information processed in it, under some sensible threat model. Then we trust not the persons issuing the models, but the belief that our threat model is sensible.
After looking at the https://github.com/python/cpython/blob/0d04b8d9e1953d2311f78b772f21c9e07fbcbb6d/Lib/pickle.py#L1137 implementation (which can be considered a reference implementation of the feature), it looks like the above requirements can be achieved by restricting which classes/methods are safe to instantiate via the find_class method, as referenced in pytorch/torch/serialization.py, lines 1086 to 1093 in 1a33e94.
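(A minimal sketch of the find_class-based restriction mentioned above, assuming a hand-picked allowlist; the allowed set here is illustrative and much smaller than what a real loader would need:)

```python
import io
import pickle

# Illustrative allowlist: a few containers and torch rebuild helpers only.
_ALLOWED = {
    ("collections", "OrderedDict"),
    ("torch._utils", "_rebuild_tensor_v2"),
    ("torch", "Size"),
}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Refuse to resolve any global that is not explicitly allowlisted;
        # this is what blocks the usual __reduce__-based code-execution payloads.
        if (module, name) not in _ALLOWED:
            raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")
        return super().find_class(module, name)

def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()
```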
This addresses the security issue in the default Python `unpickler` that allows arbitrary code execution while unpickling. It restricts the classes allowed to be unpickled to `None`, `int`, `bool`, `str`, `float`, `list`, `tuple`, `dict`/`OrderedDict`, as well as `torch.Size`, `torch.nn.Parameter`, and the `torch.Tensor` and `torch.Storage` variants. `weights_only` defaults to `False`, but a global override to weights-only loading is available via the `TORCH_FORCE_WEIGHTS_ONLY_LOAD` environment variable. To some extent, addresses #52596. Pull Request resolved: #86812. Approved by: https://github.com/ezyang
* Tweak several serialization tests to store models' state_dict (#87143). Namely, change: `test_meta_serialization`, `test_serialization_2gb_file`, `test_pathlike_serialization`. Pull Request resolved: #87143. Approved by: https://github.com/ezyang (cherry picked from commit 4a533f1)
* Add `weights_only` option to `torch.load` (#86812), with the same description as above. Pull Request resolved: #86812. Approved by: https://github.com/ezyang (cherry picked from commit 961ebca)
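(For anyone landing here from search, a short usage sketch of the `weights_only` option described above; the file path is a placeholder:)

```python
import torch

torch.save({"weight": torch.randn(3, 3)}, "model.pt")  # placeholder path

# Restricted load: only allowlisted types (tensors, plain containers, etc.)
# are reconstructed; pickles referencing arbitrary globals are rejected.
state = torch.load("model.pt", weights_only=True)

# Setting TORCH_FORCE_WEIGHTS_ONLY_LOAD=1 in the environment forces the same
# restricted behavior even for torch.load calls that don't pass weights_only=True.
```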
Is there any update on this issue? Is this true also for TorchScript files? If I trace a model and then load it, is this risk still there? In my understanding, when we trace a model, only the weights are picked up. Can anybody please confirm? @KOLANICH @malfet @vadimkantorov Thanks.
Hello! I am concerned about this issue. In our organization, the security team wants to avoid these types of risks, but on the other hand, we data scientists want to work with torch hub and the Hugging Face hub. I was wondering whether there is any way of checking the safety of a torch model in an environment like Google Colab prior to downloading it onto our private servers. Do you think this could work?
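(One hedged idea along these lines — not an official tool, just an assumption on my part: before loading anything, list the GLOBAL/STACK_GLOBAL opcodes in the pickle streams inside the downloaded checkpoint from a throwaway runtime like Colab, and only then copy the file to trusted servers:)

```python
import pickletools
import zipfile

def list_pickle_globals(checkpoint_path: str):
    """Print GLOBAL/STACK_GLOBAL opcodes found in a zip-based checkpoint.

    References outside the usual torch/collections internals are a red flag
    worth inspecting before loading the file on a trusted machine.
    """
    with zipfile.ZipFile(checkpoint_path) as zf:
        for name in zf.namelist():
            if not name.endswith(".pkl"):
                continue
            print(f"== {name} ==")
            for opcode, arg, _pos in pickletools.genops(zf.read(name)):
                # GLOBAL carries "module name" as its argument; STACK_GLOBAL
                # takes module/name from the preceding string opcodes.
                if opcode.name in ("GLOBAL", "STACK_GLOBAL"):
                    print(opcode.name, arg)

# list_pickle_globals("downloaded_model.pt")  # placeholder path
```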
🚀 Feature
We need to do something about it.
Motivation
Pickle is a security issue that can be used to hide backdoors. Unfortunately, lots of projects keep using `torch.save` and `torch.load`.
Pitch
- `torch.load`: use pickle only as a serialization format; use an own virtual machine (https://github.com/CensoredUsername/picklemagic can be helpful) for processing pickle files that will do only allowed operations in pytorch itself, in a completely controlled way, instead of relying on the pickle machinery.
- `torch.load`, `torch.save`
- `torch.save` / make it save into ONNX
Alternatives
cc @mruberry @nairbv @NicolasHug @vmoens @jdsgomes @ailzhang