pickle is a security issue #52596
Comments
Related: #52181 ... One of my suggestions was supporting hdf5 for safe weights storage.
It already has a clear warning in https://pytorch.org/docs/master/generated/torch.load.html. Deprecating doesn't seem warranted; the Python pickle module itself (https://docs.python.org/3/library/pickle.html) has a similar warning. A good alternative does seem needed indeed. That part of the issue seems to be a duplicate of gh-52181.
torch.hub.load doesn't have such a big, clear notice. It may be good if it printed a UserWarning that's easier to notice (not everyone would check the docs for these "simple" methods). I also didn't see such big red warnings at model zoos like Hugging Face. A nightmare scenario: some "cool" recent model is uploaded to Hugging Face by malicious actors, publicized on Twitter, and then gets backdoor access at research labs all over the world... It must execute code; I propose to have some ...
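(For illustration only: a minimal sketch of the kind of warning proposed above, as a hypothetical wrapper around torch.hub.load. The wrapper name and message are made up, not an existing API.)

```python
import warnings

import torch

def hub_load_with_warning(repo, model, *args, **kwargs):
    # Hypothetical wrapper: surface the pickle risk at call time,
    # not only in the documentation.
    warnings.warn(
        "torch.hub.load may execute arbitrary code from the downloaded "
        "repository and its pickled checkpoints; only load models from "
        "sources you trust.",
        UserWarning,
        stacklevel=2,
    )
    return torch.hub.load(repo, model, *args, **kwargs)
```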
It doesn't prevent people from using it to load models, nor from storing models in this format.
Not enough. Phasing it out is the only feasible way in the long term. In the short term the solution is to defuse the pickle that is used and to make it clear, not only to the devs of the software but also to its users, that they should blame devs who still use pickle. Such dangerous temporary solutions, when introduced, should have a clear deadline. Also, IMHO it was a big mistake to introduce pickle into Python at all.
cc @ailzhang for hub -- what do you think? Folks are concerned about who can post to hub and whether downloaded models can contain viruses. cc @suo for package or deploy: I remember there was some discussion around unifying the serialization behavior of torch.save/load (and using zipfile serialization?). One of the things proposed in this issue is to change the serialization format for PyTorch. We probably don't want to go down the route of yet another serialization format here.
For zipfile - I hope it doesn't resort to good old pickle-based torch.load to handle loading of weights from inside the zipfile. In my opinion, all existing frameworks, including TF / tfjs, have screwed up weights serialization, so we already ended up with 10 different formats :( So adding another "good" one should not be a problem, if it is secure, performant and interoperable.
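(Side note, to make this concrete: the zipfile-based checkpoint format still carries a pickle stream for the object graph; only the raw tensor bytes live in separate entries. A quick way to check, assuming a current torch install — exact entry names may vary by version:)

```python
import zipfile

import torch

# Save a trivial state_dict using the zipfile-based serializer (the default).
torch.save({"w": torch.zeros(2, 2)}, "ckpt.pt")

with zipfile.ZipFile("ckpt.pt") as zf:
    print(zf.namelist())
    # Typically something like ['ckpt/data.pkl', 'ckpt/data/0', 'ckpt/version', ...]:
    # the object graph is still pickled in data.pkl; the 'data/*' entries hold
    # the raw tensor storage bytes.
```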
Also I wonder if Hugging Face does anything about verifying who submits models there. I think the Hugging Face hub is now getting more popular than the slow-moving torchhub one.
I don't think WHO sends the model is the main issue. If the models themselves cannot be Turing-complete, if the combination of a model plus the PyTorch machinery by itself cannot be Turing-complete, and if models don't allow privileged operations within the computation graph (such as calling any API whose effect is not simple computation), we are likely safe, provided operations on numbers are implemented correctly (they can be not: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3177 ). So at first we should concentrate not on WHO can upload models, but on WHAT models can do.

Authenticating models via digital signatures is clearly useful, but its effect will be very minor. The problem with models is that they are large, binary and hard for humans to audit. It won't help much to know that a model was published by huggingface if we still don't trust huggingface and have no real reason to trust them. Also, there are parties other than huggingface. Should we not use their models because we don't trust them? So what we need in the first place is not attribution, but a proof that no model can compromise the integrity of a system or the confidentiality of the information processed in it, under some sensible threat model. Then we trust not the persons issuing the models, but the belief that our threat model is sensible.
After looking at the https://github.com/python/cpython/blob/0d04b8d9e1953d2311f78b772f21c9e07fbcbb6d/Lib/pickle.py#L1137 implementation (which can be considered a reference implementation of the feature), it looks like the above requirements can be achieved by restricting which classes/methods are safe to instantiate via the find_class method, as referenced in pytorch/torch/serialization.py, lines 1086 to 1093 in 1a33e94.
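(A minimal sketch of the find_class-based restriction mentioned above, assuming a hand-picked allowlist; the allowed set here is illustrative and much smaller than what a real loader would need:)

```python
import io
import pickle

# Illustrative allowlist: a few containers and torch rebuild helpers only.
_ALLOWED = {
    ("collections", "OrderedDict"),
    ("torch._utils", "_rebuild_tensor_v2"),
    ("torch", "Size"),
}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Refuse to resolve any global that is not explicitly allowlisted;
        # this is what blocks the usual __reduce__-based code-execution payloads.
        if (module, name) not in _ALLOWED:
            raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")
        return super().find_class(module, name)

def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()
```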
This addresses the security issue in the default Python `unpickler` that allows arbitrary code execution while unpickling. It restricts the classes allowed to be unpickled to `None`, `int`, `bool`, `str`, `float`, `list`, `tuple`, `dict`/`OrderedDict`, as well as `torch.Size`, `torch.nn.Parameter`, and the `torch.Tensor` and `torch.Storage` variants. `weights_only` defaults to `False`, but a global override to weights-only loading is available via the `TORCH_FORCE_WEIGHTS_ONLY_LOAD` environment variable. To some extent, addresses #52596. Pull Request resolved: #86812. Approved by: https://github.com/ezyang
* Tweak several serialization tests to store models' state_dict (#87143). Namely, change: `test_meta_serialization`, `test_serialization_2gb_file`, `test_pathlike_serialization`. Pull Request resolved: #87143. Approved by: https://github.com/ezyang (cherry picked from commit 4a533f1)
* Add `weights_only` option to `torch.load` (#86812), with the same description as above. Pull Request resolved: #86812. Approved by: https://github.com/ezyang (cherry picked from commit 961ebca)
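(For anyone landing here from search, a short usage sketch of the `weights_only` option described above; the file path is a placeholder:)

```python
import torch

torch.save({"weight": torch.randn(3, 3)}, "model.pt")  # placeholder path

# Restricted load: only allowlisted types (tensors, plain containers, etc.)
# are reconstructed; pickles referencing arbitrary globals are rejected.
state = torch.load("model.pt", weights_only=True)

# Setting TORCH_FORCE_WEIGHTS_ONLY_LOAD=1 in the environment forces the same
# restricted behavior even for torch.load calls that don't pass weights_only=True.
```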
Is there any update on this issue? Is this true also for TorchScript files? If I trace a model and then load it, is this risk still there? In my understanding, when we trace a model, only the weights are picked up. Can anybody please confirm? @KOLANICH @malfet @vadimkantorov Thanks.
Hello! I am concerned about this issue. In our organization, the security team wants to avoid these types of risks, but on the other hand, we data scientists want to work with torch hub and the Hugging Face hub. I was wondering whether there is any way of checking the safety of a torch model in an environment like Google Colab prior to downloading it onto our private servers. Do you think this could work?
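(One hedged idea along these lines — not an official tool, just an assumption on my part: before loading anything, list the GLOBAL/STACK_GLOBAL opcodes in the pickle streams inside the downloaded checkpoint from a throwaway runtime like Colab, and only then copy the file to trusted servers:)

```python
import pickletools
import zipfile

def list_pickle_globals(checkpoint_path: str):
    """Print GLOBAL/STACK_GLOBAL opcodes found in a zip-based checkpoint.

    References outside the usual torch/collections internals are a red flag
    worth inspecting before loading the file on a trusted machine.
    """
    with zipfile.ZipFile(checkpoint_path) as zf:
        for name in zf.namelist():
            if not name.endswith(".pkl"):
                continue
            print(f"== {name} ==")
            for opcode, arg, _pos in pickletools.genops(zf.read(name)):
                # GLOBAL carries "module name" as its argument; STACK_GLOBAL
                # takes module/name from the preceding string opcodes.
                if opcode.name in ("GLOBAL", "STACK_GLOBAL"):
                    print(opcode.name, arg)

# list_pickle_globals("downloaded_model.pt")  # placeholder path
```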
🚀 Feature
We need to do something about it.
Motivation
Pickle is a security issue that can be used to hide backdoors. Unfortunately, lots of projects keep using `torch.save` and `torch.load`.
Pitch
- `torch.load`: use pickle only as a serialization format; use an own virtual machine (https://github.com/CensoredUsername/picklemagic can be helpful) for processing pickle files that will do only allowed operations in pytorch itself, in a completely controlled way, instead of relying on the pickle machinery.
- `torch.load`, `torch.save`
- `torch.save` / make it save into ONNX
Alternatives
cc @mruberry @nairbv @NicolasHug @vmoens @jdsgomes @ailzhang