`torch.load(..., weights_only=True)` currently raises a Deprecation warning + [proposal] `weights_only=True` should become default for safe legacy-loading pickles · Issue #52181 · pytorch/pytorch · GitHub

torch.load(..., weights_only=True) currently raises a Deprecation warning + [proposal] weights_only=True should become default for safe legacy-loading pickles #52181

Closed
vadimkantorov opened this issue Feb 12, 2021 · 64 comments
Assignees
Labels
feature A request for a proper, new feature. high priority module: hub module: serialization Issues related to serialization (e.g., via pickle, or otherwise) of PyTorch objects topic: security triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@vadimkantorov
Contributor
vadimkantorov commented Feb 12, 2021

A safe way of loading only weights, one that doesn't allow arbitrary unpickling and thus arbitrary code execution. Maybe an option for torch.load?

Yes, one should not load/run code from unknown locations, but sometimes intermediate controls would be good: e.g. allowing only known types to be loaded, such as tensors (and not model instances or other things), bypassing the generic unpickling mechanism

  • maybe make it super-clear that torch.hub.load actually executes code at load/unpickling time

(I've long been a proponent of standardized formats for weight storage, such as HDF5, but this hasn't gotten traction)
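As a sketch of what such an intermediate control could look like, using only the stdlib (the allowlist below is hypothetical; conceptually this is what a weights-only mode does under the hood):

```python
import io
import pickle

class RestrictedUnpickler(pickle.Unpickler):
    """Refuse to resolve any global that isn't on an explicit allowlist."""
    ALLOWED = {("collections", "OrderedDict")}  # hypothetical allowlist

    def find_class(self, module, name):
        if (module, name) in self.ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain containers of primitives deserialize fine...
assert restricted_loads(pickle.dumps({"w": [1.0, 2.0]})) == {"w": [1.0, 2.0]}

# ...but a payload that smuggles in os.system is rejected instead of executed.
try:
    restricted_loads(b"cos\nsystem\n(S'echo pwned'\ntR.")
except pickle.UnpicklingError as e:
    print("blocked:", e)
```

The crucial property is that `find_class` runs before any constructor is invoked, so forbidden globals are rejected rather than executed.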

cc @ezyang @gchanan @zou3519 @mruberry @nairbv @NicolasHug @vmoens @jdsgomes @bdhirsh @jbschlosser @anjali411 @ailzhang

Also, the popularity of the HuggingFace hub (and the existing torch.hub) makes this more acute. At some point a malicious model will be uploaded there and become popular on Twitter, e.g. because it composites very cute cats into existing images. A malicious model can at the very least hijack some precious GPU compute, and at worst take over institute / company local computer networks.

@albanD albanD added feature A request for a proper, new feature. module: hub module: serialization Issues related to serialization (e.g., via pickle, or otherwise) of PyTorch objects triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Feb 12, 2021
@rgommers
Collaborator

+1 for building on standardized formats. Zarr may be nicer than HDF5 because it's better in cloud environments and it doesn't need such heavyweight libraries to support it.

The one interesting thing I noticed in https://pytorch.org/docs/stable/notes/serialization.html is that when saving multiple tensors together, their view relationships get preserved. That will require something a little custom perhaps.
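That view preservation is easy to observe directly; a small sketch (assuming a local PyTorch install):

```python
import io
import torch

base = torch.arange(6.0)
view = base[2:5]  # a view sharing storage with `base`

buf = io.BytesIO()
torch.save({"base": base, "view": view}, buf)
buf.seek(0)
loaded = torch.load(buf, weights_only=True)

# The view relationship survives the round trip: mutating the loaded base
# is visible through the loaded view, because they still share one storage.
loaded["base"][2] = 100.0
assert loaded["view"][0].item() == 100.0
```

This works because tensors saved together serialize their shared storage once, with each tensor recording its own offset, size and strides into it.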

@vadimkantorov
Contributor Author
vadimkantorov commented Feb 22, 2021

Advantages of HDF5: mature and has bindings for all languages; libhdf5 is relatively easy to compile from source and doesn't have dependencies; natively supports hierarchical named sub-arrays.
Disadvantages: it doesn't support more than 64000 columns (or didn't a few years ago); relatively heavyweight (streaming slabs, compression support, etc.); but maybe still better suited than abusing protobuf, with its constant warnings about unsupported files larger than 2Gb.

@rgommers
Collaborator

There also seems to be some confusion about what save and load are actually for. https://pytorch.org/docs/stable/hub.html has this example:

[screenshot: the torch.hub example from https://pytorch.org/docs/stable/hub.html]

Local checkpointing makes perfect sense, and for that pickling is just fine. But pickling isn't guaranteed to be portable across Python versions, so it shouldn't be used for publishing models for download like that.

advantages of HDF5: mature and has bindings for all languages

Does this matter? The current use cases for .pt certainly are Python-only. If it's about binary storage for data exchange, that's a different thing than what save is meant for.

@vadimkantorov
Contributor Author
vadimkantorov commented Feb 22, 2021

Exchange matters somewhat, since at times people will re-implement models, say, in TFv2 or whatever new flavor of JAX, and want to consume older weights without relying on the other framework as a dependency (i.e. h5py is a less intrusive dependency than full PyTorch). I fought this a lot when consuming Caffe weights from PyTorch :( or TFv1 weights from PyTorch. I wish frameworks thought more about this kind of interop :(

@rgommers
Collaborator

Yes that makes sense. Data exchange certainly matters. But then you'd not want features like storing the state of modules, and preserving view relationships (or odd strides, etc.). You'd just want one or more tensors in a well-optimized and portable binary format. Import/export capability for both HDF5 and Zarr would be nice for that.
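Such a tensor-only export is straightforward to sketch with h5py (a sketch assuming h5py is installed; the helper names and layout here are invented, not an existing API):

```python
import os
import tempfile

import h5py
import torch

def export_state_dict_hdf5(state_dict, path):
    """Dump each tensor as a plain named dataset; no pickling involved."""
    with h5py.File(path, "w") as f:
        for name, t in state_dict.items():
            # Dots in names like "fc.weight" are legal HDF5 dataset names;
            # only "/" would create nested groups.
            f.create_dataset(name, data=t.detach().cpu().numpy())

def import_state_dict_hdf5(path):
    with h5py.File(path, "r") as f:
        return {name: torch.from_numpy(f[name][()]) for name in f}

model = torch.nn.Linear(3, 2)
path = os.path.join(tempfile.mkdtemp(), "weights.h5")
export_state_dict_hdf5(model.state_dict(), path)
restored = import_state_dict_hdf5(path)
assert torch.equal(restored["weight"], model.state_dict()["weight"])
```

As noted above, this deliberately drops module state, views and odd strides: it is a data-exchange format, not a checkpoint format.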

@KOLANICH
KOLANICH commented Feb 22, 2021

Local checkpointing makes perfect sense, and for that pickling is just fine.

No, it doesn't really. Imagine a proprietary app we don't trust running in a heavily sandboxed environment, and our own app consuming the model produced by that proprietary app. In this case it would be a sandbox escape.

But pickling isn't guaranteed to be portable across Python versions, so shouldn't be used for publishing models for download like that.

People will still use pickle if they have the opportunity, at least because it is the laziest possible solution to the problem of storing models and app settings. Deprecated or not, if it works for them, they will use it. So the only sensible way, if we want to get rid of the insecurity at all, is to make it unusable for them: either make them a little less lazy, or let them go and trouble the users of other frameworks' ecosystems with their insecure code.

@rgommers
Collaborator

No, it doesn't really. Imagine a proprietary app we don't trust running in a heavily sandboxed environment, and our own app consuming the model produced by that proprietary app. In this case it would be a sandbox escape.

That example has nothing to do with local checkpointing. It's model exchange, for which indeed there is ONNX.

@vadimkantorov
Contributor Author

Well, these "local checkpoints" end up as weights files for all torchvision models... Whereas those could have used ONNX...

@rgommers
Collaborator

Well, these "local checkpoints" end up as weights files for all torchvision models... Whereas those could have used ONNX...

This still sounds confusing. "weights files" contain, I assume, data. ONNX is "neural network exchange" - it's for storing models. Pickling has yet again different tradeoffs; it can capture state for pytorch models in ways that ONNX cannot (even aside from ONNX's incompleteness - because ONNX has to work cross-library, while pickling can be pytorch-specific).

It looks to me like, to disentangle this, we need to clearly separate these three things.

@KOLANICH

it can capture state for pytorch models in ways that ONNX cannot

I guess it shouldn't capture the whole state, but only a precisely defined subset of state that can be restored without executing any Turing-complete code stored within the "checkpoint" (or anything that can be escalated to execution of Turing-complete code); no Turing-complete code should be stored there at all. If someone has to store Turing-complete code that must be executed within the data even for such cases, it means they are too lazy to implement the functionality properly. If one has no choice other than to do that, it feels like something is wrong with the system.

So, is there any real necessity for using pickles ("it absolutely cannot be done without Turing-complete deserialization, even if we spent a year of full-time work on redesigning the code and writing auxiliary code that currently isn't needed because pickle does it for us")?

@rgommers
Collaborator

@KOLANICH as I pointed out in my first comment, it's about things like "when saving multiple tensors together, their view relationships get preserved". Turing completeness is beside the point.

even if we spent a year of full-time work on redesigning the code ...

You made your point, you dislike pickle a lot. These kinds of engineering tradeoffs are for the PyTorch maintainers to make though. As I pointed out, there are correct and useful ways of using pickle. And PyTorch isn't alone in that: the Python stdlib provides pickle, and libraries like NumPy and scikit-learn use it as well. So I'd say it's unlikely that anyone will be willing to spend that much effort to design something new to replace correct usage of pickling.

That doesn't negate the point that torch.hub is doing something inappropriate here. That can hopefully be fixed.

@KOLANICH

The Python stdlib provides pickle, and libraries like NumPy and scikit-learn use it as well. So I'd say it's unlikely that anyone will be willing to spend that much effort to design something new to replace correct usage of pickling.

That has made the pretrained models available on the Net (produced by some large corps, BTW) completely useless without the additional effort of reverse-engineering the pickle files to make sure there are no backdoors within them.

"when saving multiple tensors together, their view relationships get preserved".

I wonder how complex the relations are and whether they really can be preserved, e.g. by keeping them in a JSON-like dict serialized into some serialization format for JSON-like dicts, and restoring them with custom code.

These kind of engineering tradeoffs are for the PyTorch maintainers to make though.

I absolutely agree. I just wanna know how complex the task is and how much work it would be. I have also created a framework that may (or may not) simplify the task.

@rgommers
Collaborator

When you pay attention to it, there are indeed a lot of .pt(h) files floating around. I just saw https://github.com/POSTECH-CVLab/PyTorch-StudioGAN come by on Twitter: no warnings, and many people will just start downloading pickle files (they won't know or care about what .pth means).

I searched all repos for other formats and asked a few people. There's not much, only a couple of examples using HDF5 like https://github.com/pytorch/fairseq/blob/89a4d2bc70fd680c4768803d20707ef65df89b0f/examples/wav2vec/wav2vec_featurize.py#L95. And a set of issues and Discourse posts with discussions around correct usage through h5py.

I also have created a framework that may (or may not) simplify the task.

It looks like you're using JSON - that's not a great alternative, it'll be way too slow compared to binary formats.

Just wanna know how much the task is complex and how much work it is.

Right now this isn't actionable, it needs deciding what to do first.

@vadimkantorov
Contributor Author

And PTH is extra-bad for #14864

@KOLANICH
KOLANICH commented Feb 28, 2021

It looks like you're using JSON - that's not a great alternative, it'll be way too slow compared to binary formats.

Not quite. The framework, urm, deals only with

  • linking several unrelationally stored pieces of data to each other, with convenient access to that data
  • storing them in cold and hot storages, and transferring between them when needed.

For serialization it relies entirely on the transformerz lib, a framework allowing one to specify a transformation chain. JSON is only one format within it, and can easily be replaced with CBOR or MsgPack or BSON (and CBOR and BSON have some customization points (1, 2)), or one's own transformer can be written and plugged in there (also a compressor/decompressor).

they won't know or care about what .pth means.

It was a big mistake to give the format its own extension. If it had been given the .pytorch.pkl extension, it would have been immediately obvious that it is a pickle file.

TBH, I usually use TensorFlow (because it has built-in complex numbers).

When I had to use a pretrained model and it turned out the code was for pytorch, I detected that the files are pickle only because I have a lot of experience with python projects and know that when most projects need to ship some data with code, or to serialize, they just rely on pickle instead of developing a format and serialization/parsing routines.

pickle is a large success in the sense that it gives developers something they need at the cost of something they don't care about, don't need, and are willing to sacrifice. Unfortunately, the thing being sacrificed turns out to be security, and this transforms the landscape. I see absolutely no reason for rational data scientists not to register a bunch of fake accounts on GH, publish each new paper under a new account, and embed legal backdoors into each model. (Most open source licenses (except WTFPL) explicitly disclaim any liability, so by using the software and data the user "authorizes" it to scan his PC for unfinished drafts of articles, source code and other confidential information such as ssh keys, and exfiltrate it.) That's why really serious orgs allow their employees to use only software that has been audited, internally or externally; there are corps whose business is auditing dependencies for backdoors. I absolutely won't be surprised to hear that this is already being done in the wild.

@turian
turian commented Apr 28, 2021

I am currently organizing a NeurIPS competition, in which participants might be submitting pytorch models to our evaluation server for the leaderboard.

Is there a secure way of serializing/loading untrusted pytorch models?

Are there alternatives to pickle, which can be insecure? Is there a pytorch model format we can insist upon that is secure?

I believe that ONNX + tensorflow use protobuf, which avoids the security issues of pickling. That's an alternative to HDF5 to consider

I see that this is related to #52596
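In current PyTorch, the narrow answer for untrusted submissions like this is to accept only plain state_dicts and load them with weights_only=True, which swaps in a restricted unpickler. A minimal sketch (the submission is simulated in-memory):

```python
import io
import torch

# A submission that is just a state_dict of tensors.
submission = io.BytesIO()
torch.save(torch.nn.Linear(4, 2).state_dict(), submission)
submission.seek(0)

# weights_only=True only rebuilds tensors and primitive containers;
# arbitrary pickled globals raise an error instead of executing.
state = torch.load(submission, weights_only=True)
assert sorted(state) == ["bias", "weight"]

# Evaluate with our own trusted model code, not code from the submission.
model = torch.nn.Linear(4, 2)
model.load_state_dict(state)
```

The design point is that only data crosses the trust boundary; the model architecture itself stays on the evaluator's side.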

@KOLANICH

https://github.com/CensoredUsername/picklemagic may be helpful for implementing one's own loader for pth files in a controlled way.

@turian
turian commented Apr 29, 2021

@KOLANICH thank you but that repo is five years old with a single contributor. It's hard to trust abandonware that might not even work with the latest pickle format.

@KOLANICH

I have recently used it, though not for pytorch files. Stuff dated this year hasn't worked; that library has.

@turian
turian commented Apr 30, 2021

@KOLANICH have you used it recently?

@KOLANICH

In March. In some python package a very suspicious pickle was inlined... though it turned out to be benign.

@KOLANICH

I wonder what the state of the art is in actually generating format parsers from a formal spec, to ensure the generated C is safe

Kaitai Struct has a Rust target (a C++ target is also present and is thought to be secure, but I'm not sure anyone has really checked it; the C target is very immature; and none of the targets (except the Java one, to some minor extent) currently support serialization).

@vadimkantorov
Contributor Author

Might be good to have a safe mode as the default for pytorch 2.0

@vadimkantorov
Contributor Author

Is weights_only=True the default in PyTorch 2.0 now?

@vadimkantorov
Contributor Author

related #97495

@sjkoelle
sjkoelle commented Jun 20, 2023

Putting in a request for weights_only=True support for OrderedDicts.

Edit: it looks like collections.OrderedDict is supported but typing.OrderedDict is not... the joys of user-submitted models

@sjkoelle

Okay, I went down the rabbit hole on this. It looks like typing.OrderedDict was used in state_dict() in earlier versions of torch like 1.9

9d6639a

Note that this was first changed to collections.OrderedDict and then finally just Dict

9fae076

Many issues are seemingly somewhat related to this

#72778
python/mypy#6904
voicepaw/so-vits-svc-fork#410 (comment)
#94670 (similar idea different issue)
#94910 (same)

So, in summary: right now I don't think safe loading is compatible with models trained using torch==1.9, and safe loading will fail on floats for torch<2.0.0. Adding typing.OrderedDict to the allowed pickle types would enable backwards compatibility with models trained using torch==1.9

@ezyang
Contributor
ezyang commented Jun 22, 2023

Adding typing.OrderedDict to the allowed list shouldn't be too hard. Send us a patch?

@vadimkantorov
Contributor Author

@malfet using weights_only=True gives an ungentle error: "TypedStorage is deprecated" (please see #109108 for repro)

@vadimkantorov
Contributor Author

I hope that TorchServe loads models with weights_only: https://www.oligo.security/shelltorch

Otherwise, one untrusted model can gain access to the server (e.g. one team's model can steal some private data from another team's model)

@vadimkantorov vadimkantorov changed the title Safe way of loading only weights from *.pt file by default torch.load(..., weights_only=True) currently raises a Deprecation warning + [proposal] weights_only=True should become default for legacy-loading pickles Nov 6, 2023
@vadimkantorov vadimkantorov changed the title torch.load(..., weights_only=True) currently raises a Deprecation warning + [proposal] weights_only=True should become default for legacy-loading pickles torch.load(..., weights_only=True) currently raises a Deprecation warning + [proposal] weights_only=True should become default for safe legacy-loading pickles Nov 10, 2023
pytorchmergebot pushed a commit that referenced this issue Nov 14, 2023
Use the same strategy as for unsafe pickler, i.e. use dummy `torch.serialization.StorageType` to represent legacy typed storage classes during deserialization. Add `_dtype` property to be able to use it for both new and legacy format deserialization.

Parametrize `test_serialization_new_format_old_format_compat`

Add a regression test to validate that loading legacy models can be done
without any warnings

Before the change:
```
% python test_serialization.py -v -k test_serialization_new_format_old_format_compat_
test_serialization_new_format_old_format_compat_cpu (__main__.TestBothSerializationCPU) ... ok
test_serialization_new_format_old_format_compat_safe_cpu (__main__.TestBothSerializationCPU) ... /Users/nshulga/git/pytorch/pytorch/torch/_utils.py:836: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
ok

----------------------------------------------------------------------
Ran 2 tests in 0.116s

OK
```
Without the change but update test to catch warnings:
```
 % python test_serialization.py -v -k test_serialization_new_format_old_format_compat_
test_serialization_new_format_old_format_compat_weights_only_False_cpu (__main__.TestBothSerializationCPU) ... ok
test_serialization_new_format_old_format_compat_weights_only_True_cpu (__main__.TestBothSerializationCPU) ... FAIL

======================================================================
FAIL: test_serialization_new_format_old_format_compat_weights_only_True_cpu (__main__.TestBothSerializationCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/nshulga/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2536, in wrapper
    method(*args, **kwargs)
  File "/Users/nshulga/git/pytorch/pytorch/torch/testing/_internal/common_device_type.py", line 415, in instantiated_test
    result = test(self, **param_kwargs)
  File "/Users/nshulga/git/pytorch/pytorch/test/test_serialization.py", line 807, in test_serialization_new_format_old_format_compat
    self.assertTrue(len(w) == 0, msg=f"Expected no warnings but got {[str(x) for x in w]}")
AssertionError: False is not true : Expected no warnings but got ["{message : UserWarning('TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()'), category : 'UserWarning', filename : '/Users/nshulga/git/pytorch/pytorch/torch/_utils.py', lineno : 836, line : None}"]

To execute this test, run the following from the base repo dir:
     python test/test_serialization.py -k test_serialization_new_format_old_format_compat_weights_only_True_cpu

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 2 tests in 0.109s

FAILED (failures=1)

```

Fixes problem reported in #52181 (comment)
Pull Request resolved: #113614
Approved by: https://github.com/kit1980, https://github.com/albanD
@vadimkantorov
Contributor Author
vadimkantorov commented Mar 4, 2024

https://www.bleepingcomputer.com/news/security/malicious-ai-models-on-hugging-face-backdoor-users-machines/

Of course, the real problem there is that execution of internet-published code is the desired scenario

But I think this should at least push PyTorch to make weight loading safe by default... And maybe have an audit of where PyTorch loads unsafe pickles, at least forcing the user to opt in via a mandatory argument: , this_pickle_load_can_execute_viruses=True)

@julien-c
julien-c commented Mar 4, 2024

also quite in favor of something like this^

@vadimkantorov
Contributor Author

given the recent popularity of the "backdoor" theme in general, hoping weights_only=True can become the default and that an audit of all unsafe pickle.load in the whole pytorch codebase can be done :)

@vadimkantorov
Contributor Author
vadimkantorov commented Nov 4, 2024

@Jannat-sultana

When I tried to upload the pickle file to the HF Space for my ensemble model, it shows an hf picklescan error on prediction. Can you provide any solutions?

joblib
gradio==4.44.1
huggingface_hub
scikit-learn==1.2.2
seaborn
matplotlib
lightgbm
xgboost
numpy
pandas

This is my requirements.txt file

@mikaylagawarecki
Contributor

@Jannat20242NSU that looks like an issue that should be filed in the respective huggingface repository that shows the picklescan error

@mikaylagawarecki
Contributor

Closing as complete since the weights_only default was flipped in #137602
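For readers landing here later: since PyTorch 2.6, a bare torch.load behaves as if weights_only=True had been passed. Passing the flag explicitly, as in this sketch, keeps the snippet behaving the same on older versions too:

```python
import io
import torch

buf = io.BytesIO()
torch.save(torch.nn.Linear(2, 2).state_dict(), buf)
buf.seek(0)

# On PyTorch >= 2.6 this line is equivalent to torch.load(buf);
# the explicit flag also works on earlier versions.
state = torch.load(buf, weights_only=True)
assert sorted(state) == ["bias", "weight"]

# Code that genuinely needs full unpickling must now opt in loudly:
#   torch.load(path, weights_only=False)  # only for checkpoints you trust
```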
