torch.load(..., weights_only=True) currently raises a Deprecation warning + [proposal] weights_only=True should become default for safe legacy-loading pickles #52181
Comments
+1 for building on standardized formats. Zarr may be nicer than HDF5 because it's better in cloud environments and doesn't need such heavyweight libraries to support it. The one interesting thing I noticed in https://pytorch.org/docs/stable/notes/serialization.html is that when saving multiple tensors together, their view relationships get preserved. That will require something a little custom, perhaps.
Advantages of HDF5: mature and has bindings for all languages; libhdf5 is relatively easy to compile from source and has no dependencies; natively supports hierarchical named sub-arrays.
There also seems to be some confusion about what this is for. Local checkpointing makes perfect sense, and for that pickling is just fine. But pickling isn't guaranteed to be portable across Python versions, so it shouldn't be used for publishing models for download like that.
Does this matter? The current use cases for it...
Exchange somewhat matters, since at times people will be re-implementing models in, say, TFv2 or whatever new flavor of JAX, and want to consume older weights without depending on the other framework (i.e. h5py is a less intrusive dependency than full PyTorch). I fought this a lot when consuming Caffe weights from PyTorch :( or TFv1 weights from PyTorch. I wish frameworks thought more about this interop :(
Yes, that makes sense. Data exchange certainly matters. But then you wouldn't want features like storing the state of modules or preserving view relationships (or odd strides, etc.). You'd just want one or more tensors in a well-optimized and portable binary format. Import/export capability for both HDF5 and Zarr would be nice for that.
No, it doesn't really. Imagine a proprietary app we don't trust running in a heavily sandboxed environment, and our own app consuming the model produced by that proprietary app. In this case it would be a sandbox escape.
People will still use pickle if they have the opportunity, at least because it is the laziest possible solution to the problem of storing models and app settings. Deprecated or not, if it works for them, they will use it. So the only sensible way, if we want to get rid of the insecurity at all, is to make it unusable for them: either make them a little less lazy, or let them go and trouble the users of other frameworks' ecosystems with their insecure code.
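To make the risk concrete, here is a minimal, self-contained sketch (plain Python, nothing PyTorch-specific) of why loading an untrusted pickle is code execution: the `__reduce__` hook lets the payload name any callable to be invoked at load time.

```python
import pickle

class Payload:
    def __reduce__(self):
        # Unpickling calls the returned callable with the given args.
        # Here it's a harmless eval; a real attack could name os.system.
        return (eval, ("1 + 1",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # eval("1 + 1") runs during load
print(result)  # prints 2
```

The callable's name (`builtins.eval` here) is stored right in the byte stream, which is also why an allowlist of importable globals is enough to block this class of payload.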
That example has nothing to do with local checkpointing. It's model exchange, for which indeed there is ONNX.
Well, these "local checkpoints" end up as the weights files for all torchvision models... whereas those could have used ONNX...
This still sounds confusing. "Weights files" contain, I assume, data. ONNX is "neural network exchange": it's for storing models. Pickling has yet again different tradeoffs; it can capture state for PyTorch models in ways that ONNX cannot (even aside from ONNX's incompleteness, because ONNX has to work cross-library while pickling can be PyTorch-specific). It looks to me like, to disentangle this, we need to clearly separate these three things.
I guess it shouldn't capture the whole state, but only a precisely defined subset of state that can be restored without executing any Turing-complete code stored within the "checkpoint" (and that cannot be escalated to execution of Turing-complete code; no Turing-complete code should be stored there at all). If someone has to store Turing-complete code that must be executed within the data even for such cases, it means they are too lazy to implement the functionality properly. If one has no other choice than to do that, it feels like something is wrong with the system. So, is there any real necessity ("it absolutely cannot be done without Turing-complete deserialization, even if we spent a year of full-time work on redesigning the code and writing auxiliary code that currently is not needed because...")?
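A minimal sketch of the "data only, no code" idea this comment argues for: if the checkpoint is restricted to plain values (here JSON; the field names are invented for illustration), restoring it never executes anything from the file.

```python
import json

# Hypothetical declarative checkpoint: plain data only, no callables.
state = {"layers": [{"name": "fc1", "shape": [784, 128]}], "step": 42}

blob = json.dumps(state)
restored = json.loads(blob)  # pure data decoding, no code execution
assert restored == state
```

The restoring side then maps these plain values onto objects with its own, already-trusted code, instead of letting the file pick which code runs.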
@KOLANICH as I pointed out in my first comment, it's about things like "when saving multiple tensors together, their view relationships get preserved". Turing completeness is beside the point.
You made your point: you dislike it. That doesn't negate the point that these relationships still need to be preserved.
That made the pretrained models available on the net (produced by some large corps, BTW) completely useless without the additional effort of reverse-engineering the pickle files to make sure there are no backdoors within them.
I wonder how complex the relations are and whether they really can be preserved, e.g. by keeping them in a JSON-like dict serialized into some serialization format for JSON-like dicts and restored with custom code.
I absolutely agree. I just want to know how complex the task is and how much work it involves. I have also created a framework that may (or may not) simplify the task.
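As a rough sketch of what this comment suggests, view relationships could indeed be flattened into a JSON-like manifest: every tensor records which storage it reads plus (offset, shape, stride), so two views of one buffer simply point at the same storage id. All names here are invented for illustration, not PyTorch's actual format.

```python
import json

manifest = {
    "storages": {"s0": {"dtype": "float32", "numel": 12}},
    "tensors": {
        # Two entries sharing one buffer: a 3x4 matrix and its transpose.
        "weight":   {"storage": "s0", "offset": 0, "shape": [3, 4], "stride": [4, 1]},
        "weight_t": {"storage": "s0", "offset": 0, "shape": [4, 3], "stride": [1, 4]},
    },
}

restored = json.loads(json.dumps(manifest))
# The view relationship survives the round trip as plain data.
assert restored["tensors"]["weight"]["storage"] == restored["tensors"]["weight_t"]["storage"]
```

The loader would allocate each storage once and materialize every tensor entry as a view over it, so no Turing-complete code needs to ride along in the file.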
When you pay attention to it, there's indeed a lot of it around. I searched all repos for other formats and asked a few people. There's not much, only a couple of examples using HDF5 like https://github.com/pytorch/fairseq/blob/89a4d2bc70fd680c4768803d20707ef65df89b0f/examples/wav2vec/wav2vec_featurize.py#L95, and a set of issues and Discourse posts with discussions around correct usage.
It looks like you're using JSON - that's not a great alternative, it'll be way too slow compared to binary formats.
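A quick size comparison backs this up: a raw binary buffer (which is essentially what a tensor payload is) is considerably more compact than the same floats as JSON text, before even counting parsing cost. The numbers depend on the values chosen; this is just an illustration.

```python
import array
import json
import pickle

floats = [float(i) for i in range(100_000)]

as_json = json.dumps(floats).encode()
as_pickle = pickle.dumps(floats, protocol=pickle.HIGHEST_PROTOCOL)
as_binary = array.array("d", floats).tobytes()  # 8 bytes per float64

print(len(as_json), len(as_pickle), len(as_binary))
# The binary form is exactly 800,000 bytes; the JSON text is larger,
# and decoding it back also costs far more CPU than a bulk binary read.
```

Formats like HDF5 and Zarr store the raw buffer (optionally compressed) and keep only the metadata as structured text, getting the best of both.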
Right now this isn't actionable; it needs deciding what to do first.
And the .pth extension is extra bad, see #14864.
Not quite. The framework...
For serialization it entirely relies on...
It was a big mistake to give the format its own extension. If it had been given...
TBH, I usually use TensorFlow (because it has built-in complex numbers). When I had to use one pretrained model and it turned out the code was for PyTorch, I only detected that the files were pickle because I have a lot of experience dealing with Python projects and know that when most projects need to ship some data with code, or to serialize state, they just rely on pickle instead of developing a format with serialization and parsing routines.
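The "you only recognize it as a pickle if you've seen one before" point can be made mechanical: a bare pickle stream (protocol 2+) starts with the PROTO opcode 0x80 followed by the protocol number, so a .pth file that is a raw pickle is easy to fingerprint. (The newer torch.save output is a zip archive containing a pickle, so it starts with "PK" instead.)

```python
import pickle

blob = pickle.dumps({"weights": [1.0, 2.0]}, protocol=2)

# PROTO opcode (0x80) followed by the protocol version byte.
assert blob[:2] == b"\x80\x02"
```

A dedicated, self-describing extension (or magic string) would have made this obvious without insider knowledge.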
I am currently organizing a NeurIPS competition, in which participants might be submitting PyTorch models to our evaluation server for the leaderboard. Is there a secure way of serializing/loading untrusted PyTorch models? Are there alternatives to pickle, which can be insecure? Is there a PyTorch model format we can insist upon that is secure? I believe that ONNX and TensorFlow use protobuf, which avoids the security issues of pickling; that's an alternative to HDF5 to consider. I see that this is related to #52596.
https://github.com/CensoredUsername/picklemagic may be helpful for implementing your own loader for .pth files in a controlled way.
@KOLANICH thank you, but that repo is five years old with a single contributor. It's hard to trust abandonware that might not even work with the latest pickle format.
I have recently used it, though not for PyTorch files. Tooling dated this year didn't work; that library did.
@KOLANICH have you used it recently? |
In March. In some Python package a very suspicious pickle was inlined... though it turned out to be benign.
Kaitai Struct has a Rust target (a C++ target is also present and is thought to be secure, though I'm not sure anyone has really checked it; the C target is very immature; and none of the targets (except the Java one, to some minor extent) currently support serialization).
Might be good to have safe mode default for pytorch 2.0 |
Is weights_only=True default in PyTorch 2.0 now? |
related #97495 |
Putting in a request for weights_only=True support for OrderedDict. Edit: looks like collections.OrderedDict is supported but typing.OrderedDict is not... the joys of user-submitted models.
Okay, I went down the rabbit hole on this. It looks like typing.OrderedDict was used in state_dict() in earlier versions of torch like 1.9. Note that this was first changed to collections.OrderedDict and then finally just Dict. Many issues are seemingly somewhat related to this: #72778. So in summary, right now I don't think safe loading is compatible with models trained using torch==1.9, and safe loading will fail on floats for torch<2.0.0. Adding typing.OrderedDict to the allowed pickle types would enable backwards compatibility with models trained using torch==1.9.
Adding typing.OrderedDict to the allowed list shouldn't be too hard. Send us a patch?
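The general shape of such an allowlist can be sketched with the standard library's `pickle.Unpickler.find_class` hook, which is the same mechanism `weights_only=True` builds on. This is a simplified illustration, not PyTorch's actual implementation; the class name and the allowlist contents here are made up.

```python
import collections
import io
import pickle

class WeightsOnlyUnpickler(pickle.Unpickler):
    # Globals permitted during load; everything else is rejected.
    ALLOWED = {
        ("collections", "OrderedDict"),
        ("typing", "OrderedDict"),  # the alias seen in torch==1.9 state_dicts
    }

    def find_class(self, module, name):
        if (module, name) in self.ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")

def safe_loads(data: bytes):
    return WeightsOnlyUnpickler(io.BytesIO(data)).load()

# An allowed container round-trips fine...
ok = safe_loads(pickle.dumps(collections.OrderedDict(a=1.0)))

# ...while a pickled callable is rejected at load time.
try:
    safe_loads(pickle.dumps(eval))
    blocked = False
except pickle.UnpicklingError:
    blocked = True
```

Extending backward compatibility then amounts to adding one more `(module, name)` pair to the allowlist, which is why the fix suggested above is small.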
I hope that TorchServe loads models with weights_only=True. Otherwise, one untrusted model can gain access to the server (e.g. one team's model can steal private data from another team's model).
Use the same strategy as for the unsafe pickler, i.e. use a dummy `torch.serialization.StorageType` to represent legacy typed storage classes during deserialization. Add a `_dtype` property to be able to use it for both new- and legacy-format deserialization. Parametrize `test_serialization_new_format_old_format_compat`.

Add a regression test to validate that loading legacy modes can be done without any warnings.

Before the change:
```
% python test_serialization.py -v -k test_serialization_new_format_old_format_compat_
test_serialization_new_format_old_format_compat_cpu (__main__.TestBothSerializationCPU) ... ok
test_serialization_new_format_old_format_compat_safe_cpu (__main__.TestBothSerializationCPU) ... /Users/nshulga/git/pytorch/pytorch/torch/_utils.py:836: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
ok

----------------------------------------------------------------------
Ran 2 tests in 0.116s

OK
```

Without the change, but with the test updated to catch warnings:
```
% python test_serialization.py -v -k test_serialization_new_format_old_format_compat_
test_serialization_new_format_old_format_compat_weights_only_False_cpu (__main__.TestBothSerializationCPU) ... ok
test_serialization_new_format_old_format_compat_weights_only_True_cpu (__main__.TestBothSerializationCPU) ... FAIL

======================================================================
FAIL: test_serialization_new_format_old_format_compat_weights_only_True_cpu (__main__.TestBothSerializationCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/nshulga/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2536, in wrapper
    method(*args, **kwargs)
  File "/Users/nshulga/git/pytorch/pytorch/torch/testing/_internal/common_device_type.py", line 415, in instantiated_test
    result = test(self, **param_kwargs)
  File "/Users/nshulga/git/pytorch/pytorch/test/test_serialization.py", line 807, in test_serialization_new_format_old_format_compat
    self.assertTrue(len(w) == 0, msg=f"Expected no warnings but got {[str(x) for x in w]}")
AssertionError: False is not true : Expected no warnings but got ["{message : UserWarning('TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()'), category : 'UserWarning', filename : '/Users/nshulga/git/pytorch/pytorch/torch/_utils.py', lineno : 836, line : None}"]

To execute this test, run the following from the base repo dir:
    python test/test_serialization.py -k test_serialization_new_format_old_format_compat_weights_only_True_cpu

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 2 tests in 0.109s

FAILED (failures=1)
```

Fixes the problem reported in #52181 (comment).

Pull Request resolved: #113614
Approved by: https://github.com/kit1980, https://github.com/albanD
Of course, the real problem there is that execution of internet-published code is the desired scenario. But I think this should at least push PyTorch to make weight loading safe by default... and maybe have an audit of where PyTorch loads unsafe pickles, at least to force the user to opt in via a mandatory argument.
Also quite in favor of something like this ^
Given the recent popularity of the "backdoor" theme in general, hoping this happens...
@mikaylagawarecki @msaroufim So this issue can now be closed as completed? :) If this is merged:
When I tried to upload the pickle file to the HF Space for my ensemble model, on prediction it shows an HF picklescan error. Can you provide any solutions? This is my requirements.txt file: joblib...
@Jannat20242NSU that looks like an issue that should be filed in the respective Hugging Face repository that shows the picklescan error.
Closing as complete since this is now the default.
Maybe an option for torch.load that doesn't allow arbitrary unpickling, and thus arbitrary code execution?
Yes, one should not load/run code from unknown locations, but sometimes intermediate controls could be good: e.g. allowing only known types to be loaded, such as tensors (and not model instances or other things), bypassing the generic unpickling mechanism.
(I've long been a proponent of standardized formats for weight storage such as HDF5, but this didn't get traction.)
cc @ezyang @gchanan @zou3519 @mruberry @nairbv @NicolasHug @vmoens @jdsgomes @bdhirsh @jbschlosser @anjali411 @ailzhang
Also, the popularity of the HuggingFace hub (and the existing torch.hub) makes this more acute. At some point we will have a malicious model uploaded there that becomes popular on Twitter, e.g. because it composites very cute cats into existing images. The malicious model can at least hijack some precious GPU compute, and at worst take over institute/company computer networks.