PyTorch serialization formats · Issue #31877 · pytorch/pytorch · GitHub
PyTorch serialization formats #31877

Closed
lutzroeder opened this issue Jan 5, 2020 · 11 comments
Labels: high priority, oncall: jit, triage review, triaged

Comments

@lutzroeder
Contributor
lutzroeder commented Jan 5, 2020

@soumith, @ezyang, as PyTorch serialization formats keep changing and evolving, is there a scheme to name and version the different formats to avoid confusion?

Something along these lines:

| Description | Format |
| --- | --- |
| Tar file with sys_info, pickle, storages, tensors | PyTorch v0.1.1 |
| Multi-Pickle file with 8a0a6cfc9c... signature | PyTorch v0.1.10 |
| Zip file containing constants.pkl and model.json | TorchScript v1.0 |
| Zip file containing constants.pkl and data.pkl | TorchScript v1.3 |
| Zip file containing data.pkl but no code | PyTorch v1.3 |

cc @ezyang @gchanan @zou3519 @suo

@ezyang
Contributor
ezyang commented Jan 8, 2020

I agree we should have these docs

@ezyang
Contributor
ezyang commented Jan 8, 2020

I believe the zip format should have a magic number that is used for versioning but I am not sure we have publicly documented it

ezyang added the oncall: jit label Jan 8, 2020
@lutzroeder
Contributor Author
lutzroeder commented Jan 9, 2020

The higher order question is if there is a consistent and "official" way to refer to ALL formats, including the legacy formats. If the scheme in the description above makes sense, then Zip format version in the /version file should probably match the version of PyTorch the format change was introduced in (/version is already set to 1 for multiple formats so PyTorch Zip v1 doesn't work as a description). This would lead to a pattern like PyTorch Zip v1 (Preview), PyTorch Zip v1, PyTorch Zip TorchScript v1, PyTorch Multi-Pickle and PyTorch Tar Pickle which would be more descriptive but gets confusing quickly.

@driazati
Contributor
driazati commented Jan 9, 2020

Going through the current versions down to 0.3, it looks like we haven't been that great about versioning. The eager format has been very stable as the multi-pickle file (no changes since 0.3; the .tar format is 3 years old and I couldn't even get torch<0.3 to install correctly), and the protocol version / magic numbers we encode in the eager serialization haven't changed (1001 and 0x1950A86A20F9469CFC6C respectively).
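As a rough illustration of the magic number and protocol version mentioned above, the sketch below reads the first two pickled values from a stream using only the Python standard library. It assumes that a legacy eager file begins with the magic number and the protocol version as two separate pickles; that layout is an assumption here, not something spelled out in this thread. The names `read_eager_header` and `_PrimitivesOnlyUnpickler` are hypothetical.

```python
import io
import pickle

MAGIC_NUMBER = 0x1950A86A20F9469CFC6C  # value quoted in the comment above
PROTOCOL_VERSION = 1001                # value quoted in the comment above


class _PrimitivesOnlyUnpickler(pickle.Unpickler):
    """Unpickler that refuses to import anything, so only plain values load."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"refusing to load {module}.{name}")


def read_eager_header(stream):
    """Return (magic, protocol) from the first two pickles, or None if absent.

    Assumption: the stream starts with two standalone pickles holding integers.
    """
    unpickler = _PrimitivesOnlyUnpickler(stream)
    try:
        magic = unpickler.load()
        protocol = unpickler.load()
    except Exception:
        # Anything that is not a well-formed pickle of plain values ends up here.
        return None
    if magic != MAGIC_NUMBER:
        return None
    return magic, protocol
```

A tool could call `read_eager_header(open(path, "rb"))` and compare the second element against `PROTOCOL_VERSION`. Overriding `find_class` keeps the check from importing or executing anything referenced by an untrusted file.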

The TorchScript format has gone through more structural changes with only 1 really fundamental change (when model.json was removed in favor of data.pkl), but (as you have found) we haven't been bumping the version numbers accordingly (we only did once in #28122). So there's really no way to tell the versions apart from inspecting the presence of certain files.

There is also a new serialization format for eager ("Eager v2" below) coming in 1.4.0 that will be hidden behind a flag (torch.save(obj, file, _use_new_zipfile_serialization=True)), but the serialized format is a zip file that matches the TorchScript version.

| PyTorch Version | Eager Format | TorchScript Format | TorchScript archive/version file |
| --- | --- | --- | --- |
| 1.4 | Eager v1 / Eager v2 (via a flag) | Script v4 | 2 |
| 1.3 | Eager v1 | Script v4 | 1 |
| 1.2 | Eager v1 | Script v3 | 1 |
| 1.1 | Eager v1 | Script v2 | 1 |
| 1.0 | Eager v1 | Script v1 | 1 |
| 0.4 | Eager v1 | n/a | n/a |
| 0.3 | Eager v1 | n/a | n/a |

These commands show the contents of the zipfiles for each version, the binaries are all from here.

Script v1, v2, v3, v4

$ unzip script_example_1.4.0.pt
Archive:  script_example_1.4.0.pt
 extracting: script_example_1.5.0a0/version
 extracting: script_example_1.5.0a0/data/0
 extracting: script_example_1.5.0a0/data/1
 extracting: script_example_1.5.0a0/data.pkl
  inflating: script_example_1.5.0a0/code/__torch__.py
  inflating: script_example_1.5.0a0/code/__torch__.py.debug_pkl
 extracting: script_example_1.5.0a0/constants.pkl

$ unzip script_example_1.3.1.pt
Archive:  script_example_1.3.1.pt
 extracting: script_example_1.3.1/version
  inflating: script_example_1.3.1/code/__torch__.py
  inflating: script_example_1.3.1/code/__torch__.py.debug_pkl
 extracting: script_example_1.3.1/constants.pkl
 extracting: script_example_1.3.1/data/0
 extracting: script_example_1.3.1/data/1
 extracting: script_example_1.3.1/data.pkl

$ unzip script_example_1.2.0.pt
Archive:  script_example_1.2.0.pt
 extracting: script_example_1.2.0/version
 extracting: script_example_1.2.0/code/script_example_1.2.0.py
 extracting: script_example_1.2.0/debug/script_example_1.2.0.pkl
 extracting: script_example_1.2.0/attributes.pkl
 extracting: script_example_1.2.0/tensors/0
 extracting: script_example_1.2.0/tensors/1
 extracting: script_example_1.2.0/model.json

$ unzip script_example_1.1.0.pt
Archive:  script_example_1.1.0.pt
 extracting: script_example_1.1.0/version
 extracting: script_example_1.1.0/code/script_example_1.1.0.py
 extracting: script_example_1.1.0/attributes.pkl
 extracting: script_example_1.1.0/tensors/0
 extracting: script_example_1.1.0/tensors/1
 extracting: script_example_1.1.0/model.json

$ unzip script_example_1.0.0.pt
Archive:  script_example_1.0.0.pt
 extracting: script_example_1.0.0/version
 extracting: script_example_1.0.0/code/script_example_1.0.0.py
 extracting: script_example_1.0.0/tensors/0
 extracting: script_example_1.0.0/tensors/1
 extracting: script_example_1.0.0/model.json

Eager v2

$ unzip eager_example_new_1.5.0a0.pt
Archive:  eager_example_new_1.5.0a0.pt
 extracting: eager_example_new_1.5.0a0/version  
 extracting: eager_example_new_1.5.0a0/data.pkl  
 extracting: eager_example_new_1.5.0a0/data/94148253811520
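The same inspection can be done without shelling out to unzip. The following is a minimal sketch using Python's standard zipfile module; `describe_archive` is a hypothetical helper, and it assumes (as in the listings above) that every entry sits under a single top-level folder and that the version entry holds a short ASCII string.

```python
import io
import zipfile


def describe_archive(data):
    """Return (sorted entry paths below the root folder, version text or None)."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        # Strip the leading "<archive name>/" component from each entry.
        entries = {name.split("/", 1)[-1]: name for name in archive.namelist()}
        version = None
        if "version" in entries:
            version = archive.read(entries["version"]).decode("ascii").strip()
    return sorted(entries), version
```

For a file on disk, something like `describe_archive(open("eager_example_new_1.5.0a0.pt", "rb").read())` would return the entry list and the raw contents of the version file.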

@lutzroeder
Contributor Author
lutzroeder commented Jan 10, 2020

@driazati thank you for sharing the files.

686e8d3 introduced .tar on 2016-08-22 which matches PyTorch v0.1.1.
e71cf20 introduced 0x1950A86A20F9469CFC6C on 2017-02-22 which matches PyTorch v0.1.10.

I'm trying to get more specific on how different formats can be uniquely named. For example, instead of "this file is in .tar format", would it be correct to say "this file is in PyTorch v0.1.1 format"? I'm not sure PyTorch Eager v1 would mean much, and it sounds like going forward both Eager and TorchScript are going to use the same Zip container format.

For example, assume a tool needs to tell a user which format a .pth file is in:

def get_display_format(file):
  if is_tar_file(file):
    return "XXXXX" # PyTorch v0.1.1?
  if is_multi_pickle_file(file):
    return "XXXXX" # PyTorch v0.1.10?
  if is_zip_file(file):
    if has_version_file(file, 1) or has_version_file(file, None):
      if has_model_json(file):
        return "XXXXX" # PyTorch v1.0?
      if has_data_pkl(file) and has_code_folder(file):
        return "XXXXX" # PyTorch v1.3 (TorchScript)?
      if has_data_pkl(file) and not has_code_folder(file):
        return "XXXXX" # PyTorch v1.4?
    else:
      if has_code_folder(file):
        return "XXXXX v" + generate_display_version_from_version_file(file) + " (TorchScript)"
      else:
        return "XXXXX v" + generate_display_version_from_version_file(file)

Question 1: What should those specific XXXXX format names be? Would it make sense to just use the PyTorch version that started creating this format and if so are the ones suggested in comments correct? Sounds like the answer is yes. Detecting script variants might be difficult but if it's possible there could be more conditions added to be more specific.

Question 2: What can be done to simplify this so generate_display_version_from_version_file would produce a human readable version going forward? Maybe TorchScript archive/version file (the last column in the table) going forward should match the actual PyTorch Version (first column) that introduced this format?

Question 3: Some files include tensor state only while others include code or model structure as well. Since this seems to cause confusion among users, is there a recommended way to include this in the format XXXXX name? I'm not sure there are any easy ways to do this (the same issue exists for Keras), so it's fine if there isn't an answer.

@driazati
Contributor

Can you give some more context on the background for this issue? An easy way to get everything into 1 format would be to load and re-save in the latest version of PyTorch. Since we're fully backwards compatible down to 0.1, we can load any objects (in the eager case) or TorchScript models and save them with the latest format.

  1. I think a reasonable naming scheme would be something like (Eager|TorchScript) v(1.0.0|1.1.0|etc.), so that the name conveys a) what type of file it is and b) the minimum version of PyTorch that can load that file.
  2. Going forward the version number should similarly refer to the minimum version of PyTorch that can load this archive.
  3. This is the main distinction between TorchScript and Eager saves. TorchScript (torch.jit.save) files are made to allow exporting a model to C++, so they include the module hierarchy and TorchScript code details. Eager mode save files (torch.save) follow Python's pickle, with a layer on top to support saving Tensors. Pickle does not save class definitions or code; it relies on the code referenced within to be defined when the file is loaded.
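To make point 3 concrete, the sketch below uses only the standard library (no PyTorch) to list what a pickle asks the loader to import. It pickles a plain `OrderedDict` as a stand-in for a saved state and walks the opcode stream with `pickletools`. `referenced_globals` is a hypothetical helper, and it only recognizes the `GLOBAL` opcode, so it assumes pickle protocol 2 or lower.

```python
import pickle
import pickletools
from collections import OrderedDict


def referenced_globals(payload):
    """List the 'module name' pairs that a protocol-2 pickle tells the loader to import."""
    return [arg for opcode, arg, _pos in pickletools.genops(payload)
            if opcode.name == "GLOBAL"]


# Stand-in for saved state: names mapped to plain Python values.
state = OrderedDict([("weight", [1.0, 2.0]), ("bias", [0.5])])
payload = pickle.dumps(state, protocol=2, fix_imports=False)

# Each entry is a module and class name, not the class body itself.
print(referenced_globals(payload))
```

The payload carries the keys and values plus a by-name reference to the container class. The class implementation is not in the bytes, which is why the loading side needs the referenced code to be importable.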

@lutzroeder
Contributor Author
lutzroeder commented Jan 11, 2020

Context is which format description Netron should show to users for diagnosing issues or discussing the changes PyTorch is going through. The other goal is to have some principles for version changes that will be followed going forward.

screenshot

Would this be a correct representation of what we discussed so far?

def get_display_format(file):
  if is_tar_file(file):
    return "PyTorch Eager v0.1.1"
  if is_multi_pickle_file(file):
    return "PyTorch Eager v0.1.10"
  if is_zip_file(file):
    if has_version_file(file, 2) or has_version_file(file, 1) or has_version_file(file, None):
      if has_model_json(file):
        return "PyTorch Script v1.0"
      if has_data_pkl(file):
        if has_code_folder(file):
          if has_version_file(file, 2):
            return "PyTorch Script v1.4"
          return "PyTorch Script v1.3"
        else:
          return "PyTorch Eager v1.4"
    else:
      if has_code_folder(file):
        return "PyTorch Script v" + generate_display_version_from_version_file(file)
      else:
        return "PyTorch Eager v" + generate_display_version_from_version_file(file)

It is still unclear how generate_display_version_from_version_file would work going forward. Are you planning to change the implementation of /version to store 1.5 instead of 2 going forward?

lutzroeder added a commit to lutzroeder/netron that referenced this issue Jan 11, 2020
@driazati
Contributor

Looks mostly good (pinging @ezyang for any thoughts), a couple notes:

  • Instead of has_code_folder, checking for the presence of constants.pkl is probably safer; this is what we do in our deserialization code to differentiate between the two.
  • Instead of things like PyTorch Script v1.0, it should be TorchScript v1.0 so everything is consistent with our docs and tutorials.

So in the end it'd look something like

def get_display_format(file):
  if is_tar_file(file):
    return "PyTorch v0.1.1"

  if is_multi_pickle_file(file):
    return "PyTorch v0.1.10"

  if is_zip_file(file):
    if has_model_json(file):
      if has_attribute_pkl(file):
        return "TorchScript v1.1"
      else:
        return "TorchScript v1.0"

    if has_data_pkl(file):
      if has_constants_pkl(file):
        if has_version_file(file, 2):
          return "TorchScript v1.4"
        return "TorchScript v1.3"
      else:
        return "PyTorch v1.4"

We discussed the /version internally and it probably won't change from its current format (a single number that gets bumped any time we change the way we serialize TorchScript code, so not any time there are changes to the file format)
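For reference, here is one way the helper checks in the pseudocode above could be filled in with Python's standard library. This is a sketch under stated assumptions, not PyTorch's own detection logic: it works on raw bytes, assumes the multi-pickle file was written with pickle protocol 2 (so the file starts with the pickled magic number), and assumes all zip entries sit under one top-level folder. The helpers `_is_tar` and `_inspect_zip` are hypothetical names; the display strings are the ones proposed in this thread.

```python
import io
import pickle
import tarfile
import zipfile

# Magic number quoted earlier in the thread; protocol 2 pickling is an assumption.
MAGIC_PREFIX = pickle.dumps(0x1950A86A20F9469CFC6C, protocol=2)


def _is_tar(data):
    """True when the bytes open as a tar archive."""
    try:
        with tarfile.open(fileobj=io.BytesIO(data)):
            return True
    except (tarfile.TarError, EOFError, OSError):
        return False


def _inspect_zip(data):
    """Return (entry paths below the root folder, integer /version or None)."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        by_path = {name.split("/", 1)[-1]: name for name in archive.namelist()}
        version = None
        if "version" in by_path:
            text = archive.read(by_path["version"]).decode("ascii", "replace").strip()
            version = int(text) if text.isdigit() else None
    return set(by_path), version


def get_display_format(data):
    """Return a display name for serialized bytes, or None if unrecognized."""
    if _is_tar(data):
        return "PyTorch v0.1.1"
    if data.startswith(MAGIC_PREFIX):
        return "PyTorch v0.1.10"
    if zipfile.is_zipfile(io.BytesIO(data)):
        names, version = _inspect_zip(data)
        if "model.json" in names:
            return "TorchScript v1.1" if "attributes.pkl" in names else "TorchScript v1.0"
        if "data.pkl" in names:
            # constants.pkl separates TorchScript archives from eager ones.
            if "constants.pkl" in names:
                return "TorchScript v1.4" if version == 2 else "TorchScript v1.3"
            return "PyTorch v1.4"
    return None
```

Comparing against the pickled prefix avoids unpickling untrusted input at all, at the cost of missing files written with a different pickle protocol.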

@lutzroeder
Contributor Author

Is TorchScript v1.3 intuitive enough? The docs never mention that TorchScript v1.3 implies requiring PyTorch v1.3. Would calling it PyTorch TorchScript v1.3 or PyTorch v1.3 TorchScript make this more explicit?

We discussed the /version internally and it probably won't change from its current format (a single number that gets bumped any time we change the way we serialize TorchScript code, so not any time there are changes to the file format)

Is there a way for tools to derive the PyTorch version needed for a given TorchScript file? If /version isn't changing would it make sense to add another field or file like /producer to add this information?
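To illustrate what such a /producer entry could look like from a tool's point of view, here is a purely hypothetical sketch using Python's zipfile module. Nothing here is an existing PyTorch feature: the entry name `producer`, its free-form text contents, and the helpers `add_producer` and `read_producer` are all assumptions made for the example.

```python
import io
import zipfile


def add_producer(archive_bytes, root, producer):
    """Return a copy of the archive with a '<root>/producer' text entry appended."""
    buffer = io.BytesIO(archive_bytes)
    # Append mode adds the new entry and rewrites the central directory.
    with zipfile.ZipFile(buffer, mode="a") as archive:
        archive.writestr(f"{root}/producer", producer)
    return buffer.getvalue()


def read_producer(archive_bytes):
    """Return the producer string, or None when the archive has no such entry."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in archive.namelist():
            if name.split("/", 1)[-1] == "producer":
                return archive.read(name).decode("utf-8").strip()
    return None
```

A reader that does not know about the extra entry would simply ignore it, so a scheme along these lines would not require changing the meaning of the existing /version file.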

lutzroeder added a commit to lutzroeder/netron that referenced this issue Jan 14, 2020
lutzroeder added a commit to lutzroeder/netron that referenced this issue Jan 14, 2020
@ezyang
Contributor
ezyang commented Jan 14, 2020

@driazati I have little to say about the exact details of how we are version testing, or what the variants of the versions should be named (aligning them with PyTorch releases sounds reasonable). What I don't see in this discussion is whether or not the team is going to commit to accurately reporting versions on the file format going forward, and if so, what mechanisms we can put in place to make sure that we update it when we make changes to the format (since it seems the lightweight mechanism of code review isn't working). A simple stopgap is to have it report the version of PyTorch which exported the model...

lutzroeder added a commit to lutzroeder/netron that referenced this issue Jan 15, 2020
@lutzroeder
Contributor Author

A simple stopgap is to have it report the version of PyTorch which exported the model...

Agree. Given the scheme we discussed above this would also make the most sense for tooling. The current /version stamp in TorchScript files hasn't been updated consistently and it is also not very useful or descriptive for tooling.
