Cache is never used because hashes do not match #389


Open
dbrakenhoff opened this issue Nov 29, 2024 · 13 comments · Fixed by #395 or #412 · May be fixed by #413
Labels
caching All caching related issues

Comments

@dbrakenhoff
Collaborator

At the moment I cannot use the caching functionality in one of my projects. This is confirmed to happen for the REGIS and AHN datasets.

For some reason the hashes never match, causing the cache to be invalidated. However, when I try to reproduce this in a separate minimal example, the caching works fine.

If anyone has any ideas, I'd love to hear them; otherwise I will post updates here as my investigation continues.

This works fine:

import nlmod

nlmod.util.get_color_logger("DEBUG")

cachedir = "."
extent = [100_000, 101_000, 400_000, 401_000]

# first call downloads REGIS and writes the cache
regis_ds = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")
# second call should load from the cache
regis_ds = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")

Versions:

Python version     : 3.11.10
NumPy version      : 1.26.4
Xarray version     : 2024.9.0
Matplotlib version : 3.9.2
Flopy version      : 3.9.0.dev1

nlmod version      : 0.9.0
@OnnoEbbens
Collaborator
OnnoEbbens commented Dec 11, 2024

I tried the following in a GH codespace (Debian GNU/Linux 12 (bookworm)):

  1. run script 1 (see below)
  2. restart the kernel
  3. run script 2 (see below)
  4. compare the hashes created in both scripts (see below)

What I get from this:

  • Reading the same cached netcdf file after restarting the kernel results in a different hash. I visually compared the two datasets (dimensions, coordinates, data variables and attributes) and I don't see any differences.
  • Same for reading the same pickled dataset.

I don't know why the hashes are different, but I guess it is not something we can easily solve.

The intention of the hash is to check whether the pickled function arguments were created together with the cached netcdf file, in other words, whether the .pklz and the .nc file belong together. This is a nice-to-have check, but the cache will work without it. So I would propose to disable the hash check until we find a solution for this.
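For reference, a minimal sketch of that pairing idea (hypothetical, not nlmod's actual implementation; the function names and the uncompressed .pklz layout are made up):

import hashlib
import pickle

def write_cache_pair(ds, func_args, cachedir, cachename):
    # Store the hash of the cached netcdf next to the pickled function
    # arguments, so a later call can verify the two files belong together.
    nc_path = f"{cachedir}/{cachename}.nc"
    ds.to_netcdf(nc_path)
    with open(nc_path, "rb") as f:
        nc_hash = hashlib.sha256(f.read()).hexdigest()
    with open(f"{cachedir}/{cachename}.pklz", "wb") as f:
        pickle.dump({"func_args": func_args, "nc_hash": nc_hash}, f)

def cache_pair_is_valid(cachedir, cachename):
    # Recompute the netcdf hash and compare it with the stored one.
    with open(f"{cachedir}/{cachename}.nc", "rb") as f:
        nc_hash = hashlib.sha256(f.read()).hexdigest()
    with open(f"{cachedir}/{cachename}.pklz", "rb") as f:
        stored = pickle.load(f)
    return stored["nc_hash"] == nc_hash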

Package versions used:
python 3.11.10
nlmod 0.9.0
xarray 2023.6.0
dask 2024.12.0

script 1

import nlmod
import pickle
import xarray as xr
import dask

# logging settings
nlmod.util.get_color_logger("INFO")

cachedir = '.'
extent = [204800, 205000, 438000, 438200]

#%%
# get regis dataset
regis_ds = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")
hash_orig = dask.base.tokenize(regis_ds)

# write to pickle
with open('regis_ds.pickle', 'wb') as handle:
    pickle.dump(regis_ds, handle, protocol=-1)

# read cached netcdf
regis_from_cache = xr.open_dataset('regis.nc')
hash_cache_direct = dask.base.tokenize(regis_from_cache)

# read pickle
with open('regis_ds.pickle', 'rb') as handle:
    regis_from_pickle = pickle.load(handle)
hash_pickle = dask.base.tokenize(regis_from_pickle)

# call get_regis again (this time the cache will be used)
regis_ds_ind = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")
hash_cache_indirect = dask.base.tokenize(regis_ds_ind)

# save hashes
with open('hashes.txt', 'w') as handle:
    handle.write(f'regis orig           : {hash_orig}\n')
    handle.write(f'regis cache direct   : {hash_cache_direct}\n')
    handle.write(f'regis pickle         : {hash_pickle}\n')
    handle.write(f'regis cache indirect : {hash_cache_indirect}\n')

yields:

INFO:nlmod.cache.wrapper:caching data -> regis.nc
INFO:nlmod.cache.wrapper:using cached data -> regis.nc

script 2

import nlmod
import pickle
import xarray as xr
import dask

# logging settings
nlmod.util.get_color_logger("INFO")

cachedir = '.'
extent = [204800, 205000, 438000, 438200]

# read cached netcdf
regis_from_cache = xr.open_dataset('regis.nc')
hash_cache_direct = dask.base.tokenize(regis_from_cache)
regis_from_cache.close()

# read pickle
with open('regis_ds.pickle', 'rb') as handle:
    regis_from_pickle = pickle.load(handle)
hash_pickle = dask.base.tokenize(regis_from_pickle)

# call get_regis again (this time the cache won't be used)
regis_ds_ind = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")
hash_cache_indirect = dask.base.tokenize(regis_ds_ind)

# save hashes
with open('hashesv2.txt', 'w') as handle:
    handle.write(f'regis cache direct   : {hash_cache_direct}\n')
    handle.write(f'regis pickle         : {hash_pickle}\n')
    handle.write(f'regis failed cache   : {hash_cache_indirect}\n')

yields:

INFO:nlmod.cache._same_function_arguments:cache was created using different function argument values, do not use cached data
INFO:nlmod.cache.wrapper:caching data -> regis.nc

hashes.txt:

regis orig           : ff4715f0946be500d54235e53b080e99
regis cache direct   : c408096ea4ae53ca7407605cf0ed6f33
regis pickle         : ff4715f0946be500d54235e53b080e99
regis cache indirect : c408096ea4ae53ca7407605cf0ed6f33

hashesv2.txt:

regis cache direct   : eea86620b9909cfd5b5cfbc0831b984d
regis pickle         : 05935639d8c6bf40843612b9082fb667
regis failed cache   : 05935639d8c6bf40843612b9082fb667

@OnnoEbbens OnnoEbbens added the caching All caching related issues label Dec 11, 2024
@OnnoEbbens OnnoEbbens linked a pull request Dec 16, 2024 that will close this issue
@bdestombe
Collaborator

#395

@bdestombe
Collaborator

Solved with #395

@github-project-automation github-project-automation bot moved this from Todo to Done in NHFLO Dec 22, 2024
@dbrakenhoff dbrakenhoff reopened this Feb 26, 2025
@dbrakenhoff
Collaborator Author

This seems to be happening again. After a kernel restart, the dask.base.tokenize result on a dataset is different. This means comparing hashes between sessions always fails and causes the cache to be ignored.

To reproduce:

import nlmod
import xarray as xr
import dask

nlmod.util.get_color_logger("DEBUG", logger_name="nlmod")

cachedir = "."
extent = [204800, 205000, 438000, 438200]
regis_ds = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")
print(dask.base.tokenize(regis_ds.to_dict()))

Output e.g.: 9af0a0cbca8e5622b223df291c9d4e97

Restart the kernel:

import xarray as xr
import dask

regis_from_cache = xr.open_dataset("regis.nc")
hash_cache_direct = dask.base.tokenize(regis_from_cache.to_dict())
print(hash_cache_direct)
regis_from_cache.close()

Output e.g.: 4292d5f37ef2bce8ec3f37095b9cc0fb

Package versions:

from importlib.metadata import version

{pkg: version(pkg) for pkg in ["dask", "xarray", "nlmod"]}

{'dask': '2024.9.0', 'xarray': '2025.1.2', 'nlmod': '0.9.2.dev0'}

@dbrakenhoff dbrakenhoff linked a pull request Feb 26, 2025 that will close this issue
@bdestombe
Collaborator

In #395 we moved from dask.base.tokenize to hashlib. Are you seeing the same behavior with hashlib?

@dbrakenhoff
Collaborator Author

In #395 we moved from dask.base.tokenize to hashlib. Are you seeing the same behavior with hashlib?

We only moved to hashlib for the hash of the entire netcdf file. In this case dask.tokenize is used to compute hashes for the data_vars and coords separately. I didn't really look into hashlib, to be honest. It would be nice if that solved it, but I wasn't sure how to compute the hash for only the coordinates or data variables using that module...

@bdestombe
Collaborator
bdestombe commented Feb 26, 2025

Would be nice if that solved it, but wasn't sure how to compute the hash for only the coordinates or data variables using that module...

Would something like the following work?

import hashlib
import json
import numpy as np
import xarray as xr

def hash_xarray_coord(coord):
    """
    Create a hash of an xarray coordinate object using array bytes and metadata.
    
    Parameters:
    -----------
    coord : xarray.core.coordinates.Coordinate
        The xarray coordinate object to hash
    
    Returns:
    --------
    str
        The hexadecimal hash string
    """
    # Get the raw bytes from the numpy array values
    values_bytes = coord.values.tobytes()
    
    # Get metadata as JSON
    metadata = {
        'name': coord.name,
        'dims': coord.dims,
        'attrs': coord.attrs,
        'dtype': str(coord.dtype),
        'shape': coord.shape
    }
    metadata_bytes = json.dumps(metadata, sort_keys=True).encode('utf-8')
    
    # Combine both sets of bytes for hashing
    combined_bytes = values_bytes + metadata_bytes
    
    # Create a hash of the combined bytes
    hash_obj = hashlib.sha256(combined_bytes)
    
    return hash_obj.hexdigest()

I'm not sure whether to also include the metadata in the hash or whether hashing just the values_bytes is sufficient...
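For what it's worth, a quick usage sketch of the function above (the toy coordinate is made up):

import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(3.0),
    dims="x",
    coords={"x": np.array([10, 20, 30])},
    name="demo",
)
# Hash only the "x" coordinate; repeated calls give the same digest.
print(hash_xarray_coord(da.coords["x"]))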

@bdestombe
Copy link
Collaborator

Another suggestion: you could also fill an in-memory BytesIO object, with file_buffer = BytesIO(); coords.to_netcdf(file_buffer); hash(file_buffer), which might provide a leaner solution.
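Spelled out, that idea might look like the following runnable sketch (the toy dataset is made up; writing to a file-like target uses xarray's scipy engine, so scipy must be installed):

import hashlib
from io import BytesIO

import xarray as xr

ds = xr.Dataset(coords={"x": [0, 1, 2]})

# Serialize only the coords to an in-memory buffer and hash the bytes.
file_buffer = BytesIO()
ds.coords.to_dataset().to_netcdf(file_buffer)
print(hashlib.sha256(file_buffer.getvalue()).hexdigest())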

@dbrakenhoff
Collaborator Author

The latter solution produces the same problem for me: the hash changes on load... There must be something in the xarray + netCDF round trip that changes every time the file is loaded?

import hashlib

import xarray as xr

# ds is the dataset obtained earlier (e.g. regis_ds)
bytes1 = ds.coords.to_dataset().to_netcdf()
hash1 = hashlib.sha256(bytes1).hexdigest()

ds.to_netcdf("test.nc")

ds2 = xr.open_dataset("test.nc")
bytes2 = ds2.coords.to_dataset().to_netcdf()
hash2 = hashlib.sha256(bytes2).hexdigest()

print(hash1 == hash2)  # False: the hash changes after the round trip

Your first solution seems to work for me.

@dbrakenhoff
Collaborator Author
dbrakenhoff commented Feb 27, 2025

Maybe found the culprit too. Per-coordinate hashes, before and after reloading from disk:

x
fb19a17d67624fa961de3b068ae6e097b25e82aae6c232e8abdcb9f4b45224e2
fb19a17d67624fa961de3b068ae6e097b25e82aae6c232e8abdcb9f4b45224e2
y
fb19a17d67624fa961de3b068ae6e097b25e82aae6c232e8abdcb9f4b45224e2
fb19a17d67624fa961de3b068ae6e097b25e82aae6c232e8abdcb9f4b45224e2
layer
a54432c15f27ceeacb2654fca1297758a5f030d3f355785bc9eda5ad1e60e143
d6cb94c7898e14751da4b843b100fe0f556c22802aca31a1496e012d8397be97
time
013e4e40c7a979af654dff4af06d7e0f2eead44eaeef37ab87c1572c54d21498
013e4e40c7a979af654dff4af06d7e0f2eead44eaeef37ab87c1572c54d21498

The layer dtype changes from object to <U6 when reloading from disk ...

EDIT: got the dtype order wrong at first: object initially, <U6 after reload.
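If that is the cause, one workaround might be to canonicalize string-like values before hashing. A hypothetical helper (not part of nlmod):

import numpy as np

def coord_values_bytes(values):
    # Return bytes for hashing that are stable across the
    # object <-> fixed-width unicode round trip (hypothetical helper).
    if values.dtype == object or values.dtype.kind == "U":
        # Canonical text form, independent of the storage dtype.
        return "\x1f".join(map(str, values.ravel())).encode("utf-8")
    return values.tobytes()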

@bdestombe
Collaborator
bdestombe commented Feb 27, 2025

The following should then work for the data vars:

import hashlib
import json
import numpy as np
import xarray as xr


def hash_xarray_data_var(data_array):
    """
    Create a hash of an xarray DataArray object using array bytes and metadata.
    
    Parameters:
    -----------
    data_array : xarray.DataArray
        The xarray DataArray object to hash
    
    Returns:
    --------
    str
        The hexadecimal hash string
    """
    # Get the raw bytes from the numpy array values
    values_bytes = data_array.values.tobytes()
    
    # Hash each coordinate separately
    coord_hashes = {}
    for coord_name, coord in data_array.coords.items():
        coord_hashes[coord_name] = hash_xarray_coord(coord)
    
    # Get metadata as JSON
    metadata = {
        'name': data_array.name,
        'dims': data_array.dims,
        'attrs': data_array.attrs,
        'dtype': str(data_array.dtype),
        'shape': data_array.shape,
        'coord_hashes': coord_hashes
    }
    metadata_bytes = json.dumps(metadata, sort_keys=True).encode('utf-8')
    
    # Combine both sets of bytes for hashing
    combined_bytes = values_bytes + metadata_bytes
    
    # Create a hash of the combined bytes
    hash_obj = hashlib.sha256(combined_bytes)
    
    return hash_obj.hexdigest()
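Quick usage sketch (the toy DataArray is made up; this also assumes hash_xarray_coord from the earlier comment is defined):

da = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("y", "x"),
    coords={"y": [0, 1], "x": [0, 1, 2]},
    name="demo",
)
print(hash_xarray_data_var(da))  # stable across repeated calls in one session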

@dbrakenhoff
Collaborator Author

This is a script to study the differences between the downloaded REGIS dataset and the copy loaded from netCDF. Maybe not the best example for our caching stuff, since the REGIS dataset doesn't need to be checked against its own stored copy, but I thought it might give some insights.

Some observations:

  • the layer coordinate has a different dtype after reading from file
  • the layer coordinate attrs disappear after reading from file
  • spatial_ref is a mystery to me: it's in both datasets (downloaded, and read from cache), but if I don't drop it from the ds, I can't get any of the hashes to match.
  • including metadata when computing the hashes means they're all different (which maybe makes sense for the layer coordinate given the second point, but I don't know about the rest).
# %%
import hashlib

import dask
import nlmod
import xarray as xr

nlmod.util.get_color_logger("DEBUG", logger_name="nlmod")

# %%
cachedir = "."
extent = [204800, 205000, 438000, 438200]
regis_ds = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")
regis_ds = regis_ds.drop("spatial_ref")

print(dask.base.tokenize(regis_ds.to_dict()))
print(hashlib.sha256(regis_ds.to_netcdf()).hexdigest())

for coord in regis_ds.coords:
    print(coord, regis_ds[coord].dtype)
    print(nlmod.cache.hash_xarray_coords(regis_ds[coord], include_metadata=False))

for da in regis_ds.data_vars:
    print(da, regis_ds[da].dtype)
    print(nlmod.cache.hash_xarray_data_vars(regis_ds[da], include_metadata=False))

# %%
regis_from_cache = xr.open_dataset("regis.nc")
regis_from_cache = regis_from_cache.drop("spatial_ref")
regis_from_cache = regis_from_cache.assign_coords(
    {"layer": regis_from_cache["layer"].values.astype(regis_ds["layer"].dtype)}
)

print(dask.base.tokenize(regis_from_cache.to_dict()))
print(hashlib.sha256(regis_from_cache.to_netcdf()).hexdigest())

for coord in regis_from_cache.coords:
    print(coord, regis_from_cache[coord].dtype)
    print(nlmod.cache.hash_xarray_coords(regis_from_cache[coord], include_metadata=False))

for da in regis_from_cache.data_vars:
    print(da, regis_from_cache[da].dtype)
    print(nlmod.cache.hash_xarray_data_vars(regis_from_cache[da], include_metadata=False))

@dbrakenhoff
Collaborator Author

Wait for decision on #413 before closing this issue.

@dbrakenhoff dbrakenhoff reopened this Feb 28, 2025
@dbrakenhoff dbrakenhoff linked a pull request Feb 28, 2025 that will close this issue