Cache is never used because hashes do not match #389


Open
dbrakenhoff opened this issue Nov 29, 2024 · 13 comments · Fixed by #395 or #412 · May be fixed by #413
Labels
caching All caching related issues

Comments

@dbrakenhoff
Collaborator

At the moment I cannot use the caching functionality in one of my projects. This is confirmed to happen for the REGIS and AHN datasets.

For some reason the hashes never match, causing the cache to be invalidated. However, when I try to reproduce this in a separate minimal example, the caching works fine.

If anyone has any ideas, I'd love to hear them; otherwise I will post updates here as my investigation continues.

This works fine:

import nlmod

nlmod.util.get_color_logger("DEBUG")

cachedir = "."
extent = [100_000, 101_000, 400_000, 401_000]

# first call downloads REGIS and writes the cache
regis_ds = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")
# second call should load from the cache
regis_ds = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")

Versions:

Python version     : 3.11.10
NumPy version      : 1.26.4
Xarray version     : 2024.9.0
Matplotlib version : 3.9.2
Flopy version      : 3.9.0.dev1

nlmod version      : 0.9.0
@OnnoEbbens
Collaborator
OnnoEbbens commented Dec 11, 2024

I tried the following in a GH codespace (Debian GNU/Linux 12 (bookworm)):

  1. run script 1 (see below)
  2. restart the kernel
  3. run script 2 (see below)
  4. compare the hashes created in both scripts (see below)

What I get from this:

  • Reading the same cached netcdf file after restarting the kernel results in a different hash. I visually compared the two datasets (dimensions, coordinates, data variables and attributes) and I don't see any differences.
  • Same for reading the same pickled dataset.

I don't know why the hashes are different, but I guess it is not something we can easily solve.

The intention of the hash is to check whether the pickled function arguments were created together with the cached netcdf file, in other words, whether the .pklz and the .nc file belong together. This is a nice-to-have check, but the cache will work without it. So I would propose to disable the hash check until we find a solution for this.
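For reference, a minimal sketch of that pairing idea (hypothetical, not nlmod's actual implementation; the function names and the uncompressed .pklz layout are made up):

import hashlib
import pickle

def write_cache_pair(ds, func_args, cachedir, cachename):
    # Store the hash of the cached netcdf next to the pickled function
    # arguments, so a later call can verify the two files belong together.
    nc_path = f"{cachedir}/{cachename}.nc"
    ds.to_netcdf(nc_path)
    with open(nc_path, "rb") as f:
        nc_hash = hashlib.sha256(f.read()).hexdigest()
    with open(f"{cachedir}/{cachename}.pklz", "wb") as f:
        pickle.dump({"func_args": func_args, "nc_hash": nc_hash}, f)

def cache_pair_is_valid(cachedir, cachename):
    # Recompute the netcdf hash and compare it with the stored one.
    with open(f"{cachedir}/{cachename}.nc", "rb") as f:
        nc_hash = hashlib.sha256(f.read()).hexdigest()
    with open(f"{cachedir}/{cachename}.pklz", "rb") as f:
        stored = pickle.load(f)
    return stored["nc_hash"] == nc_hash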

Package versions used:
python 3.11.10
nlmod 0.9.0
xarray 2023.6.0
dask 2024.12.0

script 1

import nlmod
import pickle
import xarray as xr
import dask

# logging settings
nlmod.util.get_color_logger("INFO")

cachedir = '.'
extent = [204800, 205000, 438000, 438200]

#%%
# get regis dataset
regis_ds = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")
hash_orig = dask.base.tokenize(regis_ds)

# write to pickle
with open('regis_ds.pickle', 'wb') as handle:
    pickle.dump(regis_ds, handle, protocol=-1)

# read cached netcdf
regis_from_cache = xr.open_dataset('regis.nc')
hash_cache_direct = dask.base.tokenize(regis_from_cache)

# read pickle
with open('regis_ds.pickle', 'rb') as handle:
    regis_from_pickle = pickle.load(handle)
hash_pickle = dask.base.tokenize(regis_from_pickle)

# call get_regis again (this time the cache will be used)
regis_ds_ind = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")
hash_cache_indirect = dask.base.tokenize(regis_ds_ind)

# save hashes
with open('hashes.txt', 'w') as handle:
    handle.write(f'regis orig           : {hash_orig}\n')
    handle.write(f'regis cache direct   : {hash_cache_direct}\n')
    handle.write(f'regis pickle         : {hash_pickle}\n')
    handle.write(f'regis cache indirect : {hash_cache_indirect}\n')

yields:

INFO:nlmod.cache.wrapper:caching data -> regis.nc
INFO:nlmod.cache.wrapper:using cached data -> regis.nc

script 2

import nlmod
import pickle
import xarray as xr
import dask

# logging settings
nlmod.util.get_color_logger("INFO")

cachedir = '.'
extent = [204800, 205000, 438000, 438200]

# read cached netcdf
regis_from_cache = xr.open_dataset('regis.nc')
hash_cache_direct = dask.base.tokenize(regis_from_cache)
regis_from_cache.close()

# read pickle
with open('regis_ds.pickle', 'rb') as handle:
    regis_from_pickle = pickle.load(handle)
hash_pickle = dask.base.tokenize(regis_from_pickle)

# call get_regis again (this time the cache won't be used)
regis_ds_ind = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")
hash_cache_indirect = dask.base.tokenize(regis_ds_ind)

# save hashes
with open('hashesv2.txt', 'w') as handle:
    handle.write(f'regis cache direct   : {hash_cache_direct}\n')
    handle.write(f'regis pickle         : {hash_pickle}\n')
    handle.write(f'regis failed cache   : {hash_cache_indirect}\n')

yields:

INFO:nlmod.cache._same_function_arguments:cache was created using different function argument values, do not use cached data
INFO:nlmod.cache.wrapper:caching data -> regis.nc

hashes.txt:

regis orig           : ff4715f0946be500d54235e53b080e99
regis cache direct   : c408096ea4ae53ca7407605cf0ed6f33
regis pickle         : ff4715f0946be500d54235e53b080e99
regis cache indirect : c408096ea4ae53ca7407605cf0ed6f33

hashesv2.txt:

regis cache direct   : eea86620b9909cfd5b5cfbc0831b984d
regis pickle         : 05935639d8c6bf40843612b9082fb667
regis failed cache   : 05935639d8c6bf40843612b9082fb667

@OnnoEbbens OnnoEbbens added the caching All caching related issues label Dec 11, 2024
@OnnoEbbens OnnoEbbens linked a pull request Dec 16, 2024 that will close this issue
@bdestombe
Collaborator

#395

@bdestombe
Collaborator

Solved with #395

@github-project-automation github-project-automation bot moved this from Todo to Done in NHFLO Dec 22, 2024
@dbrakenhoff dbrakenhoff reopened this Feb 26, 2025
@dbrakenhoff
Collaborator Author

This seems to be happening again. After a kernel restart, the dask.base.tokenize result on a dataset is different. This means comparing hashes between sessions always fails and causes the cache to be ignored.

To reproduce:

import nlmod
import xarray as xr
import dask

nlmod.util.get_color_logger("DEBUG", logger_name="nlmod")

cachedir = "."
extent = [204800, 205000, 438000, 438200]
regis_ds = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")
print(dask.base.tokenize(regis_ds.to_dict()))

Output e.g.: 9af0a0cbca8e5622b223df291c9d4e97

Restart the kernel:

import xarray as xr
import dask

regis_from_cache = xr.open_dataset("regis.nc")
hash_cache_direct = dask.base.tokenize(regis_from_cache.to_dict())
print(hash_cache_direct)
regis_from_cache.close()

Output e.g.: 4292d5f37ef2bce8ec3f37095b9cc0fb

Package versions:

from importlib.metadata import version

{pkg: version(pkg) for pkg in ["dask", "xarray", "nlmod"]}

{'dask': '2024.9.0', 'xarray': '2025.1.2', 'nlmod': '0.9.2.dev0'}

@dbrakenhoff dbrakenhoff linked a pull request Feb 26, 2025 that will close this issue
@bdestombe
Collaborator

In #395 we moved from dask.base.tokenize to hashlib. Are you seeing the same behavior with hashlib?

@dbrakenhoff
Collaborator Author

In #395 we moved from dask.base.tokenize to hashlib. Are you seeing the same behavior with hashlib?

We only moved to hashlib for the hash of the entire netcdf file. In this case dask.tokenize is used to compute hashes for the data_vars and coords separately. I didn't really look into hashlib, to be honest. It would be nice if that solved it, but I wasn't sure how to compute the hash for only the coordinates or data variables using that module...

@bdestombe
Collaborator
bdestombe commented Feb 26, 2025

Would be nice if that solved it, but wasn't sure how to compute the hash for only the coordinates or data variables using that module...

Would something like the following work?

import hashlib
import json
import numpy as np
import xarray as xr

def hash_xarray_coord(coord):
    """
    Create a hash of an xarray coordinate object using array bytes and metadata.
    
    Parameters:
    -----------
    coord : xarray.core.coordinates.Coordinate
        The xarray coordinate object to hash
    
    Returns:
    --------
    str
        The hexadecimal hash string
    """
    # Get the raw bytes from the numpy array values
    values_bytes = coord.values.tobytes()
    
    # Get metadata as JSON
    metadata = {
        'name': coord.name,
        'dims': coord.dims,
        'attrs': coord.attrs,
        'dtype': str(coord.dtype),
        'shape': coord.shape
    }
    metadata_bytes = json.dumps(metadata, sort_keys=True).encode('utf-8')
    
    # Combine both sets of bytes for hashing
    combined_bytes = values_bytes + metadata_bytes
    
    # Create a hash of the combined bytes
    hash_obj = hashlib.sha256(combined_bytes)
    
    return hash_obj.hexdigest()

I'm not sure whether to also include the metadata in the hash or whether hashing just the values_bytes is sufficient...
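For what it's worth, a quick usage sketch of the function above (the toy coordinate is made up):

import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(3.0),
    dims="x",
    coords={"x": np.array([10, 20, 30])},
    name="demo",
)
# Hash only the "x" coordinate; repeated calls give the same digest.
print(hash_xarray_coord(da.coords["x"]))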

@bdestombe
Copy link
Collaborator

Another suggestion: you could also fill an in-memory BytesIO object, with file_buffer = BytesIO(); coords.to_netcdf(file_buffer); hash(file_buffer), which might provide a leaner solution.
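Spelled out, that idea might look like the following runnable sketch (the toy dataset is made up; writing to a file-like target uses xarray's scipy engine, so scipy must be installed):

import hashlib
from io import BytesIO

import xarray as xr

ds = xr.Dataset(coords={"x": [0, 1, 2]})

# Serialize only the coords to an in-memory buffer and hash the bytes.
file_buffer = BytesIO()
ds.coords.to_dataset().to_netcdf(file_buffer)
print(hashlib.sha256(file_buffer.getvalue()).hexdigest())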

@dbrakenhoff
Collaborator Author

The latter solution produces the same problem for me: the hash changes on load... There must be something in the xarray + netCDF round trip that changes every time the file is loaded?

import hashlib

import xarray as xr

# ds is the dataset obtained earlier (e.g. regis_ds)
bytes1 = ds.coords.to_dataset().to_netcdf()
hash1 = hashlib.sha256(bytes1).hexdigest()

ds.to_netcdf("test.nc")

ds2 = xr.open_dataset("test.nc")
bytes2 = ds2.coords.to_dataset().to_netcdf()
hash2 = hashlib.sha256(bytes2).hexdigest()

print(hash1 == hash2)  # False: the hash changes after the round trip

Your first solution seems to work for me.

@dbrakenhoff
Collaborator Author
dbrakenhoff commented Feb 27, 2025

Maybe found the culprit too. Per-coordinate hashes, before and after reloading from disk:

x
fb19a17d67624fa961de3b068ae6e097b25e82aae6c232e8abdcb9f4b45224e2
fb19a17d67624fa961de3b068ae6e097b25e82aae6c232e8abdcb9f4b45224e2
y
fb19a17d67624fa961de3b068ae6e097b25e82aae6c232e8abdcb9f4b45224e2
fb19a17d67624fa961de3b068ae6e097b25e82aae6c232e8abdcb9f4b45224e2
layer
a54432c15f27ceeacb2654fca1297758a5f030d3f355785bc9eda5ad1e60e143
d6cb94c7898e14751da4b843b100fe0f556c22802aca31a1496e012d8397be97
time
013e4e40c7a979af654dff4af06d7e0f2eead44eaeef37ab87c1572c54d21498
013e4e40c7a979af654dff4af06d7e0f2eead44eaeef37ab87c1572c54d21498

The layer dtype changes from object to <U6 when reloading from disk ...

EDIT: got the dtype order wrong at first: object initially, <U6 after reload.
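If that is the cause, one workaround might be to canonicalize string-like values before hashing. A hypothetical helper (not part of nlmod):

import numpy as np

def coord_values_bytes(values):
    # Return bytes for hashing that are stable across the
    # object <-> fixed-width unicode round trip (hypothetical helper).
    if values.dtype == object or values.dtype.kind == "U":
        # Canonical text form, independent of the storage dtype.
        return "\x1f".join(map(str, values.ravel())).encode("utf-8")
    return values.tobytes()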

@bdestombe
Collaborator
bdestombe commented Feb 27, 2025

The following should then work for the data vars:

import hashlib
import json
import numpy as np
import xarray as xr


def hash_xarray_data_var(data_array):
    """
    Create a hash of an xarray DataArray object using array bytes and metadata.
    
    Parameters:
    -----------
    data_array : xarray.DataArray
        The xarray DataArray object to hash
    
    Returns:
    --------
    str
        The hexadecimal hash string
    """
    # Get the raw bytes from the numpy array values
    values_bytes = data_array.values.tobytes()
    
    # Hash each coordinate separately
    coord_hashes = {}
    for coord_name, coord in data_array.coords.items():
        coord_hashes[coord_name] = hash_xarray_coord(coord)
    
    # Get metadata as JSON
    metadata = {
        'name': data_array.name,
        'dims': data_array.dims,
        'attrs': data_array.attrs,
        'dtype': str(data_array.dtype),
        'shape': data_array.shape,
        'coord_hashes': coord_hashes
    }
    metadata_bytes = json.dumps(metadata, sort_keys=True).encode('utf-8')
    
    # Combine both sets of bytes for hashing
    combined_bytes = values_bytes + metadata_bytes
    
    # Create a hash of the combined bytes
    hash_obj = hashlib.sha256(combined_bytes)
    
    return hash_obj.hexdigest()
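Quick usage sketch (the toy DataArray is made up; this also assumes hash_xarray_coord from the earlier comment is defined):

da = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("y", "x"),
    coords={"y": [0, 1], "x": [0, 1, 2]},
    name="demo",
)
print(hash_xarray_data_var(da))  # stable across repeated calls in one session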

@dbrakenhoff
Collaborator Author

This is a script to study the differences between the downloaded REGIS dataset and the copy loaded from netCDF. Maybe not the best example for our caching stuff, since the REGIS dataset doesn't need to be checked against its own stored copy, but I thought it might give some insights.

Some observations:

  • the layer coordinate has a different dtype after reading from file
  • the layer coordinate attrs disappear after reading from file
  • spatial_ref is a mystery to me: it's in both datasets (downloaded, and read from cache), but if I don't drop it from the ds, I can't get any of the hashes to match.
  • including metadata when computing the hashes means they're all different (which maybe makes sense for the layer coordinate given the second point, but I don't know about the rest).
# %%
import hashlib

import dask
import nlmod
import xarray as xr

nlmod.util.get_color_logger("DEBUG", logger_name="nlmod")

# %%
cachedir = "."
extent = [204800, 205000, 438000, 438200]
regis_ds = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")
regis_ds = regis_ds.drop("spatial_ref")

print(dask.base.tokenize(regis_ds.to_dict()))
print(hashlib.sha256(regis_ds.to_netcdf()).hexdigest())

for coord in regis_ds.coords:
    print(coord, regis_ds[coord].dtype)
    print(nlmod.cache.hash_xarray_coords(regis_ds[coord], include_metadata=False))

for da in regis_ds.data_vars:
    print(da, regis_ds[da].dtype)
    print(nlmod.cache.hash_xarray_data_vars(regis_ds[da], include_metadata=False))

# %%
regis_from_cache = xr.open_dataset("regis.nc")
regis_from_cache = regis_from_cache.drop("spatial_ref")
regis_from_cache = regis_from_cache.assign_coords(
    {"layer": regis_from_cache["layer"].values.astype(regis_ds["layer"].dtype)}
)

print(dask.base.tokenize(regis_from_cache.to_dict()))
print(hashlib.sha256(regis_from_cache.to_netcdf()).hexdigest())

for coord in regis_from_cache.coords:
    print(coord, regis_from_cache[coord].dtype)
    print(nlmod.cache.hash_xarray_coords(regis_from_cache[coord], include_metadata=False))

for da in regis_from_cache.data_vars:
    print(da, regis_from_cache[da].dtype)
    print(nlmod.cache.hash_xarray_data_vars(regis_from_cache[da], include_metadata=False))

@dbrakenhoff
Collaborator Author

Wait for decision on #413 before closing this issue.

@dbrakenhoff dbrakenhoff reopened this Feb 28, 2025
@dbrakenhoff dbrakenhoff linked a pull request Feb 28, 2025 that will close this issue