Cache is never used because hashes do not match #389
I tried the following in a GH codespace (Debian GNU/Linux 12 (bookworm)):
What I get from this:
I don't know why the hashes are different, but I guess it is not something we can easily solve. The intention of the hash is to check whether the pickled function arguments were created together with the cached netCDF file, in other words whether the .pklz and the .nc file belong together (a rough sketch of such a check is given below). This is a nice-to-have check, but the cache will work without it. So I would propose to disable the hash check until we find a solution for this.

Package versions used:

script 1 yields:

script 2 yields:

hashes.txt:

hashesv2.txt:
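For reference, a rough sketch of the pairing check described above. The file handling and the "dataset_hash" key are hypothetical illustrations, not nlmod's actual implementation:

```python
import hashlib
import pickle


def netcdf_sha256(nc_path):
    """Hash the cached netCDF file on disk."""
    with open(nc_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def cache_files_belong_together(nc_path, pklz_path):
    """Hypothetical check that the pickled arguments and the netCDF file match.

    Assumes the pickle stores the hash of the netCDF file it was written
    alongside under a (hypothetical) "dataset_hash" key.
    """
    with open(pklz_path, "rb") as f:
        cached = pickle.load(f)
    return cached.get("dataset_hash") == netcdf_sha256(nc_path)
```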
Solved with #395
This seems to be happening again. After a kernel restart, the hash no longer matches. To reproduce:

```python
import nlmod
import xarray as xr
import dask

nlmod.util.get_color_logger("DEBUG", logger_name="nlmod")

cachedir = "."
extent = [204800, 205000, 438000, 438200]
regis_ds = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")
print(dask.base.tokenize(regis_ds.to_dict()))
```

Output e.g.:

Restart the kernel:

```python
import xarray as xr
import dask

regis_from_cache = xr.open_dataset("regis.nc")
hash_cache_direct = dask.base.tokenize(regis_from_cache.to_dict())
print(hash_cache_direct)
regis_from_cache.close()
```

Output e.g.:

Package versions:

```python
from importlib.metadata import version

{pkg: version(pkg) for pkg in ["dask", "xarray", "nlmod"]}
```
In #395 we moved from dask tokenize to hashlib. Are you seeing the same behavior with hashlib?
We only moved to hashlib for the hash of the entire netCDF. In this case the hash is still computed with dask tokenize.
Would something like the following work?

```python
import hashlib
import json

import numpy as np
import xarray as xr


def hash_xarray_coord(coord):
    """
    Create a hash of an xarray coordinate object using array bytes and metadata.

    Parameters
    ----------
    coord : xarray.DataArray
        The xarray coordinate object to hash

    Returns
    -------
    str
        The hexadecimal hash string
    """
    # Get the raw bytes from the numpy array values
    values_bytes = coord.values.tobytes()

    # Get metadata as JSON
    metadata = {
        'name': coord.name,
        'dims': coord.dims,
        'attrs': coord.attrs,
        'dtype': str(coord.dtype),
        'shape': coord.shape,
    }
    metadata_bytes = json.dumps(metadata, sort_keys=True).encode('utf-8')

    # Combine both sets of bytes for hashing
    combined_bytes = values_bytes + metadata_bytes

    # Create a hash of the combined bytes
    hash_obj = hashlib.sha256(combined_bytes)
    return hash_obj.hexdigest()
```

I'm not sure whether to also include the metadata in the hash, or whether hashing just the `values_bytes` is sufficient.
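As a quick sanity check of this proposal (hypothetical usage; assumes `ds` holds the downloaded dataset and that the working directory is writable), the coordinate hashes can be compared before and after a round trip through a netCDF file:

```python
# Hypothetical check: hash each coordinate before and after a netCDF round trip.
ds.to_netcdf("roundtrip.nc")
ds2 = xr.open_dataset("roundtrip.nc")

for name in ds.coords:
    same = hash_xarray_coord(ds[name]) == hash_xarray_coord(ds2[name])
    print(name, "matches" if same else "differs")
```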
Another suggestion: you could also fill an in-memory BytesIO object with `file_buffer = BytesIO(); coords.to_netcdf(file_buffer); hash(file_buffer)`, which might provide a leaner solution.
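A runnable sketch of that suggestion (assumes a cached `regis.nc` exists; writing to a file-like object is done with the scipy engine here, which produces netCDF3, so string coordinates may be encoded differently):

```python
import hashlib
from io import BytesIO

import xarray as xr

# Sketch of the in-memory BytesIO suggestion above.
ds = xr.open_dataset("regis.nc")

file_buffer = BytesIO()
# Write only the coordinates to the in-memory buffer; the scipy engine
# supports writing to file-like objects.
ds.coords.to_dataset().to_netcdf(file_buffer, engine="scipy")

coord_hash = hashlib.sha256(file_buffer.getvalue()).hexdigest()
print(coord_hash)
```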
The latter solution produces the same problem for me: the hash changes on load... There must be something going on within the netCDF that xarray produces that causes something to change every time the file is loaded?

```python
import hashlib

# ds is the downloaded dataset (e.g. REGIS) from the snippets above.
nc_bytes = ds.coords.to_dataset().to_netcdf()
hash1 = hashlib.sha256(nc_bytes).hexdigest()

ds.to_netcdf("test.nc")
ds2 = xr.open_dataset("test.nc")
nc_bytes2 = ds2.coords.to_dataset().to_netcdf()
hash2 = hashlib.sha256(nc_bytes2).hexdigest()
```

Your first solution seems to work for me.
Maybe found the culprit too: the `layer` dtype changes from `object` to `<U6` when reloading from disk... EDIT: I got the dtype order wrong initially. It is `object` in memory, `<U6` after reload.
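A minimal, self-contained way to check this dtype change (illustrative layer names; the exact resulting dtype depends on the xarray/netCDF versions and the encoding used):

```python
import numpy as np
import xarray as xr

# Build a tiny dataset with an object-dtype string coordinate, similar to the
# REGIS "layer" coordinate, and inspect the dtype before and after a round trip.
ds = xr.Dataset(coords={"layer": np.array(["HLc", "BXz1", "KRz5"], dtype=object)})
print(ds["layer"].dtype)  # object

ds.to_netcdf("layer_roundtrip.nc")
reloaded = xr.open_dataset("layer_roundtrip.nc")
print(reloaded["layer"].dtype)  # may come back as a fixed-width <U dtype
```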
The following should work then for the data vars:

```python
import hashlib
import json

import numpy as np
import xarray as xr


def hash_xarray_data_var(data_array):
    """
    Create a hash of an xarray DataArray object using array bytes and metadata.

    Parameters
    ----------
    data_array : xarray.DataArray
        The xarray DataArray object to hash

    Returns
    -------
    str
        The hexadecimal hash string
    """
    # Get the raw bytes from the numpy array values
    values_bytes = data_array.values.tobytes()

    # Hash each coordinate separately
    coord_hashes = {}
    for coord_name, coord in data_array.coords.items():
        coord_hashes[coord_name] = hash_xarray_coord(coord)

    # Get metadata as JSON
    metadata = {
        'name': data_array.name,
        'dims': data_array.dims,
        'attrs': data_array.attrs,
        'dtype': str(data_array.dtype),
        'shape': data_array.shape,
        'coord_hashes': coord_hashes,
    }
    metadata_bytes = json.dumps(metadata, sort_keys=True).encode('utf-8')

    # Combine both sets of bytes for hashing
    combined_bytes = values_bytes + metadata_bytes

    # Create a hash of the combined bytes
    hash_obj = hashlib.sha256(combined_bytes)
    return hash_obj.hexdigest()
```
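Hypothetical usage, assuming `regis_ds` and the cached `regis.nc` from the snippets above, comparing the per-variable hashes of the in-memory dataset with those of the cached copy:

```python
# Hypothetical comparison of data-variable hashes between the in-memory
# dataset and the copy stored in the cache file.
ds_cache = xr.open_dataset("regis.nc")

for name in regis_ds.data_vars:
    match = hash_xarray_data_var(regis_ds[name]) == hash_xarray_data_var(ds_cache[name])
    print(name, "matches" if match else "differs")
```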
This is a script to study what the differences are between downloaded REGIS and the copy loaded from netCDF. Maybe not the best example for our caching stuff, since the REGIS dataset doesn't need to be checked against its own stored copy, but I thought it might give some insights. Some observations:

```python
# %%
import hashlib

import dask
import nlmod
import xarray as xr

nlmod.util.get_color_logger("DEBUG", logger_name="nlmod")

# %%
cachedir = "."
extent = [204800, 205000, 438000, 438200]
regis_ds = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")
regis_ds = regis_ds.drop("spatial_ref")

print(dask.base.tokenize(regis_ds.to_dict()))
print(hashlib.sha256(regis_ds.to_netcdf()).hexdigest())

for coord in regis_ds.coords:
    print(coord, regis_ds[coord].dtype)
    print(nlmod.cache.hash_xarray_coords(regis_ds[coord], include_metadata=False))

for da in regis_ds.data_vars:
    print(da, regis_ds[da].dtype)
    print(nlmod.cache.hash_xarray_data_vars(regis_ds[da], include_metadata=False))

# %%
regis_from_cache = xr.open_dataset("regis.nc")
regis_from_cache = regis_from_cache.drop("spatial_ref")
regis_from_cache = regis_from_cache.assign_coords(
    {"layer": regis_from_cache["layer"].values.astype(regis_ds["layer"].dtype)}
)

print(dask.base.tokenize(regis_from_cache.to_dict()))
print(hashlib.sha256(regis_from_cache.to_netcdf()).hexdigest())

for coord in regis_from_cache.coords:
    print(coord, regis_from_cache[coord].dtype)
    print(nlmod.cache.hash_xarray_coords(regis_from_cache[coord], include_metadata=False))

for da in regis_from_cache.data_vars:
    print(da, regis_from_cache[da].dtype)
    print(nlmod.cache.hash_xarray_data_vars(regis_from_cache[da], include_metadata=False))
```
Wait for decision on #413 before closing this issue.
At the moment I cannot use the caching functionality in one of my projects. This is confirmed to happen for the REGIS and AHN datasets. For some reason the hashes never match, causing the cache to be invalidated. However, when I try to reproduce this in a separate minimal example, the caching works fine.

If anyone has any ideas, I'd love to hear them; otherwise I will post updates here as my investigation continues.

This is working fine...

Versions: