Suggest the correct name when no key matches in the dataset #9943
Conversation
@@ -1611,6 +1611,11 @@ def __getitem__(
             return self._construct_dataarray(key)
         except KeyError as e:
             message = f"No variable named {key!r}. Variables on the dataset include {shorten_list_repr(list(self.variables.keys()), max_items=10)}"
+
+            best_guess = utils.did_you_mean(key, self.variables.keys())
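For reference, a helper like this can be built on Python's `difflib.get_close_matches`. The sketch below is illustrative only; the actual `utils.did_you_mean` added by this PR may differ in signature and message formatting:

```python
import difflib
from collections.abc import Hashable, Iterable


def did_you_mean(word: Hashable, possibilities: Iterable[Hashable], *, n: int = 10) -> str:
    """Return a 'Did you mean ...?' hint for a mistyped key (illustrative sketch)."""
    # difflib works on strings, so coerce arbitrary hashable keys to str
    # while remembering which original key each string came from.
    candidates = {str(p): p for p in possibilities}
    matches = difflib.get_close_matches(str(word), list(candidates), n=n, cutoff=0.6)
    if matches:
        return f"Did you mean one of {tuple(candidates[m] for m in matches)!r}?"
    return ""
```

With a helper along these lines, `__getitem__` can append the suggestion to `message` whenever it is non-empty.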
Amazing idea. I would print the best guess first, and then any others, so that it's easy to see.
Maybe we should just remove the "Variables on the dataset include ..." part? They try to do the same thing, I think.
Yeah, you could sort the whole list by similarity and then print that (truncated as above).
Now it prioritizes `best_guess`. If `best_guess` is empty, you could be working in the wrong dataset, so it's still nice to get some kind of clue about which dataset you're using.
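One way to implement the "sort the whole list by similarity, best guess first" idea from this thread (purely illustrative; not necessarily what was merged):

```python
import difflib


def variables_by_similarity(key: str, variables: list[str]) -> list[str]:
    # Rank every variable name by its similarity to the mistyped key so the
    # most plausible intended name is printed first in the error message.
    return sorted(
        variables,
        key=lambda name: difflib.SequenceMatcher(None, key, name).ratio(),
        reverse=True,
    )


# variables_by_similarity("temprature", ["precip", "time", "temperature"])
# puts "temperature" at the front of the truncated listing.
```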
Big +1 on this; I'd also enjoy this as a user. Is there any concern that some processes might be running

We could add an LRU cache over

Though I'm thinking that someone could query whether different keys exist; i.e.

Overall I say let's go ahead, and we can reassess if we hear reports of slowdowns. Folks can use

I thought about this case as well; my initial idea was to just use
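If the fuzzy matching ever did show up in profiles (e.g. code probing many missing keys in a tight loop), a small cache along the lines discussed above could be bolted on later. This is a hypothetical sketch of that idea, not code from the PR; note that `functools.lru_cache` requires hashable arguments, so the variable names have to be passed as a tuple rather than as the dataset's dict view:

```python
import difflib
from functools import lru_cache


@lru_cache(maxsize=128)
def _cached_suggestion(key: str, variable_names: tuple[str, ...]) -> str:
    # Hashable arguments only: the caller converts the keys view to a tuple,
    # e.g. _cached_suggestion(str(key), tuple(map(str, self.variables))).
    matches = difflib.get_close_matches(key, variable_names, n=10, cutoff=0.6)
    return f"Did you mean one of {tuple(matches)!r}?" if matches else ""
```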
Extremely cool!
Would definitely be up for doing more like this.
* main: (79 commits)
  fix mean for datetime-like using the respective time resolution unit (#9977)
  Add `time_unit` argument to `CFTimeIndex.to_datetimeindex` (#9965)
  remove gate and add a test (#9958)
  Remove repetitive that (replace it with the) (#9994)
  add shxarray to the xarray ecosystem list (#9995)
  Add `shards` to `valid_encodings` to enable sharded Zarr writing (#9948)
  Use flox for grouped first, last (#9986)
  Bump the actions group with 2 updates (#9989)
  Fix some typing (#9988)
  Remove unnecessary a article (#9980)
  Fix test_doc_example on big-endian systems (#9949)
  fix weighted polyfit for arrays with more than 2 dimensions (#9974)
  Use zarr-fixture to prevent thread leakage errors (#9967)
  remove dask-expr from CI runs, fix related tests (#9971)
  Update time coding tests to assert exact equality (#9961)
  cast type to PDDatetimeUnitOptions (#9963)
  Suggest the correct name when no key matches in the dataset (#9943)
  fix upstream dev issues (#9953)
  Relax nanosecond datetime restriction in CF time decoding (#9618)
  Remove outdated quantile test. (#9945)
  ...
I found the error raised when I make a typo in a dataset key not very helpful: the truncated list of variables hides the ones I actually wanted to see. Instead, this adds a fuzzy-matching function that gives the typical "Did you mean X?" suggestion (see the usage sketch below).
whats-new.rst
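A usage sketch of the intended behaviour; the exact wording of the merged error message may differ:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"temperature": ("x", np.arange(3)), "precipitation": ("x", np.zeros(3))}
)

try:
    ds["temprature"]  # note the typo
except KeyError as e:
    print(e)
    # Expected to include a hint along the lines of:
    # "Did you mean one of ('temperature',)?"
```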
Further reading:
python/cpython#16850
matplotlib/matplotlib#28115
https://en.wikipedia.org/wiki/Levenshtein_distance
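For context on the last link: the Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another (difflib uses a different, ratio-based similarity, but the intuition is the same). A compact dynamic-programming version:

```python
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # delete ca
                    curr[j - 1] + 1,  # insert cb
                    prev[j - 1] + (ca != cb),  # substitute ca with cb
                )
            )
        prev = curr
    return prev[-1]


print(levenshtein("temprature", "temperature"))  # 1 (a single missing "e")
```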