Add threshold weighted scores by nicholasloveday · Pull Request #609 · nci/scores · GitHub

Add threshold weighted scores #609


Merged

Conversation

@nicholasloveday (Collaborator) commented on Jul 12, 2024

Please work through the following checklists. Delete anything that isn't relevant.

Development for new xarray-based metrics

  • Works with n-dimensional data and includes reduce_dims, preserve_dims, and weights args (see the sketch after this checklist)
  • Typehints added
  • Docstrings complete and follow Napoleon (Google) style
  • Reference to paper/webpage is in docstring
  • Error handling added
  • Imported into the API
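
A minimal sketch of the calling convention this checklist implies (illustrative only; the names and logic are not the PR's actual code):

```python
from typing import Optional, Sequence

import xarray as xr


def example_metric(
    fcst: xr.DataArray,
    obs: xr.DataArray,
    *,
    reduce_dims: Optional[Sequence[str]] = None,
    preserve_dims: Optional[Sequence[str]] = None,
    weights: Optional[xr.DataArray] = None,
) -> xr.DataArray:
    """Toy per-point score averaged over dimensions (illustrative only)."""
    score = (fcst - obs) ** 2
    if weights is not None:
        score = score * weights  # xarray broadcasting handles mismatched dims
    if preserve_dims is not None:
        reduce_dims = [d for d in score.dims if d not in preserve_dims]
    return score.mean(dim=reduce_dims)  # reduce_dims=None averages all dims
```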

Testing of new xarray-based metrics

  • 100% unit test coverage
  • Test that the metric is compatible with Dask (see the sketch after this checklist)
  • Test that the metric works with inputs that contain NaNs
  • Test that broadcasting with xarray works
  • Test that both the reduce and preserve dims arguments work
  • Test that errors are raised as expected
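
A hedged sketch of the Dask-compatibility check from this list, reusing the toy example_metric defined in the previous sketch:

```python
import dask.array as da
import xarray as xr


def test_example_metric_works_with_dask():
    fcst = xr.DataArray(da.ones((4, 4), chunks=2), dims=["x", "y"])
    obs = xr.DataArray(da.zeros((4, 4), chunks=2), dims=["x", "y"])
    result = example_metric(fcst, obs, reduce_dims=["x", "y"])
    assert isinstance(result.data, da.Array)  # computation stayed lazy
    assert float(result.compute()) == 1.0
```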

Tutorial notebook

  • Short introduction to why you would use that metric and what it tells you
  • A link to a reference
  • A "things to try next" section at the end
  • Add notebook to Tutorial_Gallery.ipynb
  • Optional - a detailed discussion of how the metric works at the end of the notebook

Documentation

@nicholasloveday changed the title from "(under development) 608 add threshold weighted scores" to "(still under development) 608 add threshold weighted scores" on Jul 12, 2024
@nicholasloveday self-assigned this on Jul 12, 2024
@nicholasloveday changed the title from "(still under development) 608 add threshold weighted scores" to "(still need to create tutorial) 608 add threshold weighted scores" on Jul 12, 2024
@Steph-Chong (Collaborator) left a comment

Some minor feedback.

In included.md:

  1. Capitalisation suggestions (for consistency with other table entries)
  2. The four entries don't appear to be in alphabetical order

In the docstrings, the references don't appear to follow APA7.

Edit: Re. API docstrings rendering in readthedocs - I think you might need two backticks (not just one) around function names. (You can check when it is put into branch 228, or if you build the docs locally.)
Further edit: it looks like sometimes you need one backtick and sometimes you need two. It looks like @nikeethr used a mix (depending on use case) in the FSS API docstrings, and the FSS API docstrings appear to be rendering nicely in readthedocs.
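
For reference, a sketch of the two markups in a docstring (illustrative only, not scores code): in reStructuredText, double backticks render an inline code literal, while single backticks with an explicit role produce a Sphinx cross-reference.

```python
def my_function():
    """Illustrative docstring markup only.

    Double backticks render as an inline code literal:
    ``some_parameter``. Single backticks with an explicit role, such as
    :py:func:`scores.continuous.mse`, render as a cross-reference link
    in the built docs.
    """
```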

@tennlee (Collaborator) commented on Jul 12, 2024

I haven't reviewed the tests yet, but so I don't neglect to mention it later, I will be looking to see whether there are some consistency checks against unweighted scores when the weights are set neutrally.

nicholasloveday and others added 2 commits July 15, 2024 09:18
Co-authored-by: Stephanie Chong <168800785+Steph-Chong@users.noreply.github.com>
Signed-off-by: Nicholas Loveday <48701367+nicholasloveday@users.noreply.github.com>
@nicholasloveday (Collaborator, Author) commented

> I haven't reviewed the tests yet, but so I don't neglect to mention it later, I will be looking to see whether there are some consistency checks against unweighted scores when the weights are set neutrally.

You will find these consistency tests in test_threshold_weighted_score1. Huber Loss isn't in scores yet, so I copied the Huber Loss output from an independent calculation.
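
A minimal sketch of that kind of neutral-weight consistency check, assuming the tw_squared_error name and interval_where_one argument discussed in this PR (the actual tests may differ):

```python
import numpy as np
import xarray as xr
from scores.continuous import mse, tw_squared_error

rng = np.random.default_rng(42)
fcst = xr.DataArray(rng.normal(20, 5, 100), dims="time")
obs = xr.DataArray(rng.normal(20, 5, 100), dims="time")

# With a threshold weight of 1 everywhere, the threshold weighted
# squared error should collapse to the ordinary MSE.
tw = tw_squared_error(fcst, obs, interval_where_one=(-np.inf, np.inf))
assert np.isclose(float(tw), float(mse(fcst, obs)))
```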

@nicholasloveday changed the title from "(still need to create tutorial) 608 add threshold weighted scores" to "(ready for review) 608 add threshold weighted scores" on Jul 16, 2024
@Steph-Chong (Collaborator) commented on Jul 30, 2024

I couldn't work out how to leave these as code comments.

Some very minor feedback about the tutorial header levels (as they currently appear in branch 228 - I didn't rebuild the docs, so apologies if these have already been addressed):

  1. Header "Other Threshold Weighted Scores." has a full stop at the end. I am guessing the full stop isn't meant to be there.
  2. Some of the capitalisation of the headers is inconsistent. In particular,
    2a. "Rectangular weighting", could consider capitalising "w" in "weighting"
    2b. "Trapezoidal weighting", could consider capitalising "w" in "weighting"
    2c. "Things to try next", could consider capitalising "Things to Try Next" or "Things To Try Next"

@Steph-Chong (Collaborator) commented

In the "Other Threshold Weighted Scores" section, the first sentence says: "In this tutorial, we demonstrated the threshold weighted square error (or twMSE) function."

I think there may be a missing "mean" - as in "threshold weighted mean square error (or twMSE) function."

@nicholasloveday (Collaborator, Author) commented

Thanks @Steph-Chong for those suggestions. I've made some updates that address your feedback.

@nicholasloveday (Collaborator, Author) commented on Jul 30, 2024

There may be an issue with the plots again.
The plots don't appear to be rendering in:

  1. readthedocs: https://scores.readthedocs.io/en/228-documentation-testing-branch/tutorials/Threshold_Weighted_Scores.html
  2. nbviewer (from the link at the top of the readthedocs tutorial page): https://nbviewer.org/urls/scores.readthedocs.io/en/228-documentation-testing-branch/tutorials/Threshold_Weighted_Scores.ipynb
  3. nbviewer (from the preview screen at the bottom of the binder launch page): https://mybinder.org/v2/gh/nci/scores/228-documentation-testing-branch?labpath=tutorials/Threshold_Weighted_Scores.ipynb

The plots do seem to be rendering in binder itself (once launched).
Note re. nbviewer: nbviewer can render differently in the two locations listed above.

Looks like all of this is linked to "whatever makes the notebooks work when they are built on my machine". I re-ran the notebook in my environment and pushed the change up to 228. I'm sure we'll get to the root cause eventually, but for now I will just have to re-run the notebooks each time I push to 228.

I think that the issue isn't the environment. When I run the notebooks in VS Code, the figures don't render when I build the docs. When I run the notebooks in a Jupyter notebook in a web browser, they do render when I build the docs!

@tennlee (Collaborator) commented on Jul 31, 2024

I am making comments about the tutorial here, since the file is too large for inline comments through GitHub code review.

Regarding the tutorial, it notes that MSE can be interpreted as putting equal weight on all thresholds. Since it squares the error, it puts exponentially more weight on higher thresholds (since multiplicative biases at higher thresholds will consequentially have higher absolute errors which are then squared), so I think this isn't quite right to say. Also, the previous version noted the issue with MSE potentially obscuring certain things.

@tennlee (Collaborator) commented on Jul 31, 2024

With rectangular weighting, it's unclear if multiple segments/intervals could be supplied, e.g. the user might want to weight anything <= -3 and also anything >= 3 as "1", and "0" between -3 and positive 3. I'm assuming this is currently not possible, but the tutorial could do a better job of explaining what an "interval" is here.

@tennlee (Collaborator) commented on Jul 31, 2024

The trapezoidal weighting example given in the paragraph "Trapezoidal Weighting" isn't the same as the one presented in code in the example. Consider adding a plot to the "Trapezoidal Weighting" section to show what is described in the text, or consider adjusting the text description to match the later code and say "see below for an example"
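
A minimal sketch of the kind of plot being suggested, with made-up breakpoints rather than the tutorial's actual values:

```python
import matplotlib.pyplot as plt
import numpy as np

# Trapezoidal threshold weight: 0 below 5, ramping up to 1 between 5 and
# 10, flat at 1 from 10 to 20, ramping back down to 0 between 20 and 25.
thresholds = np.linspace(0, 30, 301)
weight = np.interp(thresholds, [5, 10, 20, 25], [0, 1, 1, 0])
plt.plot(thresholds, weight)
plt.xlabel("Decision threshold")
plt.ylabel("Threshold weight")
plt.title("Trapezoidal weighting (illustrative)")
plt.show()
```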

@tennlee (Collaborator) commented on Jul 31, 2024

> With rectangular weighting, it's unclear if multiple segments/intervals could be supplied, e.g. the user might want to weight anything <= -3 and also anything >= 3 as "1", and "0" between -3 and positive 3. I'm assuming this is currently not possible, but the tutorial could do a better job of explaining what an "interval" is here.

One option here would be to replace the threshold_where_one argument with a callable which returns 1 where the threshold weight should be one and 0 where it should be zero.

The nice thing about the current approach is that it's simpler for non-coders who might be less comfortable making up functions on the fly, and it also provides some guidance on the most common things to do. One way to support both approaches would be to support an optional argument which would allow a function to be passed instead.
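
A sketch of what that callable option might look like (hypothetical; this is not the merged API):

```python
import numpy as np


def two_sided_weight(thresholds: np.ndarray) -> np.ndarray:
    """Threshold weight 1 for thresholds <= -3 or >= 3, and 0 in between."""
    return np.where((thresholds <= -3) | (thresholds >= 3), 1.0, 0.0)
```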

I also think we might need to explain why these weighting parameters are being passed in, rather than using the already suitable "weights" argument to specify the weights. For example, why are we not using functions like weights_where_one or trap_weights to calculate a suitable weights array instead? Part of the answer is convenience and guidance for users, but I think it could be confusing to people who are thinking about the weights. Do the functions treat the weights in "weights" differently from the weightings which result from the various weighting parameters that are passed in?

@tennlee (Collaborator) commented on Jul 31, 2024

It might be interesting to add an example for a variable which has both positive and negative values, either where the interval of interest is around the mean, or where it's on the extremes. It's not clear to me how this can be used on something like temperature, where very hot and very cold extremes are of similar (although potentially non-equal) interest. I really like the examples that are currently included, and I'm a bit worried about overcomplicating the tutorial, but I want to make sure this approach is suitable for that situation; an example would both put the implementation to the test and demonstrate appropriate practice.

Examples could include human comfort ranges for temperature or humidity, agriculture (where both heat and frost/ice could be problems), or operating equipment like trains, where again there would be a temperature range.
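
A hedged sketch of the comfort-range idea, assuming the interval_where_one and interval_where_positive parameters discussed in this PR (names may differ from the merged API):

```python
import numpy as np
import xarray as xr
from scores.continuous import tw_squared_error

rng = np.random.default_rng(0)
obs = xr.DataArray(rng.normal(21, 8, 365), dims="day")
fcst = obs + rng.normal(0, 2, 365)

# Full weight on decision thresholds in the 18-24 C comfort band,
# tapering linearly to zero at 10 C and 30 C.
result = tw_squared_error(
    fcst,
    obs,
    interval_where_one=(18, 24),
    interval_where_positive=(10, 30),
)
```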

@rob-taggart (Collaborator) commented

> I am making comments about the tutorial here, since the file is too large for inline comments through GitHub code review.
>
> Regarding the tutorial, it notes that MSE can be interpreted as putting equal weight on all thresholds. Since it squares the error, it puts exponentially more weight on higher thresholds (since multiplicative biases at higher thresholds will consequentially have higher absolute errors which are then squared), so I think this isn't quite right to say. Also, the previous version noted the issue with MSE potentially obscuring certain things.

@tennlee, the way you have put it is not quite correct.

Take the following example. My decision threshold is 40C. If tomorrow's forecast exceeds 40C, I will tell the workers on my construction site not to come into work tomorrow. Now consider the case where the forecast for tomorrow was 20C and the observation was 10C. That is a large error, and the squared error is even larger (100C^2). However, I still made the correct decision on the basis of the forecast to not close the construction site. So the size of the penalty can be independent of whether the forecast was good enough for people whose decision thresholds are 40C or higher. In this case, threshold-weighted squared error actually gives a penalty of 0.

What Nick says is correct. MSE can be interpreted as putting equal weight on all thresholds. The Ehm et al. (2016) paper explains why. Happy to talk about this in another forum rather than a GitHub comment.
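
For readers without the paper to hand, the claim can be sketched as follows (notation simplified; see Ehm et al. (2016) for the precise statement). Squared error admits a mixture representation over elementary scores indexed by decision thresholds $\theta$:

$$
(x - y)^2 = 2 \int_{-\infty}^{\infty} \big|\mathbf{1}\{x > \theta\} - \mathbf{1}\{y > \theta\}\big| \, |y - \theta| \, d\theta ,
$$

so the Lebesgue measure $d\theta$ weights every threshold equally; threshold weighting replaces $d\theta$ with $w(\theta)\,d\theta$ for a chosen weight function $w$.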

@rob-taggart (Collaborator) commented

@nicholasloveday, one of @tennlee's comments about weights makes me wonder:

  1. whether there is a weights argument in the threshold weighted scores, which applies specified weights to each of the individual scores prior to calculating the mean across some dimensions; and
  2. how this kind of weight differs from threshold-weighting, and whether that is clear in the documentation.

I noticed in your docstrings that you talk about weights where one, etc., which is the second type of weight listed above, not the first.

To avoid confusion, you may need to talk consistently about threshold-weights (and explain that this is related to decision thresholds) to distinguish this from the other type of weighting, both in the tutorial and in the docstrings for your functions.
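
A sketch of how the two kinds of weighting might appear side by side in one call (argument names follow the PR discussion and may not match the merged API exactly):

```python
import numpy as np
import xarray as xr
from scores.continuous import tw_squared_error

rng = np.random.default_rng(1)
fcst = xr.DataArray(rng.normal(30, 10, (10, 5)), dims=["time", "station"])
obs = xr.DataArray(rng.normal(30, 10, (10, 5)), dims=["time", "station"])

# Per-point averaging weights (e.g. by station importance) -- distinct
# from the threshold-weighting applied inside the score itself.
station_weights = xr.DataArray(np.linspace(0.5, 1.5, 5), dims="station")

result = tw_squared_error(
    fcst,
    obs,
    interval_where_one=(40, np.inf),  # threshold-weights: emphasise >= 40
    weights=station_weights,          # ordinary weights: spatial averaging
)
```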

@tennlee (Collaborator) commented on Jul 31, 2024

>> I am making comments about the tutorial here, since the file is too large for inline comments through GitHub code review.
>> Regarding the tutorial, it notes that MSE can be interpreted as putting equal weight on all thresholds. Since it squares the error, it puts exponentially more weight on higher thresholds (since multiplicative biases at higher thresholds will consequentially have higher absolute errors which are then squared), so I think this isn't quite right to say. Also, the previous version noted the issue with MSE potentially obscuring certain things.
>
> @tennlee, the way you have put it is not quite correct.
>
> Take the following example. My decision threshold is 40C. If tomorrow's forecast exceeds 40C, I will tell the workers on my construction site not to come into work tomorrow. Now consider the case where the forecast for tomorrow was 20C and the observation was 10C. That is a large error, and the squared error is even larger (100C^2). However, I still made the correct decision on the basis of the forecast to not close the construction site. So the size of the penalty can be independent of whether the forecast was good enough for people whose decision thresholds are 40C or higher. In this case, threshold-weighted squared error actually gives a penalty of 0.
>
> What Nick says is correct. MSE can be interpreted as putting equal weight on all thresholds. The Ehm et al. (2016) paper explains why. Happy to talk about this in another forum rather than a GitHub comment.

Thanks, I've followed up offline.

My comments aren't about what's right and wrong, but about whether things are clear when introduced to a new reader with a general mathematical understanding of errors but without prior knowledge of decision thresholds, i.e. to help people avoid confusion about the exact disagreement being expressed in the comment.

@tennlee (Collaborator) commented on Jul 31, 2024

> @nicholasloveday, one of @tennlee's comments about weights makes me wonder:
>
> 1. whether there is a weights argument in the threshold weighted scores, which applies specified weights to each of the individual scores prior to calculating the mean across some dimensions; and
>
> 2. how this kind of weight differs from threshold-weighting, and whether that is clear in the documentation.
>
> I noticed in your docstrings that you talk about weights where one, etc., which is the second type of weight listed above, not the first.
>
> To avoid confusion, you may need to talk consistently about threshold-weights (and explain that this is related to decision thresholds) to distinguish this from the other type of weighting, both in the tutorial and in the docstrings for your functions.

Yes, spot on.

@tennlee (Collaborator) commented on Jul 31, 2024

> Take the following example. My decision threshold is 40C. If tomorrow's forecast exceeds 40C, I will tell the workers on my construction site not to come into work tomorrow. Now consider the case where the forecast for tomorrow was 20C and the observation was 10C. That is a large error, and the squared error is even larger (100C^2). However, I still made the correct decision on the basis of the forecast to not close the construction site. So the size of the penalty can be independent of whether the forecast was good enough for people whose decision thresholds are 40C or higher. In this case, threshold-weighted squared error actually gives a penalty of 0.

This would be a nice example to include.
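
A sketch of that example in code, assuming the tw_squared_error name and interval_where_one parameter from this PR:

```python
import numpy as np
import xarray as xr
from scores.continuous import tw_squared_error

fcst = xr.DataArray([20.0], dims="case")
obs = xr.DataArray([10.0], dims="case")

# All threshold weight sits at or above 40C; a 20C forecast against a
# 10C observation never crosses a weighted decision threshold, so the
# penalty is 0 even though the unweighted squared error is 100.
result = tw_squared_error(fcst, obs, interval_where_one=(40, np.inf))
assert float(result) == 0.0
```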

@nicholasloveday (Collaborator, Author) commented

> The trapezoidal weighting example given in the paragraph "Trapezoidal Weighting" isn't the same as the one presented in code in the example. Consider adding a plot to the "Trapezoidal Weighting" section to show what is described in the text, or consider adjusting the text description to match the later code and say "see below for an example".

Added a text description in the trapezoidal weighting paragraph. I added rather than replaced, as it is important to show how trapezoidal weights can be on either one side or both sides of the rectangular weights.

>> Take the following example. My decision threshold is 40C. If tomorrow's forecast exceeds 40C, I will tell the workers on my construction site not to come into work tomorrow. Now consider the case where the forecast for tomorrow was 20C and the observation was 10C. That is a large error, and the squared error is even larger (100C^2). However, I still made the correct decision on the basis of the forecast to not close the construction site. So the size of the penalty can be independent of whether the forecast was good enough for people whose decision thresholds are 40C or higher. In this case, threshold-weighted squared error actually gives a penalty of 0.
>
> This would be a nice example to include.

I've added a modified example that lends itself more to a range of decision thresholds than to a single decision threshold.

> I noticed in your docstrings that you talk about weights where one, etc., which is the second type of weight listed above, not the first.
>
> To avoid confusion, you may need to talk consistently about threshold-weights (and explain that this is related to decision thresholds) to distinguish this from the other type of weighting, both in the tutorial and in the docstrings for your functions.

I have updated the tutorial and docstrings to clarify this.

> With rectangular weighting, it's unclear if multiple segments/intervals could be supplied, e.g. the user might want to weight anything <= -3 and also anything >= 3 as "1", and "0" between -3 and positive 3. I'm assuming this is currently not possible, but the tutorial could do a better job of explaining what an "interval" is here.
>
> It might be interesting to add an example for a variable which has both positive and negative values, either where the interval of interest is around the mean, or where it's on the extremes. It's not clear to me how this can be used on something like temperature, where very hot and very cold extremes are of similar (although potentially non-equal) interest. I really like the examples that are currently included, and I'm a bit worried about overcomplicating the tutorial, but I want to make sure this approach is suitable for that situation; an example would both put the implementation to the test and demonstrate appropriate practice.
>
> Examples could include human comfort ranges for temperature or humidity, agriculture (where both heat and frost/ice could be problems), or operating equipment like trains, where again there would be a temperature range.

Added an example in the tutorial. Also added an additional thing to try in the "things to try next" section.

@tennlee changed the title from "(ready for review) 608 add threshold weighted scores" to "608 add threshold weighted scores" on Aug 2, 2024
@tennlee merged commit 3f7f7a9 into nci:develop on Aug 2, 2024
10 checks passed
@tennlee changed the title from "608 add threshold weighted scores" to "Add threshold weighted scores" on Aug 2, 2024