Add threshold weighted scores by nicholasloveday · Pull Request #609 · nci/scores · GitHub

Add threshold weighted scores #609


Merged

Conversation

@nicholasloveday (Collaborator) commented on Jul 12, 2024

Please work through the following checklists. Delete anything that isn't relevant.

Development for new xarray-based metrics

  • Works with n-dimensional data and includes reduce_dims, preserve_dims, and weights args (see the sketch after this checklist)
  • Typehints added
  • Docstrings complete and follow Napoleon (Google) style
  • Reference to paper/webpage is in docstring
  • Error handling added
  • Imported into the API
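
A minimal sketch of the calling convention this checklist implies (illustrative only; the names and logic are not the PR's actual code):

```python
from typing import Optional, Sequence

import xarray as xr


def example_metric(
    fcst: xr.DataArray,
    obs: xr.DataArray,
    *,
    reduce_dims: Optional[Sequence[str]] = None,
    preserve_dims: Optional[Sequence[str]] = None,
    weights: Optional[xr.DataArray] = None,
) -> xr.DataArray:
    """Toy per-point score averaged over dimensions (illustrative only)."""
    score = (fcst - obs) ** 2
    if weights is not None:
        score = score * weights  # xarray broadcasting handles mismatched dims
    if preserve_dims is not None:
        reduce_dims = [d for d in score.dims if d not in preserve_dims]
    return score.mean(dim=reduce_dims)  # reduce_dims=None averages all dims
```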

Testing of new xarray-based metrics

  • 100% unit test coverage
  • Test that the metric is compatible with Dask (see the sketch after this checklist)
  • Test that the metric works with inputs that contain NaNs
  • Test that broadcasting with xarray works
  • Test that both the reduce and preserve dims arguments work
  • Test that errors are raised as expected
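
A hedged sketch of the Dask-compatibility check from this list, reusing the toy example_metric defined in the previous sketch:

```python
import dask.array as da
import xarray as xr


def test_example_metric_works_with_dask():
    fcst = xr.DataArray(da.ones((4, 4), chunks=2), dims=["x", "y"])
    obs = xr.DataArray(da.zeros((4, 4), chunks=2), dims=["x", "y"])
    result = example_metric(fcst, obs, reduce_dims=["x", "y"])
    assert isinstance(result.data, da.Array)  # computation stayed lazy
    assert float(result.compute()) == 1.0
```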

Tutorial notebook

  • Short introduction to why you would use that metric and what it tells you
  • A link to a reference
  • A "things to try next" section at the end
  • Add notebook to Tutorial_Gallery.ipynb
  • Optional - a detailed discussion of how the metric works at the end of the notebook

Documentation

@nicholasloveday changed the title from "(under development) 608 add threshold weighted scores" to "(still under development) 608 add threshold weighted scores" on Jul 12, 2024
@nicholasloveday self-assigned this on Jul 12, 2024
@nicholasloveday changed the title from "(still under development) 608 add threshold weighted scores" to "(still need to create tutorial) 608 add threshold weighted scores" on Jul 12, 2024
@Steph-Chong (Collaborator) left a comment

Some minor feedback.

In included.md:

  1. Capitalisation suggestions (for consistency with other table entries)
  2. The four entries don't appear to be in alphabetical order

In the docstrings, the references don't appear to follow APA7.

Edit: Re. API docstrings rendering in readthedocs - I think you might need two backticks (not just one) around function names. (You can check when it is put into branch 228, or if you build the docs locally.)
Further edit: it looks like sometimes you need one backtick and sometimes you need two. It looks like @nikeethr used a mix (depending on use case) in the FSS API docstrings, and the FSS API docstrings appear to be rendering nicely in readthedocs.
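
For reference, a sketch of the two markups in a docstring (illustrative only, not scores code): in reStructuredText, double backticks render an inline code literal, while single backticks with an explicit role produce a Sphinx cross-reference.

```python
def my_function():
    """Illustrative docstring markup only.

    Double backticks render as an inline code literal:
    ``some_parameter``. Single backticks with an explicit role, such as
    :py:func:`scores.continuous.mse`, render as a cross-reference link
    in the built docs.
    """
```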

@tennlee (Collaborator) commented on Jul 12, 2024

I haven't reviewed the tests yet, but so I don't neglect to mention it later, I will be looking to see whether there are some consistency checks against unweighted scores when the weights are set neutrally.

nicholasloveday and others added 2 commits July 15, 2024 09:18
Co-authored-by: Stephanie Chong <168800785+Steph-Chong@users.noreply.github.com>
Signed-off-by: Nicholas Loveday <48701367+nicholasloveday@users.noreply.github.com>
@nicholasloveday (Collaborator, Author) commented

> I haven't reviewed the tests yet, but so I don't neglect to mention it later, I will be looking to see whether there are some consistency checks against unweighted scores when the weights are set neutrally.

You will find these consistency tests in test_threshold_weighted_score1. Huber Loss isn't in scores yet, so I copied the Huber Loss output from an independent calculation.
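
A minimal sketch of that kind of neutral-weight consistency check, assuming the tw_squared_error name and interval_where_one argument discussed in this PR (the actual tests may differ):

```python
import numpy as np
import xarray as xr
from scores.continuous import mse, tw_squared_error

rng = np.random.default_rng(42)
fcst = xr.DataArray(rng.normal(20, 5, 100), dims="time")
obs = xr.DataArray(rng.normal(20, 5, 100), dims="time")

# With a threshold weight of 1 everywhere, the threshold weighted
# squared error should collapse to the ordinary MSE.
tw = tw_squared_error(fcst, obs, interval_where_one=(-np.inf, np.inf))
assert np.isclose(float(tw), float(mse(fcst, obs)))
```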

@nicholasloveday changed the title from "(still need to create tutorial) 608 add threshold weighted scores" to "(ready for review) 608 add threshold weighted scores" on Jul 16, 2024
@Steph-Chong (Collaborator) commented on Jul 30, 2024

I couldn't work out how to leave these as code comments.

Some very minor feedback about the tutorial header levels (as they currently appear in branch 228 - I didn't rebuild the docs, so apologies if these have already been addressed):

  1. Header "Other Threshold Weighted Scores." has a full stop at the end. I am guessing the full stop isn't meant to be there.
  2. Some of the capitalisation of the headers is inconsistent. In particular,
    2a. "Rectangular weighting", could consider capitalising "w" in "weighting"
    2b. "Trapezoidal weighting", could consider capitalising "w" in "weighting"
    2c. "Things to try next", could consider capitalising "Things to Try Next" or "Things To Try Next"

@Steph-Chong (Collaborator) commented

In the "Other Threshold Weighted Scores" section, the first sentence says: "In this tutorial, we demonstrated the threshold weighted square error (or twMSE) function."

I think there may be a missing "mean" - as in "threshold weighted mean square error (or twMSE) function."

@nicholasloveday (Collaborator, Author) commented

Thanks @Steph-Chong for those suggestions. I've made some updates that address your feedback.

@nicholasloveday (Collaborator, Author) commented on Jul 30, 2024

There may be an issue with the plots again.
The plots don't appear to be rendering in:

  1. readthedocs: https://scores.readthedocs.io/en/228-documentation-testing-branch/tutorials/Threshold_Weighted_Scores.html
  2. nbviewer (from the link at the top of the readthedocs tutorial page): https://nbviewer.org/urls/scores.readthedocs.io/en/228-documentation-testing-branch/tutorials/Threshold_Weighted_Scores.ipynb
  3. nbviewer (from the preview screen at the bottom of the binder launch page): https://mybinder.org/v2/gh/nci/scores/228-documentation-testing-branch?labpath=tutorials/Threshold_Weighted_Scores.ipynb

The plots do seem to be rendering in binder itself (once launched).
Note re. nbviewer: nbviewer can render differently in the two locations listed above.

Looks like all of this is linked to "whatever makes the notebooks work when they are built on my machine". I re-ran the notebook in my environment and pushed the change up to 228. I'm sure we'll get to the root cause eventually, but for now I will just have to re-run the notebooks each time I push to 228.

I think that the issue isn't the environment. When I run the notebooks in VS Code, the figures don't render when I build the docs. When I run the notebooks in a Jupyter notebook in a web browser, they do render when I build the docs!

@tennlee (Collaborator) commented on Jul 31, 2024

I am making comments about the tutorial here, since the file is too large for inline comments through GitHub code review.

Regarding the tutorial, it notes that MSE can be interpreted as putting equal weight on all thresholds. Since it squares the error, it puts exponentially more weight on higher thresholds (since multiplicative biases at higher thresholds will consequentially have higher absolute errors which are then squared), so I think this isn't quite right to say. Also, the previous version noted the issue with MSE potentially obscuring certain things.

@tennlee (Collaborator) commented on Jul 31, 2024

With rectangular weighting, it's unclear if multiple segments/intervals could be supplied, e.g. the user might want to weight anything <= -3 and also anything >= 3 as "1", and "0" between -3 and positive 3. I'm assuming this is currently not possible, but the tutorial could do a better job of explaining what an "interval" is here.

@tennlee (Collaborator) commented on Jul 31, 2024

The trapezoidal weighting example given in the paragraph "Trapezoidal Weighting" isn't the same as the one presented in code in the example. Consider adding a plot to the "Trapezoidal Weighting" section to show what is described in the text, or consider adjusting the text description to match the later code and say "see below for an example"
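
A minimal sketch of the kind of plot being suggested, with made-up breakpoints rather than the tutorial's actual values:

```python
import matplotlib.pyplot as plt
import numpy as np

# Trapezoidal threshold weight: 0 below 5, ramping up to 1 between 5 and
# 10, flat at 1 from 10 to 20, ramping back down to 0 between 20 and 25.
thresholds = np.linspace(0, 30, 301)
weight = np.interp(thresholds, [5, 10, 20, 25], [0, 1, 1, 0])
plt.plot(thresholds, weight)
plt.xlabel("Decision threshold")
plt.ylabel("Threshold weight")
plt.title("Trapezoidal weighting (illustrative)")
plt.show()
```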

@tennlee (Collaborator) commented on Jul 31, 2024

> With rectangular weighting, it's unclear if multiple segments/intervals could be supplied, e.g. the user might want to weight anything <= -3 and also anything >= 3 as "1", and "0" between -3 and positive 3. I'm assuming this is currently not possible, but the tutorial could do a better job of explaining what an "interval" is here.

One option here would be to replace the threshold_where_one argument with a callable which returns 1 where the threshold weight should be one and 0 where it should be zero.

The nice thing about the current approach is that it's simpler for non-coders who might be less comfortable making up functions on the fly, and it also provides some guidance on the most common things to do. One way to support both approaches would be to support an optional argument which would allow a function to be passed instead.
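
A sketch of what that callable option might look like (hypothetical; this is not the merged API):

```python
import numpy as np


def two_sided_weight(thresholds: np.ndarray) -> np.ndarray:
    """Threshold weight 1 for thresholds <= -3 or >= 3, and 0 in between."""
    return np.where((thresholds <= -3) | (thresholds >= 3), 1.0, 0.0)
```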

I also think we might need to explain why these weighting parameters are being passed in, rather than using the already suitable "weights" argument to specify the weights. For example, why are we not using functions like weights_where_one or trap_weights to calculate a suitable weights array instead? Part of the answer is convenience and guidance for users, but I think it could be confusing to people who are thinking about the weights. Do the functions treat the weights in "weights" differently from the weightings which result from the various weighting parameters that are passed in?

@tennlee (Collaborator) commented on Jul 31, 2024

It might be interesting to add an example for a variable which has both positive and negative values, either where the interval of interest is around the mean, or where it's on the extremes. It's not clear to me how this can be used on something like temperature, where very hot and very cold extremes are of similar (although potentially non-equal) interest. I really like the examples that are currently included, and I'm a bit worried about overcomplicating the tutorial, but I want to make sure this approach is suitable for that situation; an example would both put the implementation to the test and demonstrate appropriate practice.

Examples could include human comfort ranges for temperature or humidity, agriculture (where both heat and frost/ice could be problems), or operating equipment like trains, where again there would be a temperature range.
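
A hedged sketch of the comfort-range idea, assuming the interval_where_one and interval_where_positive parameters discussed in this PR (names may differ from the merged API):

```python
import numpy as np
import xarray as xr
from scores.continuous import tw_squared_error

rng = np.random.default_rng(0)
obs = xr.DataArray(rng.normal(21, 8, 365), dims="day")
fcst = obs + rng.normal(0, 2, 365)

# Full weight on decision thresholds in the 18-24 C comfort band,
# tapering linearly to zero at 10 C and 30 C.
result = tw_squared_error(
    fcst,
    obs,
    interval_where_one=(18, 24),
    interval_where_positive=(10, 30),
)
```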

@rob-taggart (Collaborator) commented

> I am making comments about the tutorial here, since the file is too large for inline comments through GitHub code review.
>
> Regarding the tutorial, it notes that MSE can be interpreted as putting equal weight on all thresholds. Since it squares the error, it puts exponentially more weight on higher thresholds (since multiplicative biases at higher thresholds will consequentially have higher absolute errors which are then squared), so I think this isn't quite right to say. Also, the previous version noted the issue with MSE potentially obscuring certain things.

@tennlee, the way you have put it is not quite correct.

Take the following example. My decision threshold is 40C. If tomorrow's forecast exceeds 40C, I will tell the workers on my construction site not to come into work tomorrow. Now consider the case where the forecast for tomorrow was 20C and the observation was 10C. That is a large error, and the squared error is even larger (100C^2). However, I still made the correct decision on the basis of the forecast to not close the construction site. So the size of the penalty can be independent of whether the forecast was good enough for people whose decision thresholds are 40C or higher. In this case, threshold-weighted squared error actually gives a penalty of 0.

What Nick says is correct. MSE can be interpreted as putting equal weight on all thresholds. The Ehm et al. (2016) paper explains why. Happy to talk about this in another forum rather than a GitHub comment.
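
For readers without the paper to hand, the claim can be sketched as follows (notation simplified; see Ehm et al. (2016) for the precise statement). Squared error admits a mixture representation over elementary scores indexed by decision thresholds $\theta$:

$$
(x - y)^2 = 2 \int_{-\infty}^{\infty} \big|\mathbf{1}\{x > \theta\} - \mathbf{1}\{y > \theta\}\big| \, |y - \theta| \, d\theta ,
$$

so the Lebesgue measure $d\theta$ weights every threshold equally; threshold weighting replaces $d\theta$ with $w(\theta)\,d\theta$ for a chosen weight function $w$.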

@rob-taggart (Collaborator) commented

@nicholasloveday, one of @tennlee's comments about weights makes me wonder:

  1. whether there is a weights argument in the threshold weighted scores, which applies specified weights to each of the individual scores prior to calculating the mean across some dimensions; and
  2. how this kind of weight differs from threshold-weighting, and whether that is clear in the documentation.

I noticed in your docstrings that you talk about weights where one, etc., which is the second type of weight listed above, not the first.

To avoid confusion, you may need to talk consistently about threshold-weights (and explain that this is related to decision thresholds) to distinguish this from the other type of weighting, both in the tutorial and in the docstrings for your functions.
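
A sketch of how the two kinds of weighting might appear side by side in one call (argument names follow the PR discussion and may not match the merged API exactly):

```python
import numpy as np
import xarray as xr
from scores.continuous import tw_squared_error

rng = np.random.default_rng(1)
fcst = xr.DataArray(rng.normal(30, 10, (10, 5)), dims=["time", "station"])
obs = xr.DataArray(rng.normal(30, 10, (10, 5)), dims=["time", "station"])

# Per-point averaging weights (e.g. by station importance) -- distinct
# from the threshold-weighting applied inside the score itself.
station_weights = xr.DataArray(np.linspace(0.5, 1.5, 5), dims="station")

result = tw_squared_error(
    fcst,
    obs,
    interval_where_one=(40, np.inf),  # threshold-weights: emphasise >= 40
    weights=station_weights,          # ordinary weights: spatial averaging
)
```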

@tennlee (Collaborator) commented on Jul 31, 2024

>> I am making comments about the tutorial here, since the file is too large for inline comments through GitHub code review.
>> Regarding the tutorial, it notes that MSE can be interpreted as putting equal weight on all thresholds. Since it squares the error, it puts exponentially more weight on higher thresholds (since multiplicative biases at higher thresholds will consequentially have higher absolute errors which are then squared), so I think this isn't quite right to say. Also, the previous version noted the issue with MSE potentially obscuring certain things.
>
> @tennlee, the way you have put it is not quite correct.
>
> Take the following example. My decision threshold is 40C. If tomorrow's forecast exceeds 40C, I will tell the workers on my construction site not to come into work tomorrow. Now consider the case where the forecast for tomorrow was 20C and the observation was 10C. That is a large error, and the squared error is even larger (100C^2). However, I still made the correct decision on the basis of the forecast to not close the construction site. So the size of the penalty can be independent of whether the forecast was good enough for people whose decision thresholds are 40C or higher. In this case, threshold-weighted squared error actually gives a penalty of 0.
>
> What Nick says is correct. MSE can be interpreted as putting equal weight on all thresholds. The Ehm et al. (2016) paper explains why. Happy to talk about this in another forum rather than a GitHub comment.

Thanks, I've followed up offline.

My comments aren't about what's right and wrong, but about whether things are clear when introduced to a new reader with a general mathematical understanding of errors but without prior knowledge of decision thresholds, i.e. to help people avoid confusion about the exact disagreement being expressed in the comment.

@tennlee (Collaborator) commented on Jul 31, 2024

> @nicholasloveday, one of @tennlee's comments about weights makes me wonder:
>
> 1. whether there is a weights argument in the threshold weighted scores, which applies specified weights to each of the individual scores prior to calculating the mean across some dimensions; and
>
> 2. how this kind of weight differs from threshold-weighting, and whether that is clear in the documentation.
>
> I noticed in your docstrings that you talk about weights where one, etc., which is the second type of weight listed above, not the first.
>
> To avoid confusion, you may need to talk consistently about threshold-weights (and explain that this is related to decision thresholds) to distinguish this from the other type of weighting, both in the tutorial and in the docstrings for your functions.

Yes, spot on.

@tennlee (Collaborator) commented on Jul 31, 2024

> Take the following example. My decision threshold is 40C. If tomorrow's forecast exceeds 40C, I will tell the workers on my construction site not to come into work tomorrow. Now consider the case where the forecast for tomorrow was 20C and the observation was 10C. That is a large error, and the squared error is even larger (100C^2). However, I still made the correct decision on the basis of the forecast to not close the construction site. So the size of the penalty can be independent of whether the forecast was good enough for people whose decision thresholds are 40C or higher. In this case, threshold-weighted squared error actually gives a penalty of 0.

This would be a nice example to include.
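
A sketch of that example in code, assuming the tw_squared_error name and interval_where_one parameter from this PR:

```python
import numpy as np
import xarray as xr
from scores.continuous import tw_squared_error

fcst = xr.DataArray([20.0], dims="case")
obs = xr.DataArray([10.0], dims="case")

# All threshold weight sits at or above 40C; a 20C forecast against a
# 10C observation never crosses a weighted decision threshold, so the
# penalty is 0 even though the unweighted squared error is 100.
result = tw_squared_error(fcst, obs, interval_where_one=(40, np.inf))
assert float(result) == 0.0
```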

@nicholasloveday (Collaborator, Author) commented

> The trapezoidal weighting example given in the paragraph "Trapezoidal Weighting" isn't the same as the one presented in code in the example. Consider adding a plot to the "Trapezoidal Weighting" section to show what is described in the text, or consider adjusting the text description to match the later code and say "see below for an example".

Added a text description in the trapezoidal weighting paragraph. I added rather than replaced, as it is important to show how trapezoidal weights can be on either one side or both sides of the rectangular weights.

>> Take the following example. My decision threshold is 40C. If tomorrow's forecast exceeds 40C, I will tell the workers on my construction site not to come into work tomorrow. Now consider the case where the forecast for tomorrow was 20C and the observation was 10C. That is a large error, and the squared error is even larger (100C^2). However, I still made the correct decision on the basis of the forecast to not close the construction site. So the size of the penalty can be independent of whether the forecast was good enough for people whose decision thresholds are 40C or higher. In this case, threshold-weighted squared error actually gives a penalty of 0.
>
> This would be a nice example to include.

I've added a modified example that lends itself more to a range of decision thresholds than to a single decision threshold.

> I noticed in your docstrings that you talk about weights where one, etc., which is the second type of weight listed above, not the first.
>
> To avoid confusion, you may need to talk consistently about threshold-weights (and explain that this is related to decision thresholds) to distinguish this from the other type of weighting, both in the tutorial and in the docstrings for your functions.

I have updated the tutorial and docstrings to clarify this.

> With rectangular weighting, it's unclear if multiple segments/intervals could be supplied, e.g. the user might want to weight anything <= -3 and also anything >= 3 as "1", and "0" between -3 and positive 3. I'm assuming this is currently not possible, but the tutorial could do a better job of explaining what an "interval" is here.
>
> It might be interesting to add an example for a variable which has both positive and negative values, either where the interval of interest is around the mean, or where it's on the extremes. It's not clear to me how this can be used on something like temperature, where very hot and very cold extremes are of similar (although potentially non-equal) interest. I really like the examples that are currently included, and I'm a bit worried about overcomplicating the tutorial, but I want to make sure this approach is suitable for that situation; an example would both put the implementation to the test and demonstrate appropriate practice.
>
> Examples could include human comfort ranges for temperature or humidity, agriculture (where both heat and frost/ice could be problems), or operating equipment like trains, where again there would be a temperature range.

Added an example in the tutorial. Also added an additional thing to try in the "things to try next" section.

@tennlee changed the title from "(ready for review) 608 add threshold weighted scores" to "608 add threshold weighted scores" on Aug 2, 2024
@tennlee merged commit 3f7f7a9 into nci:develop on Aug 2, 2024
10 checks passed
@tennlee changed the title from "608 add threshold weighted scores" to "Add threshold weighted scores" on Aug 2, 2024