change the subsampling method #38
Conversation
Change the subsampling code to use the pymbar module to get the subsampled indexes; the original method uses the slicing function in pandas, which requires the slicing step to be an integer. In theory I think we don't want to force the subsampling step to be an integer; I will open an issue to discuss this.
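For context, a tiny sketch of the pandas constraint mentioned above (toy series; the failing line is commented out because a slice step must be an integer):

import numpy as np
import pandas as pd

series = pd.Series(np.arange(10.0))
print(series[::2])   # works: integer step
# series[::1.5]      # fails: the slice step must be an integer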
Codecov Report
@@            Coverage Diff             @@
##           master      #38      +/-   ##
==========================================
- Coverage   97.59%   97.58%   -0.02%
==========================================
  Files           9        9
  Lines         374      372       -2
  Branches       59       59
==========================================
- Hits          365      363       -2
  Misses          4        4
  Partials        5        5
Continue to review full report at Codecov.
I just had a quick look; I don't really understand the changes well enough yet.
@@ -4,6 +4,7 @@
 import numpy as np
 from pymbar.timeseries import statisticalInefficiency
 from pymbar.timeseries import detectEquilibration
+from pymbar.timeseries import subsampleCorrelatedData
Import together (combine all three lines; no idea why they were separate):

from pymbar.timeseries import statisticalInefficiency, detectEquilibration, subsampleCorrelatedData
@orbeckst yes, changed to one line
df = df.loc[series.index]

# use the subsampleCorrelatedData function to get the subsample index
indices = subsampleCorrelatedData(series, g=statinef)
Can you briefly explain what subsampleCorrelatedData does (and maybe link to the code in their repo)?
@orbeckst yes, the link to the code is https://github.com/choderalab/pymbar/blob/ac648e63787592a251992a764876713a64f17455/pymbar/timeseries.py#L632; it basically subsamples the time series (series) based on the given g (statistical inefficiency).
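For illustration, a minimal sketch of the call (the toy series and g value are made up; the function returns integer positions of approximately independent samples):

import numpy as np
import pandas as pd
from pymbar.timeseries import subsampleCorrelatedData

# toy correlated time series, for illustration only
series = pd.Series(np.random.randn(1000).cumsum())

statinef = 1.5  # statistical inefficiency g; often fractional
indices = subsampleCorrelatedData(series, g=statinef)
subsampled = series.iloc[indices]  # map integer positions back to the pandas index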
I don't quite see the point of this change. From what I can tell the subsampleCorrelatedData function deals in integer-spaced samples, too. @shuail can you explain how this approach really differs from the one we already have here?
@dotsdl yes, the code you refer to is an optional option, conservative, and the default for that option is False. I didn't notice that option before. Looking at the details, I think the implementation is also different from what alchemlyb currently has.

The topic here is basically all about the precision and distinguishability of the subsampling. For a given fractional g like 1.4, the current alchemlyb code rounds to the nearest integer with int(np.rint(1.4)) = 1, the conservative option rounds up to 2 with math.ceil(1.4), and the non-conservative option uses 1.4 directly. For a time series with 1,000 data points, current alchemlyb, conservative, and non-conservative will keep roughly 1,000, 500, and 714 data points respectively. Since the energy depends on the subsampling, I think we should be careful about how we round g, and I suggest using the non-conservative option here to make all datasets as distinguishable as possible.
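A small sketch of the three rounding behaviors compared above (illustrative values; math.ceil stands in for what the conservative option does, not pymbar's actual code):

import math
import numpy as np

g = 1.4    # fractional statistical inefficiency (illustrative)
n = 1000   # length of the time series

step_rint = int(np.rint(g))    # current alchemlyb rounding -> 1
step_ceil = int(math.ceil(g))  # conservative rounding up -> 2

print(n // step_rint)  # 1000 points kept
print(n // step_ceil)  # 500 points kept
print(int(n / g))      # ~714 points kept with the fractional g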
Add explicit keywords for subsampleCorrelatedData(..., fast=False, conservative=False, verbose=False) (especially conservative=False) to make sure that the behavior in alchemlyb won't change even if pymbar changes its defaults.
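In context the call would then look something like this (a sketch, reusing series and statinef from the diff above):

# explicit keywords pin the current behavior against upstream default changes
indices = subsampleCorrelatedData(series, g=statinef,
                                  fast=False, conservative=False, verbose=False)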
I agree with #38 (comment) – until we come up with anything better inside alchemlyb I would also go with the pymbar version.
@orbeckst yes, sounds good to me. I will add the default options explicitly.
indices = subsampleCorrelatedData(series, g=statinef)
picked_time_index = []
# pick the time index for the pandas dataframe based on the python index from subsample
for s_index, s_index_pair in enumerate(series.index):
This looks clunky and potentially slow (although that might not matter much). Isn't there a direct way to do this without the for loop? (I don't really know what the data look like to say more here.)
Yea, let me check if there is a better way to rewrite this
Yes, this will be very slow. You generally want to avoid looping through the rows in a DataFrame or Series. You can directly extract all values for an integer-based index with series.iloc[indices] instead.
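A quick sketch of the difference (toy data; the loop variant mirrors the pattern in the diff above):

import numpy as np
import pandas as pd

# toy series with a time-valued index (illustrative)
series = pd.Series(np.random.randn(100), index=np.arange(100) * 0.5)
indices = [0, 2, 7, 11]  # integer positions, e.g. from subsampleCorrelatedData

# slow: loop over rows to translate positions into index labels
picked_time_index = [t for pos, t in enumerate(series.index) if pos in indices]
slow = series.loc[picked_time_index]

# fast: positional indexing in one vectorized call
fast = series.iloc[indices]

assert (slow == fast).all()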
@dotsdl yes, iloc[indices] works. Thx!
Added some comments to address.
See minor comments; otherwise I agree with your rationale for changing.
Please also update the docs and explicitly explain that we are using pymbar.timeseries.subsampleCorrelatedData(..., fast=False, conservative=False) and say that this can lead to non-uniform sampling of the time series. Thus it is no longer safe to assume that frames are equally spaced in time.
Do we have tests that distinguish old from new behavior?
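To illustrate the non-uniform spacing, a toy reconstruction of the idea (not pymbar's exact index formula):

import numpy as np

g = 1.5   # fractional statistical inefficiency
n = 10    # short series for readability

# fractional stride snapped to integer frame positions
indices = np.unique(np.rint(np.arange(0, n, g)).astype(int))
print(indices)           # [0 2 3 4 6 8 9]
print(np.diff(indices))  # [2 1 1 2 2 1] -> spacing is not uniform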
Docs and tests.
Otherwise I agree with the change on scientific grounds.
df = df.loc[series.index]

# use the subsampleCorrelatedData function to get the subsample index
Update the docs: make clear that intervals may not be uniform any more.
yes, will add docs
Have we got tests that test fractional g?
There is a test_preprocessing.py file, and inside there is a test_subsampling function (def test_subsampling(self, data, size):), but it does not check the fractional g value or the actual size of the subsampled data. Also, should we define a slicer function inside this class? Maybe a bug here.
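A hypothetical test along these lines (names and tolerances are illustrative, not the actual test_preprocessing.py code):

import numpy as np
import pandas as pd
from pymbar.timeseries import subsampleCorrelatedData

def test_fractional_g_subsample_size():
    n = 1000
    series = pd.Series(np.random.randn(n))
    g = 1.5  # fractional statistical inefficiency
    indices = subsampleCorrelatedData(series, g=g, conservative=False)
    # non-conservative subsampling should keep roughly n/g points
    assert 0.9 * n / g <= len(indices) <= 1.1 * n / g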
Btw, @shuail this PR appears to come from your master branch. I recommend that PRs get their own branches.
@orbeckst thanks for the comments. Yes, I will clean up the code (add more docs, make the options explicit) and create another PR from a branch other than master.
So just to chime in here -- we're talking about a fairly significant change on the basis of how to round g.

Let's think for a minute about the practical difference between using different values of g when the data are badly correlated. In that case -- as we approach the limit of problematic data -- what's our biggest goal? Is it being able to squeeze every last bit of POSSIBLE value out of the data we have, even if that results in a false appearance of precision when in fact our results are garbage? Or is it being able to correctly recognize that in fact we don't have enough data? I'd argue that the latter goal is more important: it's worse to get an "apparently reasonable" answer that is wrong than to get a "clearly noisy/useless" answer. Thus, in my view, we should err in the direction of conservatism (at least by default) and round g up. In my previous experience, though, statistical uncertainties tend to always be underestimated anyway, so rounding up seems like a relatively safe thing to do if we need to round. So in my mind the decision is between:

- rounding g up to the nearest integer (conservative; uniform frame spacing), or
- using the fractional g directly (keeps more data, but non-uniform frame spacing).
One of our overall goals is to allow easy coupling with things like trajectory analysis, which may be an argument in favor of the simpler approach (round g upwards to the nearest integer), since having non-uniform frame spacing is going to complicate the analysis a lot.
@davidlmobley In our practice, supposing g is 1.4 for a 1,000-point series, the current rounding keeps essentially all 1,000 points and makes the data appear to contain more independent samples than it really has.

For coupling the energy vs trajectory analysis, I think the pacing of writing out energy vs trajectory frames is arbitrary and will depend on the user's input; even uniform frame spacing of the subsampling would not benefit much in this case. So I would rather use the unsubsampled data for the energy vs trajectory analysis.
Good points. The current code rounds to the nearest integer, so it can create the appearance of more independent samples than there really are. Perhaps use the conservative approach as default and add an option for users to choose the non-uniform sampling (and make sure that the frame indices for slicing the original trajectory remain available downstream).
To summarize the discussion so far:

- the current alchemlyb code rounds g to the nearest integer, which can overstate the number of independent samples;
- pymbar's subsampleCorrelatedData with conservative=True rounds g up: safer, and the frames stay uniformly spaced;
- conservative=False uses the fractional g: more data points are kept, but the frame spacing becomes non-uniform.

I suggest that we

- switch to pymbar.timeseries.subsampleCorrelatedData() with all keywords set explicitly,
- expose the conservative choice as an option to the user, and
- update the docs and add tests for both behaviors.

The main question that would remain is what should the default be? I am tending towards conservative=True.
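A sketch of what such an interface could look like (a hypothetical signature; the eventual alchemlyb function may differ):

from pymbar.timeseries import statisticalInefficiency, subsampleCorrelatedData

def statistical_inefficiency(df, series, conservative=True):
    """Subsample the rows of df according to the statistical inefficiency of series.

    conservative=True rounds g up (uniform spacing, fewer points);
    conservative=False uses the fractional g (more points, non-uniform spacing).
    """
    statinef = statisticalInefficiency(series)
    # explicit keywords so behavior stays fixed even if pymbar changes defaults
    indices = subsampleCorrelatedData(series, g=statinef,
                                      fast=False, conservative=conservative,
                                      verbose=False)
    return df.iloc[indices]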
I think I'm on board with all of this and also lean towards having the default be conservative=True.
I think I'm fine with @orbeckst's suggested procedure. For the default option, I am also fine with the rounding approach since that's close to the original code and philosophy. I could change the option for my own use anyway.
Good, so we have a consensus.
Can you please modify your PR accordingly?
We will also need at least two tests that check that the two approaches work.
Yes, I will work on the code and the tests. Once I've finished that, I will create another PR from a separate branch other than master.