-
Notifications
You must be signed in to change notification settings - Fork 216
Big improvement to robustness of CIME's XML handling #2932
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Change list: * Propagate Case read_only to its XML objects * Cache all files, but use file mod-time to detect changes. * Try to pass Case, not caseroot, to run_sub_or_cmd * Add a CIME performance test * Add tests for CIME XML handling
@@ -248,7 +248,7 @@ def _do_external(script_name, caseroot, rundir, lid, prefix): | |||
filename = "{}.external.log.{}".format(prefix, lid) | |||
outfile = os.path.join(rundir, filename) | |||
append_status("Starting script {}".format(script_name), "CaseStatus") | |||
run_sub_or_cmd(script_name, [caseroot], (os.path.basename(script_name).split('.',1))[0], [caseroot], logfile=outfile) | |||
run_sub_or_cmd(script_name, [caseroot], (os.path.basename(script_name).split('.',1))[0], [caseroot], logfile=outfile) # For sub, use case? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, why not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd have to track down all the pre/post run scripts that people are using and change them to accept Case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't it just call them as externals until the interface was updated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean as shell commands? That would depend on if they are python programs or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It actually doesn't matter - if it's python but the interface is wrong it will run as a shell command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An argument having the wrong type would not trigger a SyntaxError or AttributeError unfortunately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure - thought that I had run this test before. But maybe that was a different number of arguments and not an argument with a different type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty sure. Different number of arguments would be an AttributeError. But, since python is dynamically typed, it will be happy to pass along an argument of a type that the receiving function is not expecting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, just looking at this now.
Can you explain the purpose of this comment? Is this a reminder of something that needs to be changed in the future? If so, adding a little more to this comment so it's understandable by others would help... also, consider adding a TODO in the comment to call it out as such.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we want to do it or not, just something to think about. It could be hard to track down all the scripts users are using for pre/post processing.
@@ -92,7 +92,7 @@ def check_lockedfile(self, filebase): | |||
"case.build --clean-all and rebuilding") | |||
else: | |||
self.set_value("BUILD_STATUS", 1) | |||
f2obj.set_value("BUILD_COMPLETE", False) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an example of an unflushed write error. We are still vulnerable to this kind of problem.
I see that you changed all of the cime buildlib files, removing 'with Case'. Will we also need to change active components similarly? In CESM, I see:
Should all of these be changed? If so, is that a requirement for compatibility with this PR, or just something that we could do whenever we get around to it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few inline comments.
I focused my review on generic_xml.py, and didn't think carefully about the changes sprinkled throughout the scripts, because I didn't really understand them. So I'll trust the two of you on the rest of those minor changes.
scripts/lib/CIME/XML/generic_xml.py
Outdated
@@ -91,7 +104,7 @@ def read(self, infile, schema=None): | |||
|
|||
logger.debug("File version is {}".format(str(self.get_version()))) | |||
|
|||
self._FILEMAP[infile] = (self.tree, self.root) | |||
self._FILEMAP[infile] = (self.tree, self.root, os.path.getmtime(infile)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like there's a possible race condition due to the timing of when you call getmtime to set the time of the cached data. Consider this scenario:
-
Starting point: the mtime of the given file is 100.
-
The current process needs to read from file. It opens the file and reads it.
-
Meanwhile, another process writes to this file, resulting in an updated mtime (say, 110).
-
Then the current process calls getmtime, getting the value 110. At this point, the current process would think it has data from mtime=110, whereas in fact it just has the data that were current as of mtime=100.
-
Later, another read is done (without any more writes happening). It would use the cached data (because it incorrectly thinks it has the data from mtime=110), whereas in fact it should reread the file.
I think that moving the call to getmtime to before the file_open would fix this problem. You could have the opposite problem – the current process thinks the data are out-of-date when in fact they're up-to-date – but that problem is less severe (just leading to a loss of efficiency, not incorrect data).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@billsacks you are exactly right and this is the point I was trying to make in the discussion ticket that we will never achieve 100% robustness unless we access entry values through some kind of database system.
The reason my approach works (I think) is that I've made the simplifying assumption that CIME will never be used in the way you described. We've already discussed that the bulk of CIME is not thread safe, there's far too much shared state. Due to the situation with the XML files, it is also not safe to run multiple CIME processes in the same case concurrently. Note that all the CIME subprocess calls are synchronous calls (the parent spawns the child and then waits for it to finish before doing anything else), so no true concurrency is happening. The subprocesses can modify XML and, as long as the parent refreshes the cache when the subprocess is done, everything is OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, fair enough - thanks for the explanation.
timestamp_cache = self._FILEMAP[self.filename][2] | ||
if timestamp_cache != 0.0: | ||
timestamp_file = os.path.getmtime(self.filename) | ||
expect(timestamp_file == timestamp_cache, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't thought about how, if at all, this error-check interacts with my suggestions about changing where mtime is called.
scripts/lib/CIME/XML/generic_xml.py
Outdated
@@ -310,6 +329,9 @@ def write(self, outfile=None, force_write=False): | |||
else: | |||
with open(outfile,'w') as xmlout: | |||
xmlout.write(xmlstr) | |||
|
|||
self._FILEMAP[self.filename] = (self.tree, self.root, os.path.getmtime(self.filename)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory, it seems like this could be vulnerable to the same race condition described above, but I don't know how to solve this one... and hopefully the system would prevent two processes from trying to write to the same file at the same time anyway, so the risk here seems small enough.
Yes, any buildlib that is implemented in python will need to change. |
This discussion should factor into the recurring discussion of changing data file formats - would it be better to implement something more like an sql database for our data file interface? |
So what will happen until that change is made? Will the build fail, for example? |
And: I forget what our latest thinking is on versioning (are we trying to use some flavor of semantic versioning?), but it seems like this change would then warrant some big bump in version number to indicate backwards-incompatibility. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I approve now - thanks.
However, I'd still like to understand the backwards incompatibility issue a bit more, as raised in the comments I just made.
Thanks for your work on this!
Yes, it will crash with a python type error. |
That's a good question, the answer will depend on the direction that CIME goes in the future. I think the best possible outcome is that all the components' CIME-related scripts (buildnml, buildlib) are in Python and so we always stay within a single process. Most of CIME is inherently sequential, each phase cannot start until the previous phase is done, so thread concurrency is mostly not needed. The one exception is component builds, which happen in parallel. We will have to set some constraints on what a component buildlib script can do with its Case object; for example, by passing them a read-only Case object. If we can achieve all this, it should be fine to stick with the current system which is must simpler than using a "real" database like SQL. |
Thanks. Given this, I'd ask:
|
…pgrade * origin/master: (71 commits) remove debug print statement fix duplicated code remove duplicated code fix PFS generate issue more merge issues correct indentation make method static increase test time for CMOM fix pylint 3 issue fix merge issue fix merge issues fix pylint 2 error merge nuopc-cmeps to cime master reduce memleak reduce stdout another avgdt refinement update data model interface correct avgdt calculation more mpicom cleanup removing mpicom from nuopc code ...
The new performance shows that the gathering of file mod times is not impacting CIME performance. |
@jgfouca I'm still trying to figure out what, if anything, we need to change in the active components in CESM. Looking more closely at their buildlib scripts, I see that none of them actually define a buildlib function. Does that imply that we're fine for now? (I'm wondering if this means that all of the active components' buildlibs are being called as external programs rather than as python functions?) |
Yes - that is what it means. |
Change list:
Test suite: scripts_regression_tests
Test baseline:
Test namelist changes:
Test status: bit for bit
Fixes #2161
Fixes #2850
User interface changes?: GenericXML will now throw an error if XML files are changed behind its back without a re-read/invalidate. It is now expected that all python buildlib calls take a Case object, not caseroot.
Instructions for updating python buildlib scripts:
Update gh-pages html (Y/N)?:
Code review: @jedwards4b @billsacks