8000 Big improvement to robustness of CIME's XML handling by jgfouca · Pull Request #2932 · ESMCI/cime · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Big improvement to robustness of CIME's XML handling #2932

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Dec 7, 2018

Conversation

jgfouca
Copy link
Contributor
@jgfouca jgfouca commented Dec 3, 2018

Change list:

  • Propagate Case read_only to its XML objects
  • Cache all files, but use file mod-time to detect changes.
  • Try to pass Case, not caseroot, to run_sub_or_cmd
  • Add a CIME performance test
  • Add tests for CIME XML handling

Test suite: scripts_regression_tests
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes #2161
Fixes #2850

User interface changes?: GenericXML will now throw an error if XML files are changed behind its back without a re-read/invalidate. It is now expected that all python buildlib calls take a Case object, not caseroot.

Instructions for updating python buildlib scripts:

  • The main function should open a Case object (preferably, read-only) and pass that object to the buildlib function, not the caseroot.
  • The buildlib function should take a case object, not a caseroot, and should therefore not have to open a Case object.

Update gh-pages html (Y/N)?:

Code review: @jedwards4b @billsacks

Change list:
* Propagate Case read_only to its XML objects
* Cache all files, but use file mod-time to detect changes.
* Try to pass Case, not caseroot, to run_sub_or_cmd
* Add a CIME performance test
* Add tests for CIME XML handling
@@ -248,7 +248,7 @@ def _do_external(script_name, caseroot, rundir, lid, prefix):
filename = "{}.external.log.{}".format(prefix, lid)
outfile = os.path.join(rundir, filename)
append_status("Starting script {}".format(script_name), "CaseStatus")
run_sub_or_cmd(script_name, [caseroot], (os.path.basename(script_name).split('.',1))[0], [caseroot], logfile=outfile)
run_sub_or_cmd(script_name, [caseroot], (os.path.basename(script_name).split('.',1))[0], [caseroot], logfile=outfile) # For sub, use case?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, why not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd have to track down all the pre/post run scripts that people are using and change them to accept Case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it just call them as externals until the interface was updated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean as shell commands? That would depend on if they are python programs or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually doesn't matter - if it's python but the interface is wrong it will run as a shell command.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An argument having the wrong type would not trigger a SyntaxError or AttributeError unfortunately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure - thought that I had run this test before. But maybe that was a different number of arguments and not an argument with a different type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure. Different number of arguments would be an AttributeError. But, since python is dynamically typed, it will be happy to pass along an argument of a type that the receiving function is not expecting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, just looking at this now.

Can you explain the purpose of this comment? Is this a reminder of something that needs to be changed in the future? If so, adding a little more to this comment so it's understandable by others would help... also, consider adding a TODO in the comment to call it out as such.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we want to do it or not, just something to think about. It could be hard to track down all the scripts users are using for pre/post processing.

@@ -92,7 +92,7 @@ def check_lockedfile(self, filebase):
"case.build --clean-all and rebuilding")
else:
self.set_value("BUILD_STATUS", 1)
f2obj.set_value("BUILD_COMPLETE", False)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example of an unflushed write error. We are still vulnerable to this kind of problem.

@billsacks
Copy link
Member

I see that you changed all of the cime buildlib files, removing 'with Case'. Will we also need to change active components similarly? In CESM, I see:

[roo2:~/cesm_code/cesm/components]$ rgrep 'with Case' .
./pop/cime_config/buildnml:    with Case(caseroot) as case:
./pop/cime_config/phys_cycle_prerun:    with Case(caseroot) as case:
./pop/cime_config/phys_cycle_postrun:    with Case(caseroot) as case:
./pop/cime_config/buildcpp:    with Case(caseroot) as case:
./pop/cime_config/buildlib:    with Case(caseroot) as case:
./ww3/cime_config/buildnml:    with Case(caseroot) as case:
./cice/cime_config/buildnml:    with Case(caseroot) as case:
./cice/cime_config/buildcpp:    with Case(caseroot) as case:
./cice/cime_config/buildlib:    with Case(caseroot) as case:
./cam/cime_config/buildnml:    with Case(caseroot) as case:
./cam/cime_config/buildcpp:    with Case(caseroot) as case:
./cam/cime_config/buildlib:    with Case(caseroot) as case:
./mosart/cime_config/buildnml:    with Case(caseroot) as case:
./mosart/cime_config/buildlib:    with Case(caseroot) as case:
./rtm/cime_config/buildnml:    with Case(caseroot) as case:
./rtm/cime_config/buildlib:    with Case(caseroot) as case:
./clm/cime_config/buildnml:    with Case(caseroot) as case:
./clm/cime_config/buildlib:    with Case(caseroot) as case:

Should all of these be changed? If so, is that a requirement for compatibility with this PR, or just something that we could do whenever we get around to it?

Copy link
Member
@billsacks billsacks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few inline comments.

I focused my review on generic_xml.py, and didn't think carefully about the changes sprinkled throughout the scripts, because I didn't really understand them. So I'll trust the two of you on the rest of those minor changes.

@@ -91,7 +104,7 @@ def read(self, infile, schema=None):

logger.debug("File version is {}".format(str(self.get_version())))

self._FILEMAP[infile] = (self.tree, self.root)
self._FILEMAP[infile] = (self.tree, self.root, os.path.getmtime(infile))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like there's a possible race condition due to the timing of when you call getmtime to set the time of the cached data. Consider this scenario:

  • Starting point: the mtime of the given file is 100.

  • The current process needs to read from file. It opens the file and reads it.

  • Meanwhile, another process writes to this file, resulting in an updated mtime (say, 110).

  • Then the current process calls getmtime, getting the value 110. At this point, the current process would think it has data from mtime=110, whereas in fact it just has the data that were current as of mtime=100.

  • Later, another read is done (without any more writes happening). It would use the cached data (because it incorrectly thinks it has the data from mtime=110), whereas in fact it should reread the file.

I think that moving the call to getmtime to before the file_open would fix this problem. You could have the opposite problem – the current process thinks the data are out-of-date when in fact they're up-to-date – but that problem is less severe (just leading to a loss of efficiency, not incorrect data).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@billsacks you are exactly right and this is the point I was trying to make in the discussion ticket that we will never achieve 100% robustness unless we access entry values through some kind of database system.

The reason my approach works (I think) is that I've made the simplifying assumption that CIME will never be used in the way you described. We've already discussed that the bulk of CIME is not thread safe, there's far too much shared state. Due to the situation with the XML files, it is also not safe to run multiple CIME processes in the same case concurrently. Note that all the CIME subprocess calls are synchronous calls (the parent spawns the child and then waits for it to finish before doing anything else), so no true concurrency is happening. The subprocesses can modify XML and, as long as the parent refreshes the cache when the subprocess is done, everything is OK.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, fair enough - thanks for the explanation.

timestamp_cache = self._FILEMAP[self.filename][2]
if timestamp_cache != 0.0:
timestamp_file = os.path.getmtime(self.filename)
expect(timestamp_file == timestamp_cache,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't thought about how, if at all, this error-check interacts with my suggestions about changing where mtime is called.

@@ -310,6 +329,9 @@ def write(self, outfile=None, force_write=False):
else:
with open(outfile,'w') as xmlout:
xmlout.write(xmlstr)

self._FILEMAP[self.filename] = (self.tree, self.root, os.path.getmtime(self.filename))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, it seems like this could be vulnerable to the same race condition described above, but I don't know how to solve this one... and hopefully the system would prevent two processes from trying to write to the same file at the same time anyway, so the risk here seems small enough.

@jgfouca
Copy link
Contributor Author
jgfouca commented Dec 6, 2018

I see that you changed all of the cime buildlib files, removing 'with Case'. Will we also need to change active components similarly?

Yes, any buildlib that is implemented in python will need to change.

@jedwards4b
Copy link
Contributor

This discussion should factor into the recurring discussion of changing data file formats - would it be better to implement something more like an sql database for our data file interface?

@billsacks
Copy link
Member

Yes, any buildlib that is implemented in python will need to change.

So what will happen until that change is made? Will the build fail, for example?

@billsacks
Copy link
Member

And: I forget what our latest thinking is on versioning (are we trying to use some flavor of semantic versioning?), but it seems like this change would then warrant some big bump in version number to indicate backwards-incompatibility.

Copy link
Member
@billsacks billsacks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I approve now - thanks.

However, I'd still like to understand the backwards incompatibility issue a bit more, as raised in the comments I just made.

Thanks for your work on this!

@jgfouca
Copy link
Contributor Author
jgfouca commented Dec 7, 2018

This discussion should factor into the recurring discussion of changing data file formats - would it be better to implement something more like an sql database for our data file interface?

Yes, it will crash with a python type error.

@jgfouca
Copy link
Contributor Author
jgfouca commented Dec 7, 2018

This discussion should factor into the recurring discussion of changing data file formats - would it be better to implement something more like an sql database for our data file interface?

That's a good question, the answer will depend on the direction that CIME goes in the future. I think the best possible outcome is that all the components' CIME-related scripts (buildnml, buildlib) are in Python and so we always stay within a single process. Most of CIME is inherently sequential, each phase cannot start until the previous phase is done, so thread concurrency is mostly not needed. The one exception is component builds, which happen in parallel. We will have to set some constraints on what a component buildlib script can do with its Case object; for example, by passing them a read-only Case object.

If we can achieve all this, it should be fine to stick with the current system which is must simpler than using a "real" database like SQL.

@billsacks
Copy link
Member

Yes, it will crash with a python type error.

Thanks. Given this, I'd ask:

  1. I see that you updated the top-level comment to point this out a bit more, but can you please be even more explicit about this - i.e., basically giving a recipe for what components need to do in order to (a) find the places that need to be changed and (b) change them appropriately?

  2. We should bump the version number when this is brought to master to give some indication of backwards incompatibility.

…pgrade

* origin/master: (71 commits)
  remove debug print statement
  fix duplicated code
  remove duplicated code
  fix PFS generate issue
  more merge issues
  correct indentation
  make method static
  increase test time for CMOM
  fix pylint 3 issue
  fix merge issue
  fix merge issues
  fix pylint 2 error
  merge nuopc-cmeps to cime master
  reduce memleak
  reduce stdout
  another avgdt refinement
  update data model interface
  correct avgdt calculation
  more mpicom cleanup
  removing mpicom from nuopc code
  ...
@jgfouca
Copy link
Contributor Author
jgfouca commented Dec 7, 2018

The new performance shows that the gathering of file mod times is not impacting CIME performance.

@jgfouca jgfouca merged commit d712b83 into master Dec 7, 2018
@jgfouca jgfouca deleted the jgfouca/xml_cache_upgrade branch December 7, 2018 23:28
@billsacks
Copy link
Member

@jgfouca I'm still trying to figure out what, if anything, we need to change in the active components in CESM. Looking more closely at their buildlib scripts, I see that none of them actually define a buildlib function. Does that imply that we're fine for now? (I'm wondering if this means that all of the active components' buildlibs are being called as external programs rather than as python functions?)

@jedwards4b
Copy link
Contributor

Yes - that is what it means.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0