Big improvement to robustness of CIME's XML handling #2932

jgfouca · 2018-12-03T23:11:11Z

Change list:

Propagate Case read_only to its XML objects
Cache all files, but use file mod-time to detect changes.
Try to pass Case, not caseroot, to run_sub_or_cmd
Add a CIME performance test
Add tests for CIME XML handling

Test suite: scripts_regression_tests
Test baseline:
Test namelist changes:
Test status: bit for bit

User interface changes?: GenericXML will now throw an error if XML files are changed behind its back without a re-read/invalidate. It is now expected that all python buildlib calls take a Case object, not caseroot.

Instructions for updating python buildlib scripts:

The main function should open a Case object (preferably, read-only) and pass that object to the buildlib function, not the caseroot.
The buildlib function should take a case object, not a caseroot, and should therefore not have to open a Case object.

Update gh-pages html (Y/N)?:

Code review: @jedwards4b @billsacks

Change list: * Propagate Case read_only to its XML objects * Cache all files, but use file mod-time to detect changes. * Try to pass Case, not caseroot, to run_sub_or_cmd * Add a CIME performance test * Add tests for CIME XML handling

jedwards4b · 2018-12-03T23:24:42Z

scripts/lib/CIME/case/case_run.py

@@ -248,7 +248,7 @@ def _do_external(script_name, caseroot, rundir, lid, prefix):
    filename = "{}.external.log.{}".format(prefix, lid)
    outfile = os.path.join(rundir, filename)
    append_status("Starting script {}".format(script_name), "CaseStatus")
-    run_sub_or_cmd(script_name, [caseroot], (os.path.basename(script_name).split('.',1))[0], [caseroot], logfile=outfile)
+    run_sub_or_cmd(script_name, [caseroot], (os.path.basename(script_name).split('.',1))[0], [caseroot], logfile=outfile) # For sub, use case?


Yes, why not?

We'd have to track down all the pre/post run scripts that people are using and change them to accept Case.

wouldn't it just call them as externals until the interface was updated?

You mean as shell commands? That would depend on if they are python programs or not.

It actually doesn't matter - if it's python but the interface is wrong it will run as a shell command.

An argument having the wrong type would not trigger a SyntaxError or AttributeError unfortunately.

Are you sure - thought that I had run this test before. But maybe that was a different number of arguments and not an argument with a different type.

Pretty sure. Different number of arguments would be an AttributeError. But, since python is dynamically typed, it will be happy to pass along an argument of a type that the receiving function is not expecting.

Sorry, just looking at this now.

Can you explain the purpose of this comment? Is this a reminder of something that needs to be changed in the future? If so, adding a little more to this comment so it's understandable by others would help... also, consider adding a TODO in the comment to call it out as such.

I'm not sure we want to do it or not, just something to think about. It could be hard to track down all the scripts users are using for pre/post processing.

jgfouca · 2018-12-04T21:38:27Z

scripts/lib/CIME/case/check_lockedfiles.py

@@ -92,7 +92,7 @@ def check_lockedfile(self, filebase):
                                         "case.build --clean-all and rebuilding")
                    else:
                        self.set_value("BUILD_STATUS", 1)
-                    f2obj.set_value("BUILD_COMPLETE", False)
+


This is an example of an unflushed write error. We are still vulnerable to this kind of problem.

billsacks · 2018-12-06T21:51:08Z

I see that you changed all of the cime buildlib files, removing 'with Case'. Will we also need to change active components similarly? In CESM, I see:

[roo2:~/cesm_code/cesm/components]$ rgrep 'with Case' .
./pop/cime_config/buildnml:    with Case(caseroot) as case:
./pop/cime_config/phys_cycle_prerun:    with Case(caseroot) as case:
./pop/cime_config/phys_cycle_postrun:    with Case(caseroot) as case:
./pop/cime_config/buildcpp:    with Case(caseroot) as case:
./pop/cime_config/buildlib:    with Case(caseroot) as case:
./ww3/cime_config/buildnml:    with Case(caseroot) as case:
./cice/cime_config/buildnml:    with Case(caseroot) as case:
./cice/cime_config/buildcpp:    with Case(caseroot) as case:
./cice/cime_config/buildlib:    with Case(caseroot) as case:
./cam/cime_config/buildnml:    with Case(caseroot) as case:
./cam/cime_config/buildcpp:    with Case(caseroot) as case:
./cam/cime_config/buildlib:    with Case(caseroot) as case:
./mosart/cime_config/buildnml:    with Case(caseroot) as case:
./mosart/cime_config/buildlib:    with Case(caseroot) as case:
./rtm/cime_config/buildnml:    with Case(caseroot) as case:
./rtm/cime_config/buildlib:    with Case(caseroot) as case:
./clm/cime_config/buildnml:    with Case(caseroot) as case:
./clm/cime_config/buildlib:    with Case(caseroot) as case:

Should all of these be changed? If so, is that a requirement for compatibility with this PR, or just something that we could do whenever we get around to it?

billsacks

A few inline comments.

I focused my review on generic_xml.py, and didn't think carefully about the changes sprinkled throughout the scripts, because I didn't really understand them. So I'll trust the two of you on the rest of those minor changes.

scripts/lib/CIME/XML/generic_xml.py

billsacks · 2018-12-06T22:25:18Z

scripts/lib/CIME/XML/generic_xml.py

@@ -91,7 +104,7 @@ def read(self, infile, schema=None):

            logger.debug("File version is {}".format(str(self.get_version())))

-            self._FILEMAP[infile] = (self.tree, self.root)
+            self._FILEMAP[infile] = (self.tree, self.root, os.path.getmtime(infile))


It seems like there's a possible race condition due to the timing of when you call getmtime to set the time of the cached data. Consider this scenario:

Starting point: the mtime of the given file is 100.

The current process needs to read from file. It opens the file and reads it.

Meanwhile, another process writes to this file, resulting in an updated mtime (say, 110).

Then the current process calls getmtime, getting the value 110. At this point, the current process would think it has data from mtime=110, whereas in fact it just has the data that were current as of mtime=100.

Later, another read is done (without any more writes happening). It would use the cached data (because it incorrectly thinks it has the data from mtime=110), whereas in fact it should reread the file.

I think that moving the call to getmtime to before the file_open would fix this problem. You could have the opposite problem – the current process thinks the data are out-of-date when in fact they're up-to-date – but that problem is less severe (just leading to a loss of efficiency, not incorrect data).

@billsacks you are exactly right and this is the point I was trying to make in the discussion ticket that we will never achieve 100% robustness unless we access entry values through some kind of database system.

The reason my approach works (I think) is that I've made the simplifying assumption that CIME will never be used in the way you described. We've already discussed that the bulk of CIME is not thread safe, there's far too much shared state. Due to the situation with the XML files, it is also not safe to run multiple CIME processes in the same case concurrently. Note that all the CIME subprocess calls are synchronous calls (the parent spawns the child and then waits for it to finish before doing anything else), so no true concurrency is happening. The subprocesses can modify XML and, as long as the parent refreshes the cache when the subprocess is done, everything is OK.

Okay, fair enough - thanks for the explanation.

billsacks · 2018-12-06T22:25:53Z

scripts/lib/CIME/XML/generic_xml.py

+        timestamp_cache = self._FILEMAP[self.filename][2]
+        if timestamp_cache != 0.0:
+            timestamp_file  = os.path.getmtime(self.filename)
+            expect(timestamp_file == timestamp_cache,


I haven't thought about how, if at all, this error-check interacts with my suggestions about changing where mtime is called.

billsacks · 2018-12-06T22:28:39Z

scripts/lib/CIME/XML/generic_xml.py

@@ -310,6 +329,9 @@ def write(self, outfile=None, force_write=False):
        else:
            with open(outfile,'w') as xmlout:
                xmlout.write(xmlstr)
+
+        self._FILEMAP[self.filename] = (self.tree, self.root, os.path.getmtime(self.filename))


In theory, it seems like this could be vulnerable to the same race condition described above, but I don't know how to solve this one... and hopefully the system would prevent two processes from trying to write to the same file at the same time anyway, so the risk here seems small enough.

jgfouca · 2018-12-06T23:40:21Z

I see that you changed all of the cime buildlib files, removing 'with Case'. Will we also need to change active components similarly?

Yes, any buildlib that is implemented in python will need to change.

jedwards4b · 2018-12-06T23:59:52Z

This discussion should factor into the recurring discussion of changing data file formats - would it be better to implement something more like an sql database for our data file interface?

billsacks · 2018-12-07T16:03:55Z

Yes, any buildlib that is implemented in python will need to change.

So what will happen until that change is made? Will the build fail, for example?

billsacks · 2018-12-07T16:05:39Z

And: I forget what our latest thinking is on versioning (are we trying to use some flavor of semantic versioning?), but it seems like this change would then warrant some big bump in version number to indicate backwards-incompatibility.

billsacks

Okay, I approve now - thanks.

However, I'd still like to understand the backwards incompatibility issue a bit more, as raised in the comments I just made.

Thanks for your work on this!

jgfouca · 2018-12-07T20:04:50Z

This discussion should factor into the recurring discussion of changing data file formats - would it be better to implement something more like an sql database for our data file interface?

Yes, it will crash with a python type error.

jgfouca · 2018-12-07T20:14:49Z

This discussion should factor into the recurring discussion of changing data file formats - would it be better to implement something more like an sql database for our data file interface?

That's a good question, the answer will depend on the direction that CIME goes in the future. I think the best possible outcome is that all the components' CIME-related scripts (buildnml, buildlib) are in Python and so we always stay within a single process. Most of CIME is inherently sequential, each phase cannot start until the previous phase is done, so thread concurrency is mostly not needed. The one exception is component builds, which happen in parallel. We will have to set some constraints on what a component buildlib script can do with its Case object; for example, by passing them a read-only Case object.

If we can achieve all this, it should be fine to stick with the current system which is must simpler than using a "real" database like SQL.

billsacks · 2018-12-07T20:37:31Z

Yes, it will crash with a python type error.

Thanks. Given this, I'd ask:

I see that you updated the top-level comment to point this out a bit more, but can you please be even more explicit about this - i.e., basically giving a recipe for what components need to do in order to (a) find the places that need to be changed and (b) change them appropriately?
We should bump the version number when this is brought to master to give some indication of backwards incompatibility.

…pgrade * origin/master: (71 commits) remove debug print statement fix duplicated code remove duplicated code fix PFS generate issue more merge issues correct indentation make method static increase test time for CMOM fix pylint 3 issue fix merge issue fix merge issues fix pylint 2 error merge nuopc-cmeps to cime master reduce memleak reduce stdout another avgdt refinement update data model interface correct avgdt calculation more mpicom cleanup removing mpicom from nuopc code ...

jgfouca · 2018-12-07T23:28:29Z

The new performance shows that the gathering of file mod times is not impacting CIME performance.

billsacks · 2018-12-08T11:47:35Z

@jgfouca I'm still trying to figure out what, if anything, we need to change in the active components in CESM. Looking more closely at their buildlib scripts, I see that none of them actually define a buildlib function. Does that imply that we're fine for now? (I'm wondering if this means that all of the active components' buildlibs are being called as external programs rather than as python functions?)

jedwards4b · 2018-12-08T14:56:12Z

Yes - that is what it means.

jgfouca added ty: enhancement tp: user interface tp: CIMElib tp: script tools labels Dec 3, 2018

jgfouca self-assigned this Dec 3, 2018

jgfouca requested review from jedwards4b and billsacks December 3, 2018 23:11

jgfouca added the in progress label Dec 3, 2018

jedwards4b approved these changes Dec 3, 2018

View reviewed changes

Minor fixes

39a4bec

jgfouca commented Dec 4, 2018

View reviewed changes

We need to refresh case after applying user mods

f7ab0c3

billsacks requested changes Dec 6, 2018

View reviewed changes

Use named tuple for XML file cache entries

026b940

billsacks approved these changes Dec 7, 2018

View reviewed changes

jgfouca added 2 commits December 7, 2018 16:02

Use range instead of xrange

1f46546

jgfouca merged commit d712b83 into master Dec 7, 2018

jgfouca removed the in progress label Dec 7, 2018

jgfouca deleted the jgfouca/xml_cache_upgrade branch December 7, 2018 23:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Big improvement to robustness of CIME's XML handling #2932

Big improvement to robustness of CIME's XML handling #2932

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Big improvement to robustness of CIME's XML handling #2932

Big improvement to robustness of CIME's XML handling #2932

Uh oh!

Conversation

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!