Split workflows to facilitate CI restart in case of failing jobs #330

AntoineRondelet · 2020-12-16T17:16:43Z

dtebbs · 2020-12-17T09:49:01Z

I have no objection to splitting these up in principle. My concern about doing it purely by platform (macos vs ubuntu) is that it's not clear which file a specific job belongs to. For example, in order to find the client checks I have to know that we run those on ubuntu. As discussed in the past, it may turn out that testing the full pipeline is easier to do on macos.

AntoineRondelet · 2020-12-17T17:19:29Z

> I have no objection to splitting these up in principle. My concern about doing it purely by platform (macos vs ubuntu) is that it's not clear which file a specific job belongs to. For example, in order to find the client checks I have to know that we run those on ubuntu. As discussed in the past, it may turn out that testing the full pipeline is easier to do on macos.

This asymmetry between the checks run on each platform is only due to the poor state of our CI. Normally, if we claim support for a platform (as we do in the README: https://github.com/clearmatics/zeth#building-and-running-the-project), this must imply that we run all checks on this platform. We don't do that for now, but it will need to be done in the future (and the issues related to the solc compiler on macOS and grpc on Linux will need to be solved). I don't think that running all checks only on "the platform on which it is the easiest to do so" is what should be done in the long run. That's obviously what we do for now (because we don't have the bandwidth to do a polished CI...). So in the end state, on-push-macos and on-push-ubuntu (and the on-PR files) should be running the exact same tests.

While GA supports matrix configuration to do exactly that, the problem is that, AFAIK, GA does not currently support restart/stop operations on individual jobs (it only does so on workflows, see the discussion in #320), and this is truly annoying (to say the least). In the past weeks the macOS build has kept failing for weird reasons (python versions etc.), which forces us to re-run the whole set of jobs of our "big" workflow. If there was one thing I liked better about Travis, it was exactly this ability to restart failing jobs...

So this approach of splitting files is a trade-off between code duplication in the workflow config (we won't be using a matrix config for the platform to run the same jobs and factorize the config) and CI/job management. The current "asymmetric" situation between the sets of checks run on the two platforms will hopefully be temporary, until we have a way to solve this solc and grpc thing... I don't know if there is a better way of achieving this. I am also happy to close this PR and the associated ticket if we don't want to go down this path, but I need to confess that I find the regular issues with GA and the failing macOS build to be a real pain :)

dtebbs · 2020-12-17T17:52:34Z

> This asymmetry between the checks run on each platform is only due to the poor state of our CI. Normally, if we claim support for a platform (as we do in the README: https://github.com/clearmatics/zeth#building-and-running-the-project), this must imply that we run all checks on this platform. We don't do that for now, but it will need to be done in the future (and the issues related to the solc compiler on macOS and grpc on Linux will need to be solved). I don't think that running all checks only on "the platform on which it is the easiest to do so" is what should be done in the long run.

I'm not sure we need to run all checks on all platforms to claim support. For example, do we need to run a solidity or python linter on all platforms? Or even the clang-format check?

> So this approach of splitting files is a trade-off between code duplication in the workflow config (we won't be using a matrix config for the platform to run the same jobs and factorize the config) and CI/job management. The current "asymmetric" situation between the sets of checks run on the two platforms will hopefully be temporary, until we have a way to solve this solc and grpc thing... I don't know if there is a better way of achieving this. I am also happy to close this PR and the associated ticket if we don't want to go down this path, but I need to confess that I find the regular issues with GA and the failing macOS build to be a real pain :)

OK. It would definitely be better if we could re-run individual jobs. TBH I hadn't really considered it enough of a problem to duplicate the configs (it's mainly a waste) but if it's super annoying then let's address it one way or another.

How about we split the checks out into their own onpush-checks workflow, then? And rename the platform-specific ones onpush-build-linux and onpush-build-macos, or something similar? If we do want to run the checks on multiple platforms, it should be much easier to use the matrix in those cases. The point is that the top-level division of workflows would be related to the type, rather than the platform. Would that work?

(It seems we are not using the matrix to specify platforms yet anyway, so it's less of an issue to split these build jobs by platform for now)

AntoineRondelet · 2020-12-18T12:24:42Z

> I'm not sure we need to run all checks on all platforms to claim support. For example, do we need to run a solidity or python linter on all platforms? Or even the clang-format check?

Ah. No, I don't think the linters etc. necessarily need to run on all platforms to claim support (although that would be nice, to make sure that the tooling we use is supported on the platforms we target -- or at the very least to document potential compatibility issues well, so that devs avoid trouble when contributing -> it is very annoying if the CI fails on, for instance, a linting issue that you can't, as a dev, reproduce locally and fix with the adequate tool). My point is mostly that all tests (python, solidity, c++) and builds (DEBUG, RELEASE) should be successful on all supported platforms (some of these are currently spread across several jobs running on different architectures, so we don't test everything on all platforms).

> How about we split the checks out into their own onpush-checks workflow, then? And rename the platform-specific ones onpush-build-linux and onpush-build-macos, or something similar? If we do want to run the checks on multiple platforms, it should be much easier to use the matrix in those cases. The point is that the top-level division of workflows would be related to the type, rather than the platform. Would that work?
> (It seems we are not using the matrix to specify platforms yet anyway, so it's less of an issue to split these build jobs by platform for now)

Yes, using a matrix strategy would simplify the config. That being said, I don't see how viable this is in the long run. Restarting a full set of jobs for multi-platform support whenever one such job fails seems anything but scalable to me, but my knowledge of GA is pretty poor so I may be missing something, and I would be very happy to know about ways around that :)

> The point is that the top-level division of workflows would be related to the type, rather than the platform. Would that work?

So do I understand correctly that by "workflows would be related to the type" you suggest moving all checks like "linting checks" etc. to the (generic) on-push-checks.yml workflow, since these are "platform agnostic", and keeping the other "testing" jobs split per platform? If so, yes, that seems good to me.
(Like everything, we can try the "workflow splitting" to see how it goes. If this becomes a nightmare to maintain we can roll back!)

dtebbs · 2020-12-18T13:17:30Z

> So do I understand correctly that by "workflows would be related to the type" you suggest moving all checks like "linting checks" etc. to the (generic) on-push-checks.yml workflow, since these are "platform agnostic", and keeping the other "testing" jobs split per platform? If so, yes, that seems good to me.
> (Like everything, we can try the "workflow splitting" to see how it goes. If this becomes a nightmare to maintain we can roll back!)

So, what I'm suggesting is that we split into workflows of the form onpush-<type>-<further_divisions>.yml.
I.e. the top-level split is based on the category of job. Then, depending on the type we may want further splits.

For now, this would be:

onpush-checks.yml
onpush-build-ubuntu.yml
onpush-build-macos.yml
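As a rough illustration of this layout (not taken from the PR itself), one of the platform-specific files might look like the sketch below. The job name and the build command are placeholders, and `ubuntu-latest` is only an assumed runner label; the actual runner version and build steps are whatever the repository already uses.

```yaml
# onpush-build-ubuntu.yml
# Hypothetical sketch: job name, runner label and commands are placeholders.
name: onpush-build-ubuntu

on: push

jobs:
  build-linux:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build and test
        # Placeholder: the real platform-specific build/test commands go here.
        run: echo "build and test commands for ubuntu"
```

Since each file is its own workflow, a failure in the macOS file can be re-run from the GA UI without re-running the ubuntu build or the generic checks.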

checks could be separated into onpush-linter.yml and onpush-cppcheck.yml, or split up by language, or however you think makes sense. The point with this approach is that details such as whether we run on one or all platforms, whether we use a matrix or not to generate variations, whether we further split the jobs for some other reason (e.g. for convenience when rerunning), etc. are all details that depend on the job type and "implementation".

If GA later implements rerunning individual jobs, and our build commands converge across the platforms, the build part could be implemented with a matrix in a single onpush-build.yml. We wouldn't need to move any of the other checks around in files.
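For reference, a minimal sketch of what such a matrix-based onpush-build.yml could look like, assuming the build commands have converged across platforms. The job name, runner labels and command are illustrative placeholders, not the project's actual configuration.

```yaml
# onpush-build.yml
# Hypothetical sketch: assumes identical build commands on both platforms.
name: onpush-build

on: push

jobs:
  build:
    strategy:
      # Let the other platform's job finish even if one platform fails.
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v2
      - name: Build and test
        # Placeholder: the shared build/test commands go here.
        run: echo "build and test commands for ${{ matrix.os }}"
```

This removes the duplication between the per-platform files, at the cost that both matrix jobs live in one workflow and so can only be re-run together for as long as GA restarts whole workflows only.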

Later, I guess we might be able to add onpush-integration-macos.yml etc, and it probably will make sense to split these by platform too, since the commands are more likely to vary and they'll be long-running (i.e. the re-running thing becomes important here too).

AntoineRondelet · 2020-12-18T15:48:51Z

Sounds good, let's do that!

AntoineRondelet requested a review from dtebbs as a code owner December 16, 2020 17:16

AntoineRondelet changed the title ~~Split workflows to facilitate CI restart~~ Split workflows to facilitate CI restart in case of failing jobs Dec 16, 2020

AntoineRondelet added 2 commits December 18, 2020 17:39

Split workflows to facilitate CI restart

fb296f7

Moved generic checks in a separate workflow

93803b0

AntoineRondelet force-pushed the split-ci-workflows branch from 4fe5271 to 3b05af4 Compare December 18, 2020 18:00

AntoineRondelet changed the base branch from cleanup-docker-ci to develop December 18, 2020 18:00

AntoineRondelet force-pushed the split-ci-workflows branch from 3b05af4 to 93803b0 Compare December 18, 2020 18:02

dtebbs approved these changes Dec 18, 2020

AntoineRondelet merged commit 5ff58bc into develop Dec 21, 2020

AntoineRondelet mentioned this pull request Dec 21, 2020

Split Github Action workflows #320

Closed

AntoineRondelet deleted the split-ci-workflows branch December 21, 2020 15:00
