-
Notifications
You must be signed in to change notification settings - Fork 437
Windows: retry a file remove - a virus checker might lock files #11437
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cb58a63
to
55c1a8b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @MSoegtropIMC! While we use Dune on Windows every day at LexiFi, I have not had reports of this particular issue as far as I remember. On the other hand, we explicitly make sure not to use any third-party virus scanners and instead use only the bulit-in Windows one, as we have reports of various types of unreliability related to third-party virus scanners.
That said, the fix seems mostly harmless and looks esentially good to me (modulo two questions below) if it fixes the issues you have observed.
Note that I regularly build about 180 opam packages on Windows (Coq Platform, I guess half of these use dune) and only 3 of these fail - but these fail fairly reliably (90%). I would expect it also fails if only the Windows virus scanner is there. As I explained in the issue this is a problem of how the Windows file system works, not how virus scanners work. Any virus scanner which does its job properly should produce an is 8000 sue if exe files are deleted very shortly after their creation. To fail the package needs to use certain form of PPX in specific ways to fail - what exactly it is I couldn't find out as yet. |
55c1a8b
to
242b027
Compare
@maiste : I saw you just rebased - I still want to add the logging, but I am super busy this week. In Coq Platform I used a patched dune, so it is not urgent for me. If someone just knows how to write the one liner for the logging, please go ahead. I would need 1/2 hour to look into the dune logging system. |
OK, I'll see if I've got time to take a look at it this week 👍 EDIT: I won't have time to work on it soon. |
For what it's worth, I tried to add logging to this PR. But it results in a dependency cycle because the logging module imports |
Yes, the |
Sorry for letting this run stale - I was busy. What are the next steps from my side? @Leonidas-from-XIV : regarding the blocking issue: that the virus scanner keeps a file locked when the build wants to delete it only happens under very high load scenarios and it takes longer than 1s only under extremely high load scenarios. I would say that in such scenarios it doesn't help anybody to exploit further parallelism. I would even say that under the line things will be faster if one dune build (out of the many triggered by opam) stalls a bit until the virus checker finished its job. If you run a single dune build it should hardly ever happen that the virus checker keeps the file locked when the build wants to delete it. Doing code optimisations for scenarios which are extremely unlikely doesn't make sense. |
One possibility is to merge the patch as-is and think about improvements (such as diagnostic logging) separately. Since this is needed to unblock some important builds (eg Coq) on Windows, I would not be opposed to doing that. |
@MSoegtropIMC can you:
Then we can merge the PR. |
@nojb: thanks for the review - I will follow up today as you suggested! |
I think it would make sense to also check the |
Indeed I can do this and see if it still works. The results will take a bit to be conclusive, but we can first merge the more restricted version and if I find that this does not fix it, fix the fix. |
We can find out which is the correct error code by:
|
Wait... based on #11425 (comment) it appears that the correct error code in question is |
I'd like to point out a different possible solution to this issue: simply make the "failed to delete sandbox"-error non-fatal. Ie. print a warning and let the build succeed instead of failing because some file(s) in the sandbox were in use and thus weren't deleted. This could optionally be guarded behind a config option, that Windows users with aggressive anti-virus software can enable. This solution would avoid people raising issues of the nature "dune build is extremely slow" caused by having many files in the sandbox open and thus spending 30 seconds deleting each file. |
No. That the directory is not empty is a follow up effect of the failure to delete the file. If one retries deleting the file until success, the directory would not be empty. |
Only few builds ever experience this and from these few only few need a retry once. The likelihood that you need 30 retires should be "few"^31. Even when multiplied with the number of total dune builds per year, this should be quite a bit smaller than the probability that earth suffers a "global killer" meteor hit in the next year (which is not that small, I would say around 10^-9). As @nojb pointed out in his initial post he does do regular largish builds on Windows and never run into this error, so he would not even had an occasional 1s delay from the fix I suggested. As I said before IMHO it doesn't make sense to optimise something which almost never happens and I would also say that your suggestion is more of a hack than mine because it ignores an error rathern than fixing it. |
I tend to agree. Let's go with the fix that already exist (this PR) and let us monitor what happens. We can always switch to a different approach (the patch is only a few lines long) if needed. |
P.S.: but I admit that your suggestion would fix my issues so if the consensus is to go this way, I don't object. |
OK - I will do the PR cleanup now (something git in between yesterday). |
One advantage of the alternative approach is that we could easily log what is happening to the console, since the sandboxing code has access to the |
628C
Sounds good, I think we can merge this PR as a first step. The alternative solution proposed by @rusv-simcorp could be a bit more general and slightly better from the user's point of view if it is able to emit a warning when the exception situation occurs. @rusv-simcorp: if you are motivated, feel free to open a PR with it, and we can review it and discuss whether we want to replace the fix here with it. |
2e3cdc4
to
4228193
Compare
I'll admit that I know little about how sandboxing works, but the worry I had -- and the issue I'd like to avoid -- is a user, say, opening the sandbox folder in some IDE, which then locks many files in the sandbox folder, thus slowing down the build process by 30 seconds for each such open file. Locking all files in a folder probably is the wrong this to do for any application, but with the patch in this PR it would create a situation (very slow build) that's nearly impossible to debug since the retry operation happens silently. |
Done. I added the restriction to |
Note that the virus checker issue is only there when dune creates executables in a sand box - which only very few projects do. And even if a project does this, issues are rare because they require a rare coincidence. This really happens only once in many million file operations and only for very specific projects (which create executables in sand boxes). The amortised cost is 0 for the vast majority of users (those not creating executables in sand boxes) and between 10ns and 100ns for the rest. I really don't understand what we are discussing here. As @nojb said: in the extremely unlikely case that there is a real issue, it is easy to fix. |
I agree that Here we can add some logic for retries (on windows). |
…eleted Signed-off-by: Nicolás Ojeda Bär <n.oje.bar@gmail.com>
Mmph, OK, fine: I pushed #11566 with an alternative approach. Can we try to quickly converge on a consensual solution? This PR has been opened for almost two months, it would be nice to merge a fix today or tomorrow. Thanks! |
Thinking about this some more, it doesn't sound right to me. If there's a failure to delete a file then this fatal error will be thrown, and execution will stop. I don't see a way for the code to ever get to the last step (of Could it be that the antivirus program is placing files inside the sandbox directory after the
|
No. I have never heard that virus checkers would do this. Besides I did look into the sandbox folders and what was in there was the executable - nothing else. |
Anyway - I posted a solution which is well tested in CI and did some minor adjustments as discussed. I am sure there are 100 other ways of fixing this and that one can discuss this forever. If you see the logging as issue, as I said I would suggest to pass a logging function to the unlink function - it is called only in a few places and in a functional language we have closures which make it easy enough to pass a logging function. Also as I said I would prefer to fix the issue at the root - and I explained that due to the nature of the Windows file system there is no other way than retrying - rather than cleaning up some mess later. |
Looks like the solution you posted initially (which is well-tested) retries regardless of the exception. After which you pushed this change https://github.com/ocaml/dune/compare/2e3cdc4561a7b460396ba3dcc0ce262414e3a754..4228193412b3aa4c8e06e8a5d311fb9dff620838 which retries only in case of the Is the latter (current) solution not well-tested? I'm asking because I don't see how retrying in |
The version with the more strict error check is not tested in CI, but I don't see how this should make a difference. One should get the same error in retrying and anyway one usually does not get an error after a retry, so this will be really hard to test. Of course the stricter check after changing permissions is easy to test but there i also don't see why the error code should change during a retry, so I don't see a difference. |
P.S.: note that clear_dir uses a variant of unlink which throws away all errors, so your argument that dune should fail already when unlinking the files is not correct. |
Thank you for pointing this out. I hadn't noticed this. That explains it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I move to merge the PR in its current state. We can keep the alternatives that have been proposed in mind for the future in case improvements are needed.
@MSoegtropIMC: I applied formatting to your patch and the formatting of your Changes entry. Thank you for your patience!
And thanks everyone for your participation in the discussion!
PR is good to merge once the CI passes.
Fixes ocaml#11425 Signed-off-by: Michael Soegtrop <7895506+MSoegtropIMC@users.noreply.github.com>
Signed-off-by: Nicolás Ojeda Bär <n.oje.bar@gmail.com>
f562827
to
9b1823c
Compare
I should note that we're also experiencing the symptom of #11425 ( |
OK, thanks for the info! Please report back once you have the results. I still intend to merge this PR as-is as we already know that it fixes at least one problem. There's nothing keeping us from adding a second fix along the lines of #11566 later or even replacing this one by a more general fix, but there's no reason to block what has been done here further either. |
@rusv-simcorp : can you create a directory dump of the sand box folder in your CI? |
CHANGES: ### Fixed - Support HaikuOS: don't call `execve` since it's not allowed if other pthreads have been created. The fact that Haiku can't call `execve` from other threads than the principal thread of a process (a team in haiku jargon), is a discrepancy to POSIX and hence there is a [bug about it](https://dev.haiku-os.org/ticket/18665). (@Sylvain78, ocaml/dune#10953) - Fix flag ordering in generated Merlin configurations (ocaml/dune#11503, @voodoos, fixes ocaml/merlin#1900, reported by @vouillon) ### Added - Add `(format-dune-file <src> <dst>)` action. It provides a replacement to `dune format-dune-file` command. (ocaml/dune#11166, @nojb) - Allow the `--prefix` flag when configuring dune with `ocaml configure.ml`. This allows to set the prefix just like `$ dune install --prefix`. (ocaml/dune#11172, @rgrinberg) - Allow arguments starting with `+` in preprocessing definitions (starting with `(lang dune 3.18)`). (@amonteiro, ocaml/dune#11234) - Support for opam `(maintenance_intent ...)` in dune-project (ocaml/dune#11274, @art-w) - Validate opam `maintenance_intent` (ocaml/dune#11308, @art-w) - Support `not` in package dependencies constraints (ocaml/dune#11404, @art-w, reported by @hannesm) ### Changed - Warn when failing to discover root due to reads failing. The previous behavior was to abort. (@KoviRobi, ocaml/dune#11173) - Use shorter path for inline-tests artifacts. (@hhugo, ocaml/dune#11307) - Allow dash in `dune init` project name (ocaml/dune#11402, @art-w, reported by @saroupille) - On Windows, under heavy load, file delete operations can sometimes fail due to AV programs, etc. Guard against it by retrying the operation up to 30x with a 1s waiting gap (ocaml/dune#11437, fixes ocaml/dune#11425, @MSoegtropIMC) - Cache: we now only store the executable permission bit for files (ocaml/dune#11541, fixes ocaml/dune#11533, @ElectreAAS) - Display negative error codes on Windows in hex which is the more customary way to display `NTSTATUS` codes (ocaml/dune#11504, @MisterDA)
Incidentally, the issue reported here looks very similar to 1. in ocaml/ocaml#13921. |
…l#11437) * Windows: retry a file remove - a virus checker might lock files Fixes ocaml#11425 Signed-off-by: Michael Soegtrop <7895506+MSoegtropIMC@users.noreply.github.com> * make fmt Signed-off-by: Nicolás Ojeda Bär <n.oje.bar@gmail.com> * Tweak changes Signed-off-by: Nicolás Ojeda Bär <n.oje.bar@gmail.com> --------- Signed-off-by: Michael Soegtrop <7895506+MSoegtropIMC@users.noreply.github.com> Signed-off-by: Nicolás Ojeda Bär <n.oje.bar@gmail.com> Co-authored-by: Nicolás Ojeda Bär <n.oje.bar@gmail.com>
CHANGES: ### Fixed - Support HaikuOS: don't call `execve` since it's not allowed if other pthreads have been created. The fact that Haiku can't call `execve` from other threads than the principal thread of a process (a team in haiku jargon), is a discrepancy to POSIX and hence there is a [bug about it](https://dev.haiku-os.org/ticket/18665). (@Sylvain78, ocaml/dune#10953) - Fix flag ordering in generated Merlin configurations (ocaml/dune#11503, @voodoos, fixes ocaml/merlin#1900, reported by @vouillon) ### Added - Add `(format-dune-file <src> <dst>)` action. It provides a replacement to `dune format-dune-file` command. (ocaml/dune#11166, @nojb) - Allow the `--prefix` flag when configuring dune with `ocaml configure.ml`. This allows to set the prefix just like `$ dune install --prefix`. (ocaml/dune#11172, @rgrinberg) - Allow arguments starting with `+` in preprocessing definitions (starting with `(lang dune 3.18)`). (@amonteiro, ocaml/dune#11234) - Support for opam `(maintenance_intent ...)` in dune-project (ocaml/dune#11274, @art-w) - Validate opam `maintenance_intent` (ocaml/dune#11308, @art-w) - Support `not` in package dependencies constraints (ocaml/dune#11404, @art-w, reported by @hannesm) ### Changed - Warn when failing to discover root due to reads failing. The previous behavior was to abort. (@KoviRobi, ocaml/dune#11173) - Use shorter path for inline-tests artifacts. (@hhugo, ocaml/dune#11307) - Allow dash in `dune init` project name (ocaml/dune#11402, @art-w, reported by @saroupille) - On Windows, under heavy load, file delete operations can sometimes fail due to AV programs, etc. Guard against it by retrying the operation up to 30x with a 1s waiting gap (ocaml/dune#11437, fixes ocaml/dune#11425, @MSoegtropIMC) - Cache: we now only store the executable permission bit for files (ocaml/dune#11541, fixes ocaml/dune#11533, @ElectreAAS) - Display negative error codes on Windows in hex which is the more customary way to display `NTSTATUS` codes (ocaml/dune#11504, @MisterDA)
…l#11437) * Windows: retry a file remove - a virus checker might lock files Fixes ocaml#11425 Signed-off-by: Michael Soegtrop <7895506+MSoegtropIMC@users.noreply.github.com> * make fmt Signed-off-by: Nicolás Ojeda Bär <n.oje.bar@gmail.com> * Tweak changes Signed-off-by: Nicolás Ojeda Bär <n.oje.bar@gmail.com> --------- Signed-off-by: Michael Soegtrop <7895506+MSoegtropIMC@users.noreply.github.com> Signed-off-by: Nicolás Ojeda Bär <n.oje.bar@gmail.com> Co-authored-by: Nicolás Ojeda Bär <n.oje.bar@gmail.com>
Fixes #11425
A few notes: the Windows error in this situation is "file lockaed" and not "permission denied", but afaik the Unix error codes do not distinguish these cases.
One could be more restrictive in further with blocks also handling only `EACCESS´ - I am not sure what is best.
Regarding the retry time of 30x1s: this might be excessive, but 10s may happen on a heavily loaded machine. Usually a virus scanner should require less than 1s.
I tested this in a large opam meta project (Coq Platform) where without this patch 3 builds fail. With this patch, this worked reliably (tested with dune 3.16.1 because dune 3.17.X has other issues on Windows I didn't research in detail as yet, but e.g. cariro doesn't find cairo.h with 3.17.2 but it does with 3.16.1.