8000 vine: remove invalid files on worker by JinZhou5042 · Pull Request #4133 · cooperative-computing-lab/cctools · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

vine: remove invalid files on worker #4133

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

JinZhou5042
Copy link
Member
@JinZhou5042 JinZhou5042 commented Apr 22, 2025

Proposed Changes

For #4134

Merge Checklist

The following items must be completed before PRs can be merged.
Check these off to verify you have completed all steps.

  • make test Run local tests prior to pushing.
  • make format Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python)
  • make lint Run lint on source code prior to pushing.
  • Manual Update: Update the manual to reflect user-visible changes.
  • Type Labels: Select a github label for the type: bugfix, enhancement, etc.
  • Product Labels: Select a github label for the product: TaskVine, Makeflow, etc.
  • PR RTM: Mark your PR as ready to merge.

@JinZhou5042 JinZhou5042 self-assigned this Apr 22, 2025
@JinZhou5042 JinZhou5042 marked this pull request as ready for review April 22, 2025 00:38
@JinZhou5042
Copy link
Member Author
JinZhou5042 commented Apr 22, 2025

Combining with this pr:

The long tails and execution sparsity seem gone:
image

All file transfers can complete:
image

Most files reach their desired replica count (10):

image

@dthain
Copy link
Member
dthain commented Apr 22, 2025

@JinZhou5042 I don't doubt that this fixes your immediate problem. But it's pretty obvious that this fix is not a good general solution. If you delete a replica at the source every time a transfer files, then problems at the receiving end are going to destroy good data aggressively.

We need to better understand the source of a problem in order to design a good solution.

Can you explain why these transfers are failing in the first place?

@JinZhou5042
Copy link
Member Author

@dthain It turned out that when we receive the cache-invalid message, the source is mostly NULL, and the code really played an effect is this:

if (to_worker) {
    delete_worker_file(q, to_worker, cachename, 0, 0);
} 

I believe the sequence is as follows:

  1. Worker A is fetching many files from worker B.
  2. Worker B suddenly crashes.
  3. All pending transfers on A fail with a series of cache-invalid messages.
  4. Since B has already crashed, its data structures have been released, so vine_current_transfers_get_source_worker(q, transfer_id) consistently returns NULL.
  5. So, we end up deleting files on the destination worker instead of the source.

Even if we don't want to agreesively delete files from the source worker, at least deleting them from the destination worker that returns the cache-invalid seems reasonable?

@JinZhou5042
Copy link
Member Author

And delete_worker_file does two things:

  • send a unlink message to the dest worker
  • remove the replica from the vine_file_replica_table

I believe that file has been correctly removed on the worker due to the transfer failure, so what really matters is that we should remove it from the replica table?

@JinZhou5042
Copy link
Member Author

Could it be that, the transfer has failed, but a zombie record with VINE_FILE_STATE_PENDING remains in the replica table? If a task later requests that file, the manager sees one of its input files as pending, and it simply delays scheduling that task. Because that file cannot come back again, and we lack a replica cleaning strategy, then the workflow is stuck forever.

@dthain
Copy link
Member
dthain commented Apr 22, 2025

@JinZhou5042 you are making a lot of hypotheses that could be answered by reading the code. :)

Look at handle_cache_invalid It removes the replica, and then removes the transfer record. An unlink is not necessary when the transfer fails, because the failed transfer is not committed to the cache on the worker side. Also note that it calls vine_current_transfers_set_failure to keep track of the number of related failures, and throttle future replications.

It's ok to hypothesize potential bugs, but then you need to show a path through the code that is clearly incorrect. Don't just guess!

@JinZhou5042
Copy link
Member Author

Ah yes, I missed a couple of lines.

@JinZhou5042
Copy link
Member Author

See #4152

@JinZhou5042 JinZhou5042 closed this May 1, 2025
@JinZhou5042 JinZhou5042 deleted the handle_cache_invalid branch May 1, 2025 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0