vine: starvation in worker transfers #4116


Closed
JinZhou5042 opened this issue Apr 15, 2025 · 6 comments
Assignees
Labels
bug (For modifications that fix a flaw in the code.) · critical · TaskVine

Comments

@JinZhou5042
Member
JinZhou5042 commented Apr 15, 2025

I'm fairly certain this is the primary issue causing the significant slowdown at the end of large workflows.

Previously, I tried plotting the number of incoming and outgoing transfers on each worker to check whether file transfers were fast enough and whether they might be the bottleneck. I consistently observed that the manager scheduled transfers early on, but never received responses indicating whether the transfers succeeded or failed. I initially assumed this was a bug in the plotting logic.

For example, the following figure shows the number of incoming transfers on each worker (each line represents a worker). It seems that some transfers got stuck on certain workers without making any progress:

[Figure: number of incoming transfers per worker over time]

It turned out that there was a bug in the worker's file-fetching logic. In vine_cache_ensure, the worker forks a process to fetch files from other workers (the same thing vine_transfer_server.c does), and the number of concurrent fetch processes is limited by c->max_transfer_procs. However, once this threshold is reached, additional transfers are silently delayed forever.
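For context, the shape of the problem is roughly the following. This is a sketch, not the actual cctools source: current_fetch_procs() and start_fetch_process() are hypothetical stand-ins, while the status names follow the constants used in the fix below.

vine_cache_status_t ensure_sketch(struct vine_cache *c, const char *cachename)
{
	/* Rough sketch of the throttled path inside vine_cache_ensure(): once the
	   per-worker limit is hit, the call reports PENDING but nothing queues the
	   request for a later retry, so a transfer that nobody asks about again is
	   delayed indefinitely. */
	if (current_fetch_procs(c) >= c->max_transfer_procs) {
		return VINE_CACHE_STATUS_PENDING; /* silently parked, never revisited */
	}

	start_fetch_process(c, cachename); /* fork a fetch process, as vine_transfer_server.c does */
	return VINE_CACHE_STATUS_PROCESSING;
}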

This creates a serious issue: the manager enforces a throttle (q->worker_src_max_transfers, default 10) per worker, assuming that each scheduled transfer is actively being handled. If a transfer never completes and no response is returned, the manager incorrectly assumes the worker is still busy and avoids scheduling further transfers to it.
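To make that feedback loop concrete, here is a hedged sketch of the manager-side check; only q->worker_src_max_transfers comes from the issue, and the worker field name is illustrative rather than the real structure member.

static int source_has_capacity(struct vine_manager *q, struct vine_worker_info *w)
{
	/* Count of transfers the manager believes this worker is currently serving
	   as a source. If stuck transfers never report back, this count never drops
	   and the worker is never chosen as a transfer source again. */
	return w->outgoing_transfer_count < q->worker_src_max_transfers;
}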

My tentative solution is pretty straightforward and gives me consistently great performance: the concurrency throttle is still 10, but we periodically check the pending transfers and try to schedule them in the worker's main loop. The function looks like:

int vine_cache_process_pending(struct vine_cache *c)
{
	int processed = 0;

	/* Scan each queued entry at most once per call, so transfers that remain
	   pending are retried on the next pass instead of spinning here. */
	int to_check = list_size(c->pending_transfers);

	while (to_check-- > 0 && list_size(c->processing_transfers) < c->max_transfer_procs) {
		char *queue_cachename = list_pop_head(c->pending_transfers);
		if (!queue_cachename) {
			break;
		}

		vine_cache_status_t status = vine_cache_ensure(c, queue_cachename);
		if (status == VINE_CACHE_STATUS_PROCESSING) {
			/* The transfer was started: track it as in flight. */
			list_push_tail(c->processing_transfers, queue_cachename);
			processed++;
		} else if (status == VINE_CACHE_STATUS_PENDING) {
			/* Still throttled or blocked: requeue it for a later pass. */
			list_push_tail(c->pending_transfers, queue_cachename);
		} else {
			/* Ready or failed: nothing left to track for this entry. */
			free(queue_cachename);
		}
	}

	return processed;
}
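For illustration, the call site might look like the following; the loop shape and the other calls are assumptions about the worker's main loop, not the actual code:

	/* Sketch: the worker main loop retries queued transfers every iteration,
	   so slots freed under max_transfer_procs are refilled promptly. */
	while (!worker_done) {
		handle_manager_messages(manager);        /* hypothetical: process puturl / puturl_now, etc. */
		reap_finished_transfer_processes(cache); /* hypothetical: completed fetch processes free slots */
		vine_cache_process_pending(cache);       /* start pending transfers within the limit */
	}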

With this change, the manager's throttle can be set to a very large value (e.g., 10000), while each worker manages its own incoming and outgoing concurrency. All transfers (whether they succeed or fail) complete quickly without causing delays.

@dthain
Member
dthain commented Apr 15, 2025

Glad you found it! I'm not quite sure that I understand the solution. Make a PR?

@JinZhou5042
Member Author

Will do that.

@JinZhou5042 JinZhou5042 added the bug and TaskVine labels Apr 16, 2025
@dthain
Member
dthain commented Apr 16, 2025

Per our discussion today, the problem:

  • Normal file transfer jobs will eventually get run because they are required by tasks that will call vine_cache_ensure() on that file.
  • However, file transfer jobs requested purely for replication are not associated with any task, so nothing ever calls vine_cache_ensure() on them, and they never get run.
  • Eventually, the manager will stop sending file transfer requests to the worker, and the workflow gets stuck.

Proposed solution:

  • Simplest: Periodically attempt to start all transfers (within concurrency limits). That removes the "priority" currently given to task-related file transfers.
  • A little more complicated: Periodically attempt to start only the transfers carrying the NOW flag, because those are associated with replication and not with tasks (see the sketch below).
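A minimal sketch of that second option, assuming a hypothetical lookup transfer_was_requested_now() that reports whether a cachename was queued via puturl_now; the other names follow the function posted earlier in this thread.

int vine_cache_start_now_transfers(struct vine_cache *c)
{
	int started = 0;
	int to_check = list_size(c->pending_transfers);

	while (to_check-- > 0 && list_size(c->processing_transfers) < c->max_transfer_procs) {
		char *cachename = list_pop_head(c->pending_transfers);
		if (!cachename) {
			break;
		}

		if (!transfer_was_requested_now(c, cachename)) {
			/* Task-related transfer: a task will call vine_cache_ensure() on it anyway. */
			list_push_tail(c->pending_transfers, cachename);
			continue;
		}

		vine_cache_status_t status = vine_cache_ensure(c, cachename);
		if (status == VINE_CACHE_STATUS_PROCESSING) {
			list_push_tail(c->processing_transfers, cachename);
			started++;
		} else if (status == VINE_CACHE_STATUS_PENDING) {
			list_push_tail(c->pending_transfers, cachename);
		} else {
			free(cachename);
		}
	}

	return started;
}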

Question: Why exactly do we distinguish between puturl and puturl_now? What problem does that solve, and is it acceptable to unify the cases?

@colinthomas-z80 what do you think?

@colinthomas-z80
Contributor

My understanding was that replication transfer jobs are sent out with puturl_now, and when a worker receives that message it makes the transfer without delay.

I think some of our code operated under the assumption that vine_cache_ensure is synchronous, and a call to it results in an immediate transfer. That is what puturl_now implies.

Now that we've introduced the worker managing its own transfer bandwidth, maybe we need to modify the way the worker interprets puturl_now so that it happens immediately again.

@dthain
Member
dthain commented Apr 16, 2025

Careful - vine_cache_ensure is not synchronous. If the requested file is already present, it returns READY. If not, it (may) start the transfer and return another status indicating NOTYET.

The problem is that transfers don't currently get started unless somebody calls vine_cache_ensure. And (ironically) replication requests are not associated with a task, and so nobody calls ensure on them.
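In other words, callers have to treat vine_cache_ensure as a poll rather than a blocking call. A sketch of that pattern follows; the caller-side handling is an assumption, and the status constant names follow those used earlier in this thread, with READY standing in for the "already present" case.

	vine_cache_status_t status = vine_cache_ensure(cache, cachename);
	switch (status) {
	case VINE_CACHE_STATUS_READY:
		/* File is already present in the cache: proceed with the task. */
		break;
	case VINE_CACHE_STATUS_PROCESSING:
	case VINE_CACHE_STATUS_PENDING:
		/* Transfer started or still queued: check again on a later pass. */
		break;
	default:
		/* Failure cases handled elsewhere. */
		break;
	}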

@JinZhou5042 JinZhou5042 changed the title vine: file transfers might be queued up permanently vine: worker transfer starvation Apr 19, 2025
@JinZhou5042 JinZhou5042 changed the title vine: worker transfer starvation vine: starvation in worker transfers Apr 19, 2025
@JinZhou5042
Member Author

Resolved in #4130
