vine: starvation in worker transfers #4116
Glad you found it. I'm not quite sure that I understand the solution. Make a PR?

Will do that.
Per our discussion today, the problem:
Proposed solution:
Question: Why exactly do we distinguish between [...]? @colinthomas-z80 what do you think?
My understanding was that replication transfer jobs are sent out with [...]. I think some of our code operated under the assumption that [...]. Now that we have introduced the worker managing its own transfer bandwidth, maybe we need to modify the way the worker interprets [...].
Careful: the problem is that transfers don't currently get started unless somebody calls [...].
Resolved in #4130.
I'm fairly certain this is the primary issue causing the significant slowdown at the end of large workflows.
Previously, I tried plotting the number of incoming and outgoing transfers on each worker to check whether file transfers were fast enough and whether they might be the bottleneck. I consistently observed that the manager scheduled transfers early on, but never received responses indicating whether the transfers succeeded or failed. I initially assumed this was a bug in the plotting logic.
For example, the following figure shows the number of incoming transfers on each worker (each line represents a worker). It seems that some transfers got stuck on certain workers without making any progress:
It turned out that there was a bug in the worker's file fetching logic. In `vine_cache_ensure`, the worker forks a process to fetch files from other workers (the same thing vine_transfer_server.c does), and the number of concurrent fetch processes is limited by `c->max_transfer_procs`. However, once this threshold is reached, additional transfers are silently delayed forever.

This creates a serious issue: the manager enforces a throttle (`q->worker_src_max_transfers`, default 10) per worker, assuming that each scheduled transfer is actively being handled. If a transfer never completes and no response is returned, the manager incorrectly assumes the worker is still busy and avoids scheduling further transfers to it.

My tentative solution is pretty straightforward and gives me consistently good performance: the concurrency throttle stays at 10, but we periodically check the pending transfers and try to schedule them in the worker's main loop. The function looks like:
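A minimal sketch of that periodic check, assuming the cache keeps a queue of pending fetch requests; the names `pending_transfers`, `transfer_procs`, and `vine_cache_start_fetch` are hypothetical, not the actual cctools API:

```c
/* Sketch only: assumes cctools' list.h and the (hypothetical) fields
 * c->pending_transfers, c->transfer_procs, and helper vine_cache_start_fetch. */
#include "list.h"
#include "vine_cache.h"

static void vine_cache_schedule_pending(struct vine_cache *c)
{
	/* Start queued fetches while we are under the concurrency limit. */
	while (list_size(c->transfer_procs) < c->max_transfer_procs
			&& list_size(c->pending_transfers) > 0) {
		struct vine_cache_file *f = list_pop_head(c->pending_transfers);

		/* Fork a fetch process for this file, just as vine_cache_ensure
		 * does when a slot is free.  On failure, requeue the request so
		 * it is retried on the next pass through the main loop. */
		if (!vine_cache_start_fetch(c, f)) {
			list_push_tail(c->pending_transfers, f);
			break;
		}
	}
}
```

The key point is that the worker itself retries the queued requests, so a full slot table never turns into permanent starvation while the manager waits for responses that will never arrive.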
It also turned out that the manager's throttle can be set to a very large value (e.g., 10000), while each worker handles its own incoming and outgoing concurrency. All transfers (whether they succeed or fail) then complete quickly without causing delays.
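For illustration, raising the manager-side throttle is just a tuning call on the manager; a sketch assuming the knob is exposed under the name "worker-source-max-transfers" (the exact parameter string may differ):

```c
#include "taskvine.h"

int main(void)
{
	struct vine_manager *m = vine_create(VINE_DEFAULT_PORT);

	/* Effectively remove the manager-side per-worker transfer throttle
	 * (parameter name assumed here); the worker's own limit
	 * (c->max_transfer_procs) then governs the real concurrency. */
	vine_tune(m, "worker-source-max-transfers", 10000);

	/* ... declare files and submit tasks as usual ... */

	vine_delete(m);
	return 0;
}
```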