Description
I was able to reproduce the segfault that I posted several days ago:

In frame 0, it says that `item->list->head` is an invalid address:

It seems the addresses of `q->waiting_retrieval_list` and its first item's `list` pointer are different:

Here `q->waiting_retrieval_list` is valid, but its first item's `list` is invalid. When we call `list_remove` to remove the first item, it internally accesses the invalid `list` pointer, resulting in a segfault.
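To see why a corrupted back-pointer ends in a crash inside `list_remove`, here is a minimal sketch of a generic doubly linked list removal. It is an illustration of the failure mode only, not the actual cctools `list` code; the struct and function names are invented stand-ins.

```c
#include <stddef.h>

/* Invented stand-ins for illustration only, not the cctools definitions. */
struct list;

struct item {
	void *data;
	struct item *next;
	struct item *prev;
	struct list *list;    /* back-pointer to the owning list */
};

struct list {
	struct item *head;
	struct item *tail;
	int size;
};

/* A typical unlink step trusts item->list to reach the owning list.
 * If item->list has been overwritten with garbage, as in this bug,
 * reading item->list->head dereferences an invalid address and segfaults. */
static void unlink_item(struct item *item)
{
	struct list *l = item->list;                  /* garbage pointer here... */
	if (l->head == item) l->head = item->next;    /* ...crash on this read   */
	if (l->tail == item) l->tail = item->prev;
	if (item->prev) item->prev->next = item->next;
	if (item->next) item->next->prev = item->prev;
	item->list = NULL;
	l->size--;
}

int main(void)
{
	struct list l = { 0 };
	struct item it = { .data = NULL, .next = NULL, .prev = NULL, .list = &l };
	l.head = l.tail = &it;
	l.size = 1;
	unlink_item(&it);   /* fine here; it only crashes once it.list is corrupted */
	return 0;
}
```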
This bug was tricky and took quite some time to track down. In short, the issue wasn't a flaw in the `list` data structure, but rather a use-after-free of `t->library_task`.

A library task had previously been freed via `vine_task_delete`, but a function task `t` was never notified and continued to hold a dangling reference in `t->library_task`.

Later, the same memory address was reused by `malloc` for a different struct. When the manager checked whether `t->library_task` was `NULL`, the pointer was still non-NULL, so the manager went ahead and wrote into what it believed was a valid `vine_task *`. However, since the memory had been reallocated, the actual type at that address had changed, so writing to it as a `vine_task` ended up corrupting unrelated data structures.
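The core failure can be reproduced in isolation. The sketch below uses invented stand-in structs (not the real `vine_task` or `list_item`) and shows how a dangling pointer still passes a `NULL` check after `free`, so a later write may land inside whatever object the allocator has since placed at that address. Whether the address is actually reused depends on the allocator, so the aliasing is not guaranteed on every run.

```c
#include <stdio.h>
#include <stdlib.h>

/* Invented stand-ins, deliberately the same size so the allocator is
 * likely to hand the freed chunk straight back; not the real structs. */
struct fake_task { long pad[2]; int function_slots_inuse; };
struct fake_item { void *data; void *next; void *list; };

int main(void)
{
	/* A function task keeps a pointer to its "library task"... */
	struct fake_task *library_task = malloc(sizeof *library_task);

	/* ...which is freed elsewhere without clearing that reference. */
	free(library_task);

	/* The allocator may reuse the exact same chunk for a different type. */
	struct fake_item *item = malloc(sizeof *item);
	printf("stale library_task = %p\n", (void *)library_task);
	printf("new list item      = %p\n", (void *)item);

	/* The stale pointer is still non-NULL, so this check passes... */
	if (library_task != NULL) {
		/* ...and the write silently corrupts the unrelated item.
		 * Deliberate use-after-free: undefined behavior, shown only
		 * to illustrate the bug. */
		library_task->function_slots_inuse = 9999;
	}

	free(item);
	return 0;
}
```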
The debugging process went as follows.

First, here is the detailed structure of the task at the head of `q->waiting_retrieval_list`, the one whose list item has the broken `list` pointer:
(gdb) p *(struct vine_task *)q->waiting_retrieval_list->head->data
$32 = {task_id = 2224, type = VINE_TASK_TYPE_STANDARD, command_line = 0x555559b46f90 "execute_graph_vertex",
tag = 0x555559b47580 "dag-140736739526480",
category = 0x555559b47d60 "subgraph_callable-884b9064d09fea9d516321ee9d2bdc4c",
monitor_output_directory = 0x0, monitor_snapshot_file = 0x0,
needs_library = 0x555559b47c70 "Dask-Library-140736739526480", provides_library = 0x0,
function_slots_requested = -1, func_exec_mode = 0, input_mounts = 0x555559b477a0,
output_mounts = 0x555559b477f0, env_list = 0x555559b47840, feature_list = 0x555559b47890,
resource_request = CATEGORY_ALLOCATION_FIRST, worker_selection_algorithm = VINE_SCHEDULE_UNSET,
priority = 1745349430.946909, max_retries = 0, max_forsaken = -1, min_running_time = 0,
input_files_size = -1, state = VINE_TASK_WAITING_RETRIEVAL, worker = 0x0, library_task = 0x55555aa9cf70,
library_log_path = 0x0, try_count = 1, forsaken_count = 1, library_failed_count = 0,
exhausted_attempts = 0, forsaken_attempts = 0, workers_slow = 0, function_slots_total = 0,
function_slots_inuse = 0, result = VINE_RESULT_FORSAKEN, exit_code = -1, output_received = 0,
output_length = 0, output = 0x0, addrport = 0x55555a9e6c20 "10.32.88.131:42076",
hostname = 0x55555a93a7b0 "qa-xp-018.crc.nd.edu", time_when_submitted = 1745349524959569,
time_when_done = 1745349530158519, time_when_commit_start = 1745349529791067,
time_when_commit_end = 1745349529791297, time_when_retrieval = 1745349530158501,
time_when_last_failure = 1745349530154706, time_workers_execute_last_start = 0,
time_workers_execute_last_end = 0, time_workers_execute_last = 0, time_workers_execute_all = 0,
time_workers_execute_exhaustion = 0, time_workers_execute_failure = 0, bytes_received = 0,
bytes_sent = 430, bytes_transferred = 430, resources_allocated = 0x555559b47b40,
resources_measured = 0x555559b47a10, resources_requested = 0x555559b478e0, current_resource_box = 0x0,
sandbox_measured = 0, has_fixed_locations = 0, group_id = 0, refcount = 2}
At first glance, the addresses and values look fine.

Then, we printed the addresses of all items in `q->waiting_retrieval_list`. The list contains four items, each of type `struct list_item *`:
(gdb) p q->waiting_retrieval_list->head
$106 = (struct list_item *) 0x55555aa9d030
(gdb) p q->waiting_retrieval_list->head->next
$107 = (struct list_item *) 0x55555aa9cff0
(gdb) p q->waiting_retrieval_list->head->next->next
$108 = (struct list_item *) 0x55555aa9cfb0
(gdb) p q->waiting_retrieval_list->head->next->next->next
$109 = (struct list_item *) 0x55555aa9cf70
Suspiciously, `q->waiting_retrieval_list->head->data->library_task` has exactly the same address as `q->waiting_retrieval_list->head->next->next->next`!

Both point to `0x55555aa9cf70`, but one is a list item and the other is the `library_task` field of a task. This indicates that one of the pointers was freed earlier and its address was then reallocated to another object (the memory is a mess when I print `*(struct vine_task *)q->waiting_retrieval_list->head->data->library_task`, so I'm quite sure the library task had been released before).
In this case, `t->library_task` was previously freed by `vine_task_delete`, but `t` was not notified and continues to hold a stale reference. As a result, `t->library_task != NULL` still evaluates to true, and the manager continues to use the library task without realizing it is invalid.
This leads to two possible scenarios:

- If the memory address of `t->library_task` was freed and not yet reallocated, any access to its fields may trigger a segmentation fault.
- A subtler issue arises if the freed address was reallocated for another purpose. In that case, writes to `t->library_task` may succeed, but the behavior becomes completely unpredictable: the data type may no longer be `vine_task *`, and we end up corrupting memory by writing over unrelated structures.
Looking back at the backtrace, the invocation chain is:

`receive_one_task -> fetch_outputs_from_worker -> reap_task_from_worker -> cctools_list_remove`
Going through the implementation of these functions, we found one line that looks highly suspicious:

In `reap_task_from_worker`, we check `t->needs_library`, which is true. Then we write directly to `t->library_task->function_slots_inuse`. However, as discussed above, `t->library_task` had already been freed, and `t` continues to hold a dangling reference, assuming it still points to a valid `vine_task *`.
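For reference, the pattern in question looks roughly like the fragment below. This is a paraphrase built from the description above with cut-down stand-in types, not a copy of the cctools source; the exact guard and the exact write may differ.

```c
/* Cut-down stand-in type; the real struct vine_task is far larger. */
struct task_like {
	char *needs_library;             /* name of the library this task needs */
	struct task_like *library_task;  /* manager-side handle of that library */
	int function_slots_inuse;
};

/* Rough paraphrase of the suspect step in reap_task_from_worker(): the
 * only protection is a non-NULL pointer check, which a dangling
 * reference still satisfies after the library task has been deleted. */
static void release_library_slot(struct task_like *t)
{
	if (t->needs_library && t->library_task) {
		/* If library_task was freed and its address reused, this
		 * write lands inside whatever object now owns that memory. */
		t->library_task->function_slots_inuse--;
	}
}

int main(void)
{
	struct task_like lib = { .function_slots_inuse = 1 };
	struct task_like t = { .needs_library = "Dask-Library", .library_task = &lib };
	release_library_slot(&t);   /* safe here only because lib is still alive */
	return lib.function_slots_inuse;
}
```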
In reality, that memory address was reallocated and is now used by an item in `q->waiting_retrieval_list`, whose actual type is `struct list_item *`. So when we write to `t->library_task->function_slots_inuse`, we are mistakenly interpreting and modifying a `struct list_item` through the layout of a `vine_task`, which corrupts the memory: we are effectively writing garbage into an unrelated list item!
These are our hypotheses. To verify them, let's trigger the behavior directly by writing to `library_task->function_slots_inuse` in gdb, which should replicate the same effect as what happens in the code:

set ((struct vine_task *)$t->library_task)->function_slots_inuse = 9999
Then let's print the corrupted pointer `q->waiting_retrieval_list->head->list`:
(gdb) p q->waiting_retrieval_list->head->list
$79 = (struct list *) 0x2598d596fc1f0
Compared to the originally corrupted address `0x270fd596fc1f0` (shown in the third figure), the value has changed again. For comparison, I printed every other address I could find, and they all remained unchanged.
This confirms that writing to `t->library_task` alters the memory of a list item in `q->waiting_retrieval_list`, leading to a cascade of downstream segfaults.
This bug is very hard to track down. Right after the failing run, I immediately conducted another 20 runs in the same environment, aggressively killing workers, and none of them had issues. The bug only manifests under very rare and coincidental conditions:
- A `library_task` is freed while another task still holds a reference to it.
- The exact same address is reallocated by `malloc()` for a different purpose, as a `list_item`.
- The task `t` still writes to the now-invalid `t->library_task` as if it were a valid `vine_task *`.
- This write accidentally overlaps with the internal structure of the `list_item`. Ironically, it hits the `list` field every time, probably due to some memory layout or allocation rules (see the offset sketch after this list).
- The corrupted `list_item` is later passed to `list_remove()`, which dereferences invalid pointers and causes a crash.
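As for why the `list` field gets hit every time, a plausible explanation is plain offset arithmetic: the stale write always lands at a fixed byte offset from the freed address (the offset of `function_slots_inuse` inside `vine_task`), and because the allocator laid the list items out deterministically, the same member of the same nearby `list_item` sits at that spot on every run. The sketch below illustrates the idea in a deliberately simplified form, with invented layouts where the offsets coincide inside a single small struct; the real offsets and chunk spacing in cctools differ.

```c
#include <stdio.h>
#include <stddef.h>

/* Invented layouts chosen so the offsets line up; not the real structs. */
struct task_like {                  /* stands in for struct vine_task    */
	long pad[2];
	int function_slots_inuse;   /* the field the stale write targets */
};

struct item_like {                  /* stands in for struct list_item    */
	void *data;
	void *next;
	void *list;                 /* back-pointer later used by list_remove */
};

int main(void)
{
	printf("stale write lands at byte offset %zu of the freed chunk\n",
	       offsetof(struct task_like, function_slots_inuse));
	printf("item_like.data at %zu, .next at %zu, .list at %zu\n",
	       offsetof(struct item_like, data),
	       offsetof(struct item_like, next),
	       offsetof(struct item_like, list));

	/* With these made-up layouts the write offset (16) coincides with
	 * the 'list' member, so every stale write clobbers item->list,
	 * exactly the pointer that list_remove dereferences later. */
	return 0;
}
```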
I guess I didn't observe this failure during the paper effort because I had hacked `vine_task_delete` to return immediately without releasing any tasks.
To fix this, the simplest way might be to scan all tasks whenever a library task exits abnormally and set their `library_task` field to `NULL`. But this solution is likely too costly in practice.
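For completeness, here is a rough sketch of that idea, using a plain array as a stand-in for the manager's task bookkeeping. The real fix would have to hook into `vine_task_delete` (or wherever the library task is released) and walk the manager's actual structures, so the names and loop below are illustrative only.

```c
#include <stddef.h>

/* Minimal stand-in; not the real cctools type. */
struct task_like {
	struct task_like *library_task;
};

/* Sketch of the proposed (but potentially costly) fix: when a library
 * task is about to go away, walk every known task and clear any
 * reference to it, so later t->library_task != NULL checks no longer
 * succeed on a dangling pointer. */
static void clear_library_references(struct task_like **all_tasks, size_t n,
                                     const struct task_like *dying_library)
{
	for (size_t i = 0; i < n; i++) {
		if (all_tasks[i] && all_tasks[i]->library_task == dying_library) {
			all_tasks[i]->library_task = NULL;
		}
	}
}

int main(void)
{
	struct task_like lib = { 0 };
	struct task_like t = { .library_task = &lib };
	struct task_like *tasks[] = { &t };

	/* Before deleting/freeing the library task, drop all references. */
	clear_library_references(tasks, 1, &lib);
	return t.library_task != NULL;   /* 0: the dangling reference is gone */
}
```

An alternative that avoids the full scan might be to lean on the task's existing `refcount` (visible in the struct dump above) so the library task is not actually released while any function task still points at it, but whether that fits the current lifecycle is an open question.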
It also makes me wonder whether the segfaults I've posted recently might share the same root cause as this one.