Description
I was able to reproduce the segfault that I posted several days ago:

In frame 0, it says that `item->list->head` is an invalid address:

It seems the addresses of `q->waiting_retrieval_list` and its first item's `list` pointer are different:

Here `q->waiting_retrieval_list` is valid, but its first item's `list` is invalid. When we call `list_remove` to remove the first item, it internally accesses the invalid `list` pointer, resulting in a segfault.
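To see why a corrupted back-pointer ends in a crash inside `list_remove`, here is a minimal sketch of a generic doubly linked list removal. It is an illustration of the failure mode only, not the actual cctools `list` code; the struct and function names are invented stand-ins.

```c
#include <stddef.h>

/* Invented stand-ins for illustration only, not the cctools definitions. */
struct list;

struct item {
	void *data;
	struct item *next;
	struct item *prev;
	struct list *list;    /* back-pointer to the owning list */
};

struct list {
	struct item *head;
	struct item *tail;
	int size;
};

/* A typical unlink step trusts item->list to reach the owning list.
 * If item->list has been overwritten with garbage, as in this bug,
 * reading item->list->head dereferences an invalid address and segfaults. */
static void unlink_item(struct item *item)
{
	struct list *l = item->list;                  /* garbage pointer here... */
	if (l->head == item) l->head = item->next;    /* ...crash on this read   */
	if (l->tail == item) l->tail = item->prev;
	if (item->prev) item->prev->next = item->next;
	if (item->next) item->next->prev = item->prev;
	item->list = NULL;
	l->size--;
}

int main(void)
{
	struct list l = { 0 };
	struct item it = { .data = NULL, .next = NULL, .prev = NULL, .list = &l };
	l.head = l.tail = &it;
	l.size = 1;
	unlink_item(&it);   /* fine here; it only crashes once it.list is corrupted */
	return 0;
}
```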
This bug was tricky and took quite some time to track down. In short, the issue wasn't a flaw in the `list` data structure, but rather a use-after-free of `t->library_task`.

A library task had previously been freed via `vine_task_delete`, but a function task `t` was never notified and continued to hold a dangling reference in `t->library_task`.

Later, the same memory address was reused by `malloc` for a different struct. When the manager checked whether `t->library_task` was `NULL`, the pointer was still non-NULL, so the manager went ahead and wrote into what it believed was a valid `vine_task *`. However, since the memory had been reallocated, the actual type at that address had changed, so writing to it as a `vine_task` ended up corrupting unrelated data structures.
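The core failure can be reproduced in isolation. The sketch below uses invented stand-in structs (not the real `vine_task` or `list_item`) and shows how a dangling pointer still passes a `NULL` check after `free`, so a later write may land inside whatever object the allocator has since placed at that address. Whether the address is actually reused depends on the allocator, so the aliasing is not guaranteed on every run.

```c
#include <stdio.h>
#include <stdlib.h>

/* Invented stand-ins, deliberately the same size so the allocator is
 * likely to hand the freed chunk straight back; not the real structs. */
struct fake_task { long pad[2]; int function_slots_inuse; };
struct fake_item { void *data; void *next; void *list; };

int main(void)
{
	/* A function task keeps a pointer to its "library task"... */
	struct fake_task *library_task = malloc(sizeof *library_task);

	/* ...which is freed elsewhere without clearing that reference. */
	free(library_task);

	/* The allocator may reuse the exact same chunk for a different type. */
	struct fake_item *item = malloc(sizeof *item);
	printf("stale library_task = %p\n", (void *)library_task);
	printf("new list item      = %p\n", (void *)item);

	/* The stale pointer is still non-NULL, so this check passes... */
	if (library_task != NULL) {
		/* ...and the write silently corrupts the unrelated item.
		 * Deliberate use-after-free: undefined behavior, shown only
		 * to illustrate the bug. */
		library_task->function_slots_inuse = 9999;
	}

	free(item);
	return 0;
}
```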
The debugging process went as follows.

First, here is the detailed structure of the task at the head of `q->waiting_retrieval_list`, the one whose list item has the broken `list` pointer:
(gdb) p *(struct vine_task *)q->waiting_retrieval_list->head->data
$32 = {task_id = 2224, type = VINE_TASK_TYPE_STANDARD, command_line = 0x555559b46f90 "execute_graph_vertex",
tag = 0x555559b47580 "dag-140736739526480",
category = 0x555559b47d60 "subgraph_callable-884b9064d09fea9d516321ee9d2bdc4c",
monitor_output_directory = 0x0, monitor_snapshot_file = 0x0,
needs_library = 0x555559b47c70 "Dask-Library-140736739526480", provides_library = 0x0,
function_slots_requested = -1, func_exec_mode = 0, input_mounts = 0x555559b477a0,
output_mounts = 0x555559b477f0, env_list = 0x555559b47840, feature_list = 0x555559b47890,
resource_request = CATEGORY_ALLOCATION_FIRST, worker_selection_algorithm = VINE_SCHEDULE_UNSET,
priority = 1745349430.946909, max_retries = 0, max_forsaken = -1, min_running_time = 0,
input_files_size = -1, state = VINE_TASK_WAITING_RETRIEVAL, worker = 0x0, library_task = 0x55555aa9cf70,
library_log_path = 0x0, try_count = 1, forsaken_count = 1, library_failed_count = 0,
exhausted_attempts = 0, forsaken_attempts = 0, workers_slow = 0, function_slots_total = 0,
function_slots_inuse = 0, result = VINE_RESULT_FORSAKEN, exit_code = -1, output_received = 0,
output_length = 0, output = 0x0, addrport = 0x55555a9e6c20 "10.32.88.131:42076",
hostname = 0x55555a93a7b0 "qa-xp-018.crc.nd.edu", time_when_submitted = 1745349524959569,
time_when_done = 1745349530158519, time_when_commit_start = 1745349529791067,
time_when_commit_end = 1745349529791297, time_when_retrieval = 1745349530158501,
time_when_last_failure = 1745349530154706, time_workers_execute_last_start = 0,
time_workers_execute_last_end = 0, time_workers_execute_last = 0, time_workers_execute_all = 0,
time_workers_execute_exhaustion = 0, time_workers_execute_failure = 0, bytes_received = 0,
bytes_sent = 430, bytes_transferred = 430, resources_allocated = 0x555559b47b40,
resources_measured = 0x555559b47a10, resources_requested = 0x555559b478e0, current_resource_box = 0x0,
sandbox_measured = 0, has_fixed_locations = 0, group_id = 0, refcount = 2}
At first glance, the addresses and values look fine.

Then, we printed the addresses of all items in `q->waiting_retrieval_list`. The list contains four items, each of type `struct list_item *`:
(gdb) p q->waiting_retrieval_list->head
$106 = (struct list_item *) 0x55555aa9d030
(gdb) p q->waiting_retrieval_list->head->next
$107 = (struct list_item *) 0x55555aa9cff0
(gdb) p q->waiting_retrieval_list->head->next->next
$108 = (struct list_item *) 0x55555aa9cfb0
(gdb) p q->waiting_retrieval_list->head->next->next->next
$109 = (struct list_item *) 0x55555aa9cf70
Suspiciously, `q->waiting_retrieval_list->head->data->library_task` has exactly the same address as `q->waiting_retrieval_list->head->next->next->next`!

Both point to `0x55555aa9cf70`, but one is a list item and the other is the `library_task` field of a task. This indicates that one of the pointers was freed earlier and its address was then reallocated to another object (the memory is a mess when I print `*(struct vine_task *)q->waiting_retrieval_list->head->data->library_task`, so I'm quite sure the library task had been released before).
In this case, `t->library_task` was previously freed by `vine_task_delete`, but `t` was not notified and continues to hold a stale reference. As a result, `t->library_task != NULL` still evaluates to true, and the manager continues to use the library task without realizing it is invalid.
This leads to two possible scenarios:

- If the memory address of `t->library_task` was freed and not yet reallocated, any access to its fields may trigger a segmentation fault.
- A subtler issue arises if the freed address was reallocated for another purpose. In that case, writes to `t->library_task` may succeed, but the behavior becomes completely unpredictable: the data type may no longer be `vine_task *`, and we end up corrupting memory by writing over unrelated structures.
Looking back at the backtrace, the invocation chain is:

`receive_one_task -> fetch_outputs_from_worker -> reap_task_from_worker -> cctools_list_remove`
Going through the implementation of these functions, we found one line that looks highly suspicious:

In `reap_task_from_worker`, we check `t->needs_library`, which is true. Then we write directly to `t->library_task->function_slots_inuse`. However, as discussed above, `t->library_task` had already been freed, and `t` continues to hold a dangling reference, assuming it still points to a valid `vine_task *`.
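For reference, the pattern in question looks roughly like the fragment below. This is a paraphrase built from the description above with cut-down stand-in types, not a copy of the cctools source; the exact guard and the exact write may differ.

```c
/* Cut-down stand-in type; the real struct vine_task is far larger. */
struct task_like {
	char *needs_library;             /* name of the library this task needs */
	struct task_like *library_task;  /* manager-side handle of that library */
	int function_slots_inuse;
};

/* Rough paraphrase of the suspect step in reap_task_from_worker(): the
 * only protection is a non-NULL pointer check, which a dangling
 * reference still satisfies after the library task has been deleted. */
static void release_library_slot(struct task_like *t)
{
	if (t->needs_library && t->library_task) {
		/* If library_task was freed and its address reused, this
		 * write lands inside whatever object now owns that memory. */
		t->library_task->function_slots_inuse--;
	}
}

int main(void)
{
	struct task_like lib = { .function_slots_inuse = 1 };
	struct task_like t = { .needs_library = "Dask-Library", .library_task = &lib };
	release_library_slot(&t);   /* safe here only because lib is still alive */
	return lib.function_slots_inuse;
}
```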
In reality, that memory address was reallocated and is now used by an item in `q->waiting_retrieval_list`, whose actual type is `struct list_item *`. So when we write to `t->library_task->function_slots_inuse`, we are mistakenly interpreting and modifying a `struct list_item` through the layout of a `vine_task`, which corrupts the memory: we are effectively writing garbage into an unrelated list item!
These are our hypotheses. To verify them, let's trigger the behavior directly by writing to `library_task->function_slots_inuse` in gdb, which should replicate the same effect as what happens in the code:

set ((struct vine_task *)$t->library_task)->function_slots_inuse = 9999
Then let's print the corrupted pointer `q->waiting_retrieval_list->head->list`:
(gdb) p q->waiting_retrieval_list->head->list
$79 = (struct list *) 0x2598d596fc1f0
Compared to the originally corrupted address `0x270fd596fc1f0` (shown in the third figure), the value has changed again. For comparison, I printed every other address I could find, and they all remained unchanged.
This confirms that writing to `t->library_task` alters the memory of a list item in `q->waiting_retrieval_list`, leading to a cascade of downstream segfaults.
This bug is very hard to track down. Right after the failing run, I immediately conducted another 20 runs in the same environment, aggressively killing workers, and none of them had issues. The bug only manifests under very rare and coincidental conditions:
- A `library_task` is freed while another task still holds a reference to it.
- The exact same address is reallocated by `malloc()` for a different purpose, as a `list_item`.
- The task `t` still writes to the now-invalid `t->library_task` as if it were a valid `vine_task *`.
- This write accidentally overlaps with the internal structure of the `list_item`. Ironically, it hits the `list` field every time, probably due to some memory layout or allocation rules (see the offset sketch after this list).
- The corrupted `list_item` is later passed to `list_remove()`, which dereferences invalid pointers and causes a crash.
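As for why the `list` field gets hit every time, a plausible explanation is plain offset arithmetic: the stale write always lands at a fixed byte offset from the freed address (the offset of `function_slots_inuse` inside `vine_task`), and because the allocator laid the list items out deterministically, the same member of the same nearby `list_item` sits at that spot on every run. The sketch below illustrates the idea in a deliberately simplified form, with invented layouts where the offsets coincide inside a single small struct; the real offsets and chunk spacing in cctools differ.

```c
#include <stdio.h>
#include <stddef.h>

/* Invented layouts chosen so the offsets line up; not the real structs. */
struct task_like {                  /* stands in for struct vine_task    */
	long pad[2];
	int function_slots_inuse;   /* the field the stale write targets */
};

struct item_like {                  /* stands in for struct list_item    */
	void *data;
	void *next;
	void *list;                 /* back-pointer later used by list_remove */
};

int main(void)
{
	printf("stale write lands at byte offset %zu of the freed chunk\n",
	       offsetof(struct task_like, function_slots_inuse));
	printf("item_like.data at %zu, .next at %zu, .list at %zu\n",
	       offsetof(struct item_like, data),
	       offsetof(struct item_like, next),
	       offsetof(struct item_like, list));

	/* With these made-up layouts the write offset (16) coincides with
	 * the 'list' member, so every stale write clobbers item->list,
	 * exactly the pointer that list_remove dereferences later. */
	return 0;
}
```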
I guess I didn't observe this failure during the paper effort because I had hacked `vine_task_delete` to return immediately without releasing any tasks.
To fix this, the simplest way might be to scan all tasks whenever a library task exits abnormally and set their `library_task` field to `NULL`. But this solution is likely too costly in practice.
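For completeness, here is a rough sketch of that idea, using a plain array as a stand-in for the manager's task bookkeeping. The real fix would have to hook into `vine_task_delete` (or wherever the library task is released) and walk the manager's actual structures, so the names and loop below are illustrative only.

```c
#include <stddef.h>

/* Minimal stand-in; not the real cctools type. */
struct task_like {
	struct task_like *library_task;
};

/* Sketch of the proposed (but potentially costly) fix: when a library
 * task is about to go away, walk every known task and clear any
 * reference to it, so later t->library_task != NULL checks no longer
 * succeed on a dangling pointer. */
static void clear_library_references(struct task_like **all_tasks, size_t n,
                                     const struct task_like *dying_library)
{
	for (size_t i = 0; i < n; i++) {
		if (all_tasks[i] && all_tasks[i]->library_task == dying_library) {
			all_tasks[i]->library_task = NULL;
		}
	}
}

int main(void)
{
	struct task_like lib = { 0 };
	struct task_like t = { .library_task = &lib };
	struct task_like *tasks[] = { &t };

	/* Before deleting/freeing the library task, drop all references. */
	clear_library_references(tasks, 1, &lib);
	return t.library_task != NULL;   /* 0: the dangling reference is gone */
}
```

An alternative that avoids the full scan might be to lean on the task's existing `refcount` (visible in the struct dump above) so the library task is not actually released while any function task still points at it, but whether that fits the current lifecycle is an open question.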
It also makes me wonder whether the segfaults I've posted recently might share the same root cause as this one.