Description
This is a race condition in which the `cache-update` message may reach the manager after the file has already been deleted (`unlink`) on the worker. As a result, the manager mistakenly believes the file is still available, which leads to a series of issues. For example:
- The replica table becomes inconsistent, with incorrect replica counts. This is especially disruptive under heavy file volumes and frequent file pruning.
- Tasks are scheduled to workers that do not have their inputs. When the inputs are found missing, the worker forsakes the tasks. Worse still, if that worker is the only recorded replica holder, the task is rescheduled and forsaken over and over again, resulting in a deadlock.
The following figure illustrates this symptom:
At T3, the following two things happen simultaneously:
- The manager unlinks a file from two sources and clears the replica table.
- One worker has just received that file and sent the `cache-update` message, but this message is still in transit or queued in the link.
Eventually, the two workers correctly remove the file upon receiving `unlink`, but the manager wrongly updates the replica table when it receives the outdated `cache-update`.
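The timeline above can be reproduced with a minimal, self-contained simulation. This is only a sketch of the message ordering, not the project's actual code: the worker names `w1` and `w2`, the file name `f`, the two queues, and the function `simulate_stale_cache_update` are all hypothetical and exist only for illustration.

```python
from collections import deque


def simulate_stale_cache_update():
    """Replay the T3 interleaving with two hypothetical workers and one file."""
    to_manager = deque()  # messages in flight from workers to the manager
    to_workers = deque()  # messages in flight from the manager to workers

    # Initial state (assumed): w1 holds "f", and w2 is still fetching it.
    worker_cache = {"w1": {"f"}, "w2": set()}
    replica_table = {"f": {"w1"}}

    # w2 finishes receiving "f" and reports it; the message is only queued.
    worker_cache["w2"].add("f")
    to_manager.append(("cache-update", "w2", "f"))

    # T3: before reading that message, the manager unlinks "f" from both
    # sources and clears its replica table entry.
    for worker in ("w1", "w2"):
        to_workers.append(("unlink", worker, "f"))
    replica_table.pop("f", None)

    # Both workers correctly remove the file upon receiving unlink.
    while to_workers:
        _, worker, filename = to_workers.popleft()
        worker_cache[worker].discard(filename)

    # The manager now processes the outdated cache-update and re-adds a
    # replica that no longer exists on any worker.
    while to_manager:
        _, worker, filename = to_manager.popleft()
        replica_table.setdefault(filename, set()).add(worker)

    return replica_table, worker_cache


if __name__ == "__main__":
    table, caches = simulate_stale_cache_update()
    print("replica table:", table)
    print("worker caches:", caches)
```

After the run, the replica table lists `w2` as a holder of `f` even though neither worker's cache contains it. A task that needs `f` would then be sent to `w2`, which is the scheduling problem described in the second bullet of the issue list.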