Description
NOTE: Remember to label this issue with "ci: sev"
See https://www.githubstatus.com/incidents/d9xd9k1j6sl0
At the time of creating this issue, I think the effects of the incident are no longer happening, but I'm creating it anyway in case we need follow-ups.
Current Status
Status could be: preemptive, ongoing, mitigated, closed. Also tell people if they need to take action to fix it (e.g. rebase).
closed?
Error looks like
https://github.com/pytorch/pytorch/actions/runs/15618165793/job/43996868331
GH jobs failing at checkout step
Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/python-peachpy'...
Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/eigen'...
Error: fatal: remote error: GitLab is currently unable to handle this request due to load (ID 01JXJQ2JT2E5G1N9QH5WW0NGSE).
Error: fatal: clone of 'https://gitlab.com/libeigen/eigen.git' into submodule path '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/eigen' failed
Failed to clone 'third_party/eigen' a second time, aborting
Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto'...
Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/flatbuffers'...
Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe'...
Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/sleef'...
Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/pybind11'...
Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm'...
Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/cutlass'...
Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/onnx'...
Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/composable_kernel'...
Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp'...
Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/nlohmann'...
Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/protobuf'...
Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/XNNPACK'...
Error: The process '/usr/bin/git' failed with exit code 1
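For triage, a small helper along these lines could pull the failing submodule and the remote's message out of a checkout log like the one above. This is only a sketch and not part of PyTorch CI: the function name `find_failed_submodule_clones` and both regular expressions are assumptions, written against the two `fatal:` lines quoted in this issue, and other git versions or hosts may word these messages differently.

```python
import re

# Assumed patterns, modeled only on the two "fatal:" lines in the log above.
REMOTE_ERROR = re.compile(r"fatal: remote error: (?P<message>.+)")
CLONE_FAILED = re.compile(
    r"fatal: clone of '(?P<url>[^']+)' into submodule path '(?P<path>[^']+)' failed"
)


def find_failed_submodule_clones(log_text):
    """Return one dict per failed submodule clone found in log_text.

    Each dict holds the submodule URL, its checkout path, and the most
    recent remote error message seen before the failure (or None).
    """
    failures = []
    last_remote_error = None
    for line in log_text.splitlines():
        remote = REMOTE_ERROR.search(line)
        if remote:
            # Remember the remote's message; git prints it before the clone failure.
            last_remote_error = remote.group("message").strip()
            continue
        failed = CLONE_FAILED.search(line)
        if failed:
            failures.append(
                {
                    "url": failed.group("url"),
                    "path": failed.group("path"),
                    "remote_error": last_remote_error,
                }
            )
            last_remote_error = None
    return failures


if __name__ == "__main__":
    import sys

    # Usage: python triage.py < job_log.txt  (the file name is hypothetical)
    for failure in find_failed_submodule_clones(sys.stdin.read()):
        print(failure["path"], failure["url"], failure["remote_error"], sep="\t")
```

Run against the log in this issue, it would report the `third_party/eigen` submodule, whose clone URL is on gitlab.com, together with the "unable to handle this request due to load" message.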
Incident timeline (all times Pacific)
Include when the incident began, when it was detected, mitigated, root caused, and finally closed.
Reported by nikita at 11:36 AM
Not sure which GH job failed first, but see HUD around this commit:
https://hud.pytorch.org/hud/pytorch/pytorch/7986c0dba6e1044d90b7f607f9cca15922339bb4/1?per_page=100&mergeEphemeralLF=true
User impact
How does this affect users of PyTorch CI?
Jobs fail at the checkout step while cloning submodules.
Root cause
What was the root cause of this issue?
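GH incident (see the status link in the Description). Note that the error in the log above was returned by gitlab.com while cloning the third_party/eigen submodule.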
Mitigation
How did we mitigate the issue?
Prevention/followups
How do we prevent issues like this in the future?