This repository was archived by the owner on Nov 2, 2021. It is now read-only.
This repository was archived by the owner on Nov 2, 2021. It is now read-only.
Open
Description
We are running the "dcgm-exporter" Kubernetes DaemonsetSet on AWS EKS, and whenever we use a "g4dn.metal" EC2 instance, the "dcgm-exporter" gets stuck in a crashloop with the following log message:
time="2021-08-13T20:07:08Z" level=info msg="Starting dcgm-exporter"
time="2021-08-13T20:07:09Z" level=info msg="DCGM successfully initialized!"
time="2021-08-13T20:07:27Z" level=info msg="Collecting DCP Metrics"
fatal: morestack on gsignal
This does not happen on any other G4DN class of machine, only with the "metal" variant. The NVIDIA drivers are installed and user code utilizing the GPUs is running fine. Using "nvidia-smi" results shows all 8 GPUs as expected. I have done searching and I cannot find any information on this.
Metadata
Metadata
Assignees
Labels
No labels