dcgm-exporter running on "g4dn.metal" in AWS EKS fails with "fatal: morestack on gsignal" · Issue #208 · NVIDIA/gpu-monitoring-tools · GitHub
This repository was archived by the owner on Nov 2, 2021. It is now read-only.
dcgm-exporter running on "g4dn.metal" in AWS EKS fails with "fatal: morestack on gsignal" #208
Open
@SQUIDwarrior

Description


We are running the "dcgm-exporter" Kubernetes DaemonSet on AWS EKS. Whenever we use a "g4dn.metal" EC2 instance, the exporter gets stuck in a crash loop with the following log output:

time="2021-08-13T20:07:08Z" level=info msg="Starting dcgm-exporter"
time="2021-08-13T20:07:09Z" level=info msg="DCGM successfully initialized!"
time="2021-08-13T20:07:27Z" level=info msg="Collecting DCP Metrics"
fatal: morestack on gsignal

This does not happen on any other G4DN instance type, only on the "metal" variant. The NVIDIA drivers are installed, and user code that uses the GPUs runs fine. Running "nvidia-smi" shows all 8 GPUs as expected. I have searched but cannot find any information on this error.
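
For anyone triaging similar crash loops, here is a minimal sketch (not part of dcgm-exporter) that scans a captured log and reports the last structured step before the process died, plus any unstructured lines such as the fatal message. The regular expression and the `summarize` helper are assumptions based only on the line format shown in the excerpt above.

```python
import re
from datetime import datetime

# Log excerpt copied verbatim from the report above.
LOG = '''time="2021-08-13T20:07:08Z" level=info msg="Starting dcgm-exporter"
time="2021-08-13T20:07:09Z" level=info msg="DCGM successfully initialized!"
time="2021-08-13T20:07:27Z" level=info msg="Collecting DCP Metrics"
fatal: morestack on gsignal
'''

# Assumed line format: time="..." level=... msg="..." (as seen in the excerpt).
LINE_RE = re.compile(
    r'time="(?P<time>[^"]+)" level=(?P<level>\w+) msg="(?P<msg>[^"]*)"'
)


def summarize(log_text):
    """Return (last structured message, its timestamp, unstructured lines)."""
    last_msg, last_time, unstructured = None, None, []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = LINE_RE.fullmatch(line)
        if match:
            # Remember the most recent structured entry.
            last_msg = match.group("msg")
            last_time = datetime.strptime(
                match.group("time"), "%Y-%m-%dT%H:%M:%SZ"
            )
        else:
            # Anything else (for example a runtime abort) is kept as-is.
            unstructured.append(line)
    return last_msg, last_time, unstructured


last_msg, last_time, unstructured = summarize(LOG)
print("last step before crash:", last_msg)
print("unstructured lines:", unstructured)
```

On the excerpt above, this points at the "Collecting DCP Metrics" step as the last thing logged before the fatal line.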
