cri: error out if CRI request has CDI devices but CDI is disabled. by klihub · Pull Request #11975 · containerd/containerd · GitHub

cri: error out if CRI request has CDI devices but CDI is disabled. #11975

Open
wants to merge 1 commit into main from fixes/main/check-and-reject-disabled-cdi-injection

Conversation

klihub
Member
@klihub klihub commented Jun 11, 2025

Currently, if a container creation request asks for CDI device injection using the dedicated CRI message field, the request is processed normally but with the CDI devices silently dropped. The in-line request usually indicates that a DRA driver requested injection of the devices. Silently failing to honour the request can result in hard-to-understand, hard-to-debug problems later if the container does manage to start up.

This patch addresses this by explicitly checking and rejecting container creation if devices were requested but CDI support is disabled.
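
For illustration, a minimal sketch of the kind of guard described above. This is not the actual diff: the helper name, its signature, and its call site are assumptions, although the `enable_cdi` config flag and the `GetCDIDevices` getter do exist in the containerd CRI config and the Kubernetes CRI API.

```go
package cri

import (
	"fmt"

	runtime "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// validateCDIDevices rejects a CreateContainer request that carries CDI
// devices in the dedicated CRI field while CDI support is disabled, instead
// of silently dropping the devices.
func validateCDIDevices(enableCDI bool, config *runtime.ContainerConfig) error {
	if devices := config.GetCDIDevices(); !enableCDI && len(devices) > 0 {
		return fmt.Errorf("%d CDI devices requested but CDI support is disabled in the CRI config", len(devices))
	}
	return nil
}
```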

@github-project-automation github-project-automation bot moved this to Needs Triage in Pull Request Review Jun 11, 2025
@klihub klihub requested a review from mikebrow June 11, 2025 12:29
@dosubot dosubot bot added the area/cri Container Runtime Interface (CRI) label Jun 11, 2025
@klihub
Member Author
klihub commented Jun 11, 2025

/retest

Member
@mikebrow mikebrow left a comment


ouch

the default for enable_cdi (CDI devices) is enabled (true)...

I agree we should not be silent about the failure to call CDI setup when it is enabled and requested, but we may need to allow the container create even if it is configured with CDI devices via annotation or the CDIDevices field... otherwise rejecting the container, at this point in time, could possibly cause regressions.

See additional clarification suggestion.

Perhaps a new config flag EnableRejectingContainerCreateForDisabledCDI (default false)

@mikebrow
Member

Another thing to consider is deprecating the option to disable cdi

@klihub
Member Author
klihub commented Jun 12, 2025

ouch

the default for enable_cdi (CDI devices) is enabled (true)...

Yes, and TBH that is partly why it took me a while to realize what went wrong. I assumed it to be enabled (so I didn't check), and since containers with DRA-injected CDI devices (seemingly) started up normally, it didn't occur to me to suspect that CDI was disabled.

I agree we should not be silent about the failure to call CDI setup when it is enabled and requested, but we may need to allow the container create even if it is configured with CDI devices via annotation or the CDIDevices field...

This patch only prevents the container from starting when device injection was requested using the dedicated CRI protocol field, IOW when it is very explicit that a DRA driver is active.

otherwise rejecting the container, at this point in time, could possibly cause regressions.

But isn't this behavioral change likely to cause regressions only if the former behavior was either deliberate, or accidental but well-known and useful enough that some people started to use it as a feature and rely on it? Here it's clearly not the first case, and it is hard to see any usefulness that could lead to the second.

See additional clarification suggestion.

Perhaps a new config flag EnableRejectingContainerCreateForDisabledCDI (default false)

Would this really be helpful then? I mean, to prevent the situation I ran into, you would need to alter the configuration to ask containerd to fail container creation if you accidentally have CDI disabled somewhere. But why would anybody change the configuration like that, when they could just enable CDI instead?

@mikebrow
Member
mikebrow commented Jun 12, 2025

my scenario is PyTorch....

If a PyTorch-enabled application finds the GPU device, it will use it; if not, it will fall back to CPU. In this scenario, let's say I have a pod spec with a CDI device asking all hosts where the pod is running to mount the GPU accelerator device. In this context I want to run inference with CPU or GPU on each host. Some hosts may have CDI disabled, some may have it explicitly enabled or enabled by default. Further, some hosts may be on various versions of containerd where CDI was not originally enabled by default. In other words, there is flexibility at the hosts, in both the container runtimes and the PyTorch applications, to run successfully whether or not CDI injection succeeds on each host.

Now, with this change, many of those hosts will start failing: those where the config has CDI disabled and the pod spec has a CDI device.

@klihub
Member Author
klihub commented Jun 12, 2025

my scenario is PyTorch....

If a PyTorch-enabled application finds the GPU device, it will use it; if not, it will fall back to CPU. In this scenario, let's say I have a pod spec with a CDI device asking all hosts where the pod is running to mount the GPU accelerator device. In this context I want to run inference with CPU or GPU on each host. Some hosts may have CDI disabled, some may have it explicitly enabled or enabled by default. Further, some hosts may be on various versions of containerd where CDI was not originally enabled by default. In other words, there is flexibility at the hosts, in both the container runtimes and the PyTorch applications, to run successfully whether or not CDI injection succeeds on each host.

Now, with this change, many of those hosts will start failing: those where the config has CDI disabled and the pod spec has a CDI device.

But do you really run a GPU DRA plugin in such a cluster? And on all cluster nodes, with the plugin oblivious to the fact that some hosts do not support CDI, still allocating GPUs and seemingly doing GPU injection at the CRI protocol level, with the whole cluster thinking that everything works just fine? And in reality everything only "works" because PyTorch falls back to CPU-based inference on the hosts where CDI is disabled and DRA-based allocation silently fails?

If that is really the case, or if for some other reason this PR approach is deemed to be a definite no-go, then I think the only alternative we have left is updating this PR to give a big fat warning without erroring out.

@mikebrow
Member

my scenario is PyTorch....
If a PyTorch-enabled application finds the GPU device, it will use it; if not, it will fall back to CPU. In this scenario, let's say I have a pod spec with a CDI device asking all hosts where the pod is running to mount the GPU accelerator device. In this context I want to run inference with CPU or GPU on each host. Some hosts may have CDI disabled, some may have it explicitly enabled or enabled by default. Further, some hosts may be on various versions of containerd where CDI was not originally enabled by default. In other words, there is flexibility at the hosts, in both the container runtimes and the PyTorch applications, to run successfully whether or not CDI injection succeeds on each host.
Now, with this change, many of those hosts will start failing: those where the config has CDI disabled and the pod spec has a CDI device.

But do you really run a GPU DRA plugin in such a cluster? And on all cluster nodes, with the plugin oblivious to the fact that some hosts do not support CDI, still allocating GPUs and seemingly doing GPU injection at the CRI protocol level, with the whole cluster thinking that everything works just fine? And in reality everything only "works" because PyTorch falls back to CPU-based inference on the hosts where CDI is disabled and DRA-based allocation silently fails?

If that is really the case, or if for some other reason this PR approach is deemed to be a definite no-go, then I think the only alternative we have left is updating this PR to give a big fat warning without erroring out.

Is there a way to know at this level whether the CDI request is a DRA allocation vs. the other uses of CDI and/or the CDI annotation experiments?

At any rate it will be a behavior change, arguably for the better, on the off chance someone has been running with CDI disabled in the CRI config: pods that ran before, and arguably should have failed, now won't run due to the misconfiguration. Yes, a misconfiguration made worse now by the new DRA cases. We should have deprecated the config switch when it was introduced and removed it when CDI became official (no longer an experimental annotation).

Our CRI config deprecation policy https://github.com/containerd/containerd/blob/main/docs/cri/config.md#deprecation follows https://kubernetes.io/docs/reference/using-api/deprecation-policy/#deprecating-a-flag-or-cli

from the config document for the v1.x details:

  # enable_cdi enables support of the Container Device Interface (CDI)
  # For more details about CDI and the syntax of CDI Spec files please refer to
  # https://tags.cncf.io/container-device-interface.
  # TODO: Deprecate this option when either Dynamic Resource Allocation(DRA)
  # or CDI support for the Device Plugins are graduated to GA.
  # `Dynamic Resource Allocation` KEP:
  # https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation
  # `Add CDI devices to device plugin API` KEP:
  # https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4009-add-cdi-devices-to-device-plugin-api
  enable_cdi = true

@klihub
Member Author
klihub commented Jun 14, 2025

Is there a way to know at this level whether the CDI request is a DRA allocation vs. the other uses of CDI and/or the CDI annotation experiments?

I think that the injection request coming in the dedicated CRI CDIDevices field is an indirect indication that it was requested by a DRA driver (or else someone is playing around with crictl on the node). That's why this patch proposes to error out only in that case, and not if the request comes only via annotations.

At any rate it will be a behavior change, arguably for the better, on the off chance someone has been running with CDI disabled in the CRI config: pods that ran before, and arguably should have failed, now won't run due to the misconfiguration. Yes, a misconfiguration made worse now by the new DRA cases. We should have deprecated the config switch when it was introduced and removed it when CDI became official (no longer an experimental annotation).

Our CRI config deprecation policy https://github.com/containerd/containerd/blob/main/docs/cri/config.md#deprecation follows https://kubernetes.io/docs/reference/using-api/deprecation-policy/#deprecating-a-flag-or-cli

from the config document for the v1.x details:

  # enable_cdi enables support of the Container Device Interface (CDI)
  # For more details about CDI and the syntax of CDI Spec files please refer to
  # https://tags.cncf.io/container-device-interface.
  # TODO: Deprecate this option when either Dynamic Resource Allocation(DRA)
  # or CDI support for the Device Plugins are graduated to GA.
  # `Dynamic Resource Allocation` KEP:
  # https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation
  # `Add CDI devices to device plugin API` KEP:
  # https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4009-add-cdi-devices-to-device-plugin-api
  enable_cdi = true

Yes that makes sense once DRA is GA.

@mikebrow But what's the verdict for this patch? I can update it to only give a warning instead of erroring out for driver-requested injections with CDI off, if you think that is the right thing to do. Just let me know which way we should go.
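
To illustrate the distinction drawn above between the dedicated CDIDevices field and the older annotation-based mechanism, here is a hedged sketch of treating the two paths differently. The cdi.k8s.io/ prefix matches the usual CDI annotation convention, but the helper names and the exact policy split are assumptions, not the PR's code.

```go
package cri

import (
	"fmt"
	"strings"

	runtime "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// hasCDIAnnotations reports whether the request uses the older,
// annotation-based CDI mechanism ("cdi.k8s.io/..." keys).
func hasCDIAnnotations(config *runtime.ContainerConfig) bool {
	for key := range config.GetAnnotations() {
		if strings.HasPrefix(key, "cdi.k8s.io/") {
			return true
		}
	}
	return false
}

// checkCDIRequest errors out only for the dedicated CDIDevices field (the
// path a DRA driver uses). Annotation-only requests (hasCDIAnnotations) are
// left alone, preserving the existing behavior for non-DRA uses of CDI.
func checkCDIRequest(enableCDI bool, config *runtime.ContainerConfig) error {
	if !enableCDI && len(config.GetCDIDevices()) > 0 {
		return fmt.Errorf("CDI devices requested via the CRI CDIDevices field but CDI is disabled")
	}
	return nil
}
```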

Currently if a container creation requests CDI device injection
using the dedicated CRI message field, the request is processed
normally but with the CDI devices silently dropped. The in-line
request usually indicates that a DRA driver requested injection
of the devices. Silent failure to honour the request can result
in hard to understand/debug problems later if the container can
start up.

This patch addresses this by explicitly checking and rejecting
container creation if devices were requested but CDI support is
disabled.

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
@klihub klihub force-pushed the fixes/main/check-and-reject-disabled-cdi-injection branch from 1656165 to 8009d91 on June 15, 2025 10:19
@klihub klihub requested a review from mikebrow June 16, 2025 13:30
@mikebrow
Member
mikebrow commented Jun 17, 2025

Is there a way to know at this level whether the CDI request is a DRA allocation vs. the other uses of CDI and/or the CDI annotation experiments?

I think that the injection request coming in the dedicated CRI CDIDevices field is an indirect indication that it was requested by a DRA driver (or else someone is playing around with crictl on the node). That's why this patch proposes to error out only in that case, and not if the request comes only via annotations.

At any rate it will be a behavior change, arguably for the better, on the off chance someone has been running with CDI disabled in the CRI config: pods that ran before, and arguably should have failed, now won't run due to the misconfiguration. Yes, a misconfiguration made worse now by the new DRA cases. We should have deprecated the config switch when it was introduced and removed it when CDI became official (no longer an experimental annotation).
Our CRI config deprecation policy https://github.com/containerd/containerd/blob/main/docs/cri/config.md#deprecation follows https://kubernetes.io/docs/reference/using-api/deprecation-policy/#deprecating-a-flag-or-cli
from the config document for the v1.x details:

  # enable_cdi enables support of the Container Device Interface (CDI)
  # For more details about CDI and the syntax of CDI Spec files please refer to
  # https://tags.cncf.io/container-device-interface.
  # TODO: Deprecate this option when either Dynamic Resource Allocation(DRA)
  # or CDI support for the Device Plugins are graduated to GA.
  # `Dynamic Resource Allocation` KEP:
  # https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation
  # `Add CDI devices to device plugin API` KEP:
  # https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4009-add-cdi-devices-to-device-plugin-api
  enable_cdi = true

Yes that makes sense once DRA is GA.

@mikebrow But what's the verdict for this patch? I can update it to only give a warning instead of erroring out for driver-requested injections with CDI off, if you think that is the right thing to do. Just let me know which way we should go.

Go ahead and deprecate it for 2.2... let's make this a warning in 2.2/2.1, announcing the 2.2 deprecation.

Do we not migrate this field? Wondering how you ended up with enable_cdi = false in your config... IOW, is this a pervasive problem where we added a new field with a default of true but did not migrate existing configs from "field does not exist" to "field exists and is set to true"?
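
A hedged sketch of the warning-only variant discussed here, using containerd's log package; the exact wording, field names, and placement are assumptions rather than the PR's final code.

```go
package cri

import (
	"context"

	"github.com/containerd/log"
	runtime "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// warnDisabledCDI logs (rather than errors out) when CDI devices are
// requested while CDI support is disabled, and points at the planned
// deprecation of the enable_cdi switch.
func warnDisabledCDI(ctx context.Context, enableCDI bool, config *runtime.ContainerConfig) {
	if enableCDI || len(config.GetCDIDevices()) == 0 {
		return
	}
	log.G(ctx).WithField("cdi-devices", len(config.GetCDIDevices())).
		Warn("CDI devices requested but CDI support is disabled; the devices will not be injected " +
			"(disabling CDI via enable_cdi = false is planned for deprecation)")
}
```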

Labels
area/cri Container Runtime Interface (CRI) size/XS
Projects
Status: Needs Triage
3 participants