cri: error out if CRI request has CDI devices but CDI is disabled. #11975
Conversation
/retest
ouch
The default for `enable_cdi` is enabled (true)...
I agree we should not be silent about the failure to call CDI setup when it is enabled and requested, but we may need to allow the container create even if it is configured with CDI devices via annotation or the CDIDevices field; otherwise rejecting the container, at this point in time, could possibly cause regressions.
See additional clarification suggestion.
Perhaps a new config flag EnableRejectingContainerCreateForDisabledCDI (default false)
Another thing to consider is deprecating the option to disable CDI.
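For context, a sketch of the relevant part of a containerd 1.7.x CRI plugin config; `enable_cdi` and `cdi_spec_dirs` are real options, while the reject-on-disabled flag is only the hypothetical one suggested above, not an actual containerd option:

```toml
# Sketch of a containerd 1.7.x config.toml fragment (assumed section name
# from the 1.x CRI plugin; defaults shown are the documented ones).
[plugins."io.containerd.grpc.v1.cri"]
  enable_cdi = true                            # default is true
  cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
  # Hypothetical flag proposed in this thread, commented out because
  # it does not exist in containerd:
  # enable_rejecting_container_create_for_disabled_cdi = false
```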
Yes, and TBH that is partly why it took me a while to realize what went wrong. I assumed it to be enabled (so didn't check), and since containers with DRA-injected CDI devices (seemingly) started up normally, I didn't realize to suspect that CDI was disabled.
This patch only prevents the container from starting when device injection was requested using the dedicated CRI protocol field, IOW when it is very explicit that a DRA driver is active.
But isn't this behavioral change possibly causing regressions only if the former behavior was either deliberate or accidental but well known and useful enough that some people started to use it as a feature and rely on it. Here, it's not the first case, and it is hard to see any usefulness potentially leading to the second.
Would this really be helpful then? I mean, to prevent the situation I ran into, you would need to alter the configuration, asking containerd to error out container creation if you accidentally have CDI disabled somewhere. But why would anybody change the configuration like that, when instead they could just enable CDI while at it?
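The check this patch adds could be sketched roughly as follows; the type and function names here are illustrative stand-ins, not the actual containerd code:

```go
package main

import (
	"errors"
	"fmt"
)

// CDIDevice stands in for an entry of the dedicated CRI
// ContainerConfig field carrying requested CDI devices.
type CDIDevice struct {
	Name string
}

// errCDIDisabled is a hypothetical sentinel error for the new behavior.
var errCDIDisabled = errors.New("CDI support is disabled, but CDI devices were requested")

// validateCDIDevices rejects container creation when the CRI request
// carries CDI devices but the runtime has CDI disabled, instead of
// silently dropping the devices.
func validateCDIDevices(enableCDI bool, devices []CDIDevice) error {
	if len(devices) > 0 && !enableCDI {
		return fmt.Errorf("%w: %d device(s) requested", errCDIDisabled, len(devices))
	}
	return nil // either CDI is enabled or nothing was requested
}

func main() {
	// DRA-style request with CDI disabled: rejected instead of silently dropped.
	err := validateCDIDevices(false, []CDIDevice{{Name: "vendor.com/gpu=0"}})
	fmt.Println(err != nil) // prints true
}
```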
My scenario is PyTorch: if a PyTorch-enabled application finds the GPU device it will use it; if not, it will fall back to CPU. In this scenario, say I have a pod spec with a CDI device asking all hosts where the pod is running to mount the GPU accelerator device, and I want to run inference with CPU or GPU on each host. Some hosts may have CDI disabled, some may have it explicitly enabled or enabled by default. Further, some hosts may be on versions of containerd where CDI was not originally enabled by default. In other words, there is flexibility at the hosts, in both the container runtimes and the PyTorch applications, for running successfully with and without CDI success on each host. Now, with this change, many of those hosts will start failing: wherever the config has CDI disabled and the pod spec has a CDI device.
But do you really run a GPU DRA plugin in such a cluster? And on all cluster nodes, with the plugin being oblivious to the fact that some hosts do not support CDI, still allocating GPUs and seemingly doing GPU injection at the CRI protocol level, with the whole cluster thinking that everything works just fine? And in reality everything works only because PyTorch falls back to CPU-based inference on the hosts where CDI is disabled and DRA-based allocation silently fails? If that is really the case, or if for some other reason this PR's approach is deemed a definite no-go, then I think the only alternative we have left is updating this PR to give a big fat warning without erroring out.
Is there a way to know at this level if the CDI request is a DRA allocation vs. the other uses of CDI and/or the CDI annotation experiments? At any rate it will be a behavior change, arguably for the better, on the off chance someone has been running with CDI disabled in the CRI config: pods that ran before, and arguably should have failed, now won't run due to the misconfiguration. Yes, a misconfiguration now made worse by the new DRA cases. We should have deprecated the config switch when it was introduced and removed it when CDI became official (no longer an experimental annotation). Our CRI config deprecation policy https://github.com/containerd/containerd/blob/main/docs/cri/config.md#deprecation follows https://kubernetes.io/docs/reference/using-api/deprecation-policy/#deprecating-a-flag-or-cli; from the config document for the v1.x details:
I think that the injection request coming in the dedicated CRI
Yes, that makes sense once DRA is GA. @mikebrow But what's the verdict for this patch? I can update it to only give a warning instead of erroring out for driver-requested injections with CDI off, if you still think that is the right thing to do? Just let me know which way we should go.
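The warning-only alternative discussed above could look roughly like this; again, the helper and its names are hypothetical, not the actual containerd code:

```go
package main

import (
	"fmt"
	"log"
)

// CDIDevice stands in for an entry of the dedicated CRI field.
type CDIDevice struct {
	Name string
}

// filterCDIDevices is a hypothetical warn-instead-of-reject variant:
// with CDI disabled, requested devices are still dropped, but loudly,
// so container creation proceeds while the log records the failure.
func filterCDIDevices(enableCDI bool, devices []CDIDevice) []CDIDevice {
	if len(devices) > 0 && !enableCDI {
		log.Printf("warning: CDI is disabled; dropping %d requested CDI device(s)", len(devices))
		return nil
	}
	return devices
}

func main() {
	kept := filterCDIDevices(false, []CDIDevice{{Name: "vendor.com/gpu=0"}})
	fmt.Println(len(kept)) // prints 0: devices dropped with a warning, creation proceeds
}
```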
Currently, if a container creation request asks for CDI device injection using the dedicated CRI message field, the request is processed normally but with the CDI devices silently dropped. The in-line request usually indicates that a DRA driver requested injection of the devices. Silently failing to honour the request can result in hard-to-understand/debug problems later, since the container can still start up. This patch addresses this by explicitly checking and rejecting container creation if CDI devices were requested but CDI support is disabled. Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
force-pushed from 1656165 to 8009d91
Go ahead and deprecate it for 2.2; let's make this a warning in 2.2/2.1 announcing the 2.2 deprecation. Do we not migrate this field? Wondering how you got `enable_cdi = false` set in your config. IOW, is this a pervasive problem where we added a `= true` default for a new field and did not migrate configs from the field not existing to the new field existing with true set?