Support scheduling tolerating workloads on NotReady Nodes #45717
Comments
/kind feature
@luxas The title is confusing. If a node is not reachable at all, why should the scheduler schedule any pods on it, and even if it does, how does the node know that it should run the pod? I can see how this may work if the node is not ready for reasons other than reachability, though.
+1 to this, some customers want to use
Similar question to @bsalamat's: is there any condition to identify "network" issues? If so, it would be great to have such a feature.
+1 for this feature. For bare-metal nodes, you will probably use host networking for performance reasons. You would not want to cripple it with overlay networking.
@bsalamat @k82cn Sorry for the confusion, I was in a hurry when I wrote the issue description initially. This is not about connectivity; full connectivity is assumed at all times. This is about nodes that have some known "problem", e.g. NetworkNotReady. That condition means that the node is reachable over the network but no CNI network (e.g. kubenet, Weave or Calico) is installed. However, currently, when one of the five-ish node conditions that exist is falsy, the Node becomes unschedulable due to the NodeNotReady condition. If one sub-condition is falsy (e.g. NetworkNotReady is true), the full node condition goes falsy as well. We should be able to do more fine-grained scheduling than that. In the CNI networking case you might want to run some workloads (of type hostNetwork=true, obviously) on nodes with no CNI network installed, for bootstrapping or other reasons. One possible solution (there are quite a few):
Example of the current state:
Here is the code that probably should be changed in some way (
cc @kubernetes/sig-cluster-lifecycle-bugs @kubernetes/sig-cluster-lifecycle-misc
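To make the proposal concrete, here is a minimal sketch of a hostNetwork bootstrap pod that explicitly tolerates a per-condition "network not ready" taint. The taint key, pod name, and image are hypothetical placeholders, not an existing API:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical per-condition taint key; the real key would be whatever
	// the eventual implementation defines.
	const networkNotReadyTaint = "example.kubernetes.io/network-not-ready"

	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "cni-bootstrap"},
		Spec: corev1.PodSpec{
			// The workload does not need a CNI network.
			HostNetwork: true,
			// Explicitly tolerate the (hypothetical) taint so the scheduler
			// would still consider nodes whose network is not ready.
			Tolerations: []corev1.Toleration{{
				Key:      networkNotReadyTaint,
				Operator: corev1.TolerationOpExists,
				Effect:   corev1.TaintEffectNoSchedule,
			}},
			Containers: []corev1.Container{{
				Name:  "install-cni",
				Image: "example.com/cni-installer:latest", // placeholder image
			}},
		},
	}
	fmt.Printf("tolerations: %+v\n", pod.Spec.Tolerations)
}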
Currently we could work around this by using DaemonSets, as they aren't using the default scheduler: #42002
xref: #42001
Indeed someone is already working on this (#42406). Feel free to comment on that PR. I think there are some issues with it, for example it appears to be using one taint for everything instead of one taint per condition. I haven't had a chance to look at it.
We should not hide behavior like this behind flags. If a pod wants to schedule on a node with a NetworkNotReady taint, it should have an explicit toleration for it.
Just went through #42406; it handles this case with a single taint for everything, as davidopp said. I appended a comment in #42406 about this case. @luxas, I think we should not mark
Slacked with luxas: if we do not mark
That being said, +1 to having separate Taints for each Condition.
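For illustration, a rough sketch of what "one taint per condition" could mean on the Node side; the taint keys below are made-up placeholders, not keys that any PR actually registers:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// One taint per unhealthy condition instead of a single catch-all
	// "not ready" taint, so pods can tolerate exactly the problems they
	// can live with.
	node := corev1.Node{
		ObjectMeta: metav1.ObjectMeta{Name: "node-1"},
		Spec: corev1.NodeSpec{
			Taints: []corev1.Taint{
				{Key: "example.kubernetes.io/network-unavailable", Effect: corev1.TaintEffectNoSchedule},
				{Key: "example.kubernetes.io/disk-pressure", Effect: corev1.TaintEffectNoSchedule},
			},
		},
	}
	fmt.Printf("taints: %+v\n", node.Spec.Taints)
}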
The hostNetwork=true example can be covered with kubenet. Kubenet is not an official CNI plugin, but it works for the purpose that you mentioned. Does the noop plugin not work anymore -
This issue (and #44445) led @gmarek and me to realize that there is not a good shared understanding and agreement of what the actual and desired behavior is for node conditions (and for the taints that we want to replace them with for scheduling purposes) in general. Before we try to do anything, we need to take a step back and get this understanding. @gmarek and I started talking through all of the scenarios and it turns out that it's incredibly complicated. Anyway, @gmarek filled in a spreadsheet which people should take a look at and add comments/corrections to.
Sure - the doc is here
IMO there is a conflation of "Conditions" with an implicit state vs. having an explicit state machine. This leads to the glue logic that exists in the kubelet, and if we follow this track it would make its way into the scheduler.
@davidopp I meant that the feature to be able to take NotReady nodes into account could be added behind a feature flag for the scheduler, possibly already in v1.7?
Great @gmarek, I think that's step number one
@gmarek I thought that
@timothysc I'm not sure I understand how this glue logic would make its way into the scheduler. There wouldn't really be any new code added to the scheduler; the only proposal (from me, at least) is to allow taking
@luxas - my understanding is different. I think that all conditions are independent, and
I'm bumping the priority down now that I understand the context of where kubeadm self-hosting is at.
Yeah, I think we need to fix this behavior before we change the DaemonSet controller to use the default scheduler. Various CNI network providers (Calico, Flannel, Weave) rely on DaemonSets being able to schedule hostNetwork pods even when the node's network is not ready. Another use case for this that we hear a lot is the ability to deploy networking via Helm. Helm requires the Tiller pod to run on the cluster before being able to deploy any charts. Given the current behavior, this makes it impossible to deploy network DaemonSets via Helm.
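A sketch of the kind of CNI DaemonSet pod template being described; the name, image, and blanket toleration are illustrative assumptions, not any provider's actual manifest:

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	labels := map[string]string{"k8s-app": "example-cni"}
	ds := appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{Name: "example-cni", Namespace: "kube-system"},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					// CNI agents use the host network, since they are what
					// brings the pod network up in the first place.
					HostNetwork: true,
					// An empty key with Exists tolerates every taint, so the
					// pods can land on nodes that are still NotReady.
					Tolerations: []corev1.Toleration{{Operator: corev1.TolerationOpExists}},
					Containers: []corev1.Container{{
						Name:  "cni-agent",
						Image: "example.com/cni-agent:latest", // placeholder image
					}},
				},
			},
		},
	}
	fmt.Println("daemonset:", ds.Name)
}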
Ah, this is even worse than I thought...
See: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_node_status.go#L658

var newNodeReadyCondition v1.NodeCondition
rs := append(kl.runtimeState.runtimeErrors(), kl.runtimeState.networkErrors()...)
if len(rs) == 0 {
newNodeReadyCondition = v1.NodeCondition{
Type: v1.NodeReady,
Status: v1.ConditionTrue,
Reason: "KubeletReady",
Message: "kubelet is posting ready status",
LastHeartbeatTime: currentTime,
}
} else {
newNodeReadyCondition = v1.NodeCondition{
Type: v1.NodeReady,
Status: v1.ConditionFalse,
Reason: "KubeletNotReady",
Message: strings.Join(rs, ","),
LastHeartbeatTime: currentTime,
}
}

I took for granted, without even looking deeper at it, that NetworkReady actually was a Kubernetes Node condition, but it isn't. It's a CRI condition, which can't be used for scheduling. I really think the NetworkPluginNotReady CRI condition should be reported as a Node condition as well, instead of being baked into the general KubeletNotReady condition. Right now we lack granularity in all kinds of ways. And we can't remove the NetworkPluginNotReady -> KubeletNotReady behavior anymore, since that's GA functionality now.
PTAL @kubernetes/sig-node-bugs. Is there any specific reason the CRI condition reporting was designed as-is?
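A sketch of what surfacing the network status as its own Node condition could look like, reusing the existing NetworkUnavailable condition type; the Reason and Message strings are illustrative, not what the kubelet actually reports:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	now := metav1.Now()
	// Report the CNI problem as its own condition instead of folding it
	// into the overall Ready condition.
	netCond := corev1.NodeCondition{
		Type:              corev1.NodeNetworkUnavailable,
		Status:            corev1.ConditionTrue,
		Reason:            "NetworkPluginNotReady", // illustrative reason
		Message:           "CNI plugin not initialized",
		LastHeartbeatTime: now,
	}
	// The node could then stay Ready for scheduling purposes (e.g. for
	// hostNetwork pods) while still advertising the network problem.
	readyCond := corev1.NodeCondition{
		Type:              corev1.NodeReady,
		Status:            corev1.ConditionTrue,
		Reason:            "KubeletReady",
		LastHeartbeatTime: now,
	}
	fmt.Printf("%+v\n%+v\n", netCond, readyCond)
}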
Reporting unready network plugins as part of the node Ready condition predates CRI. In fact, the logic was there at least on or before Kubernetes 1.2, before taints/tolerations were introduced. I do think we should clarify/re-define the conditions and their interactions with the control plane.
I think this is bad, but I don't think it's release-blocker bad. I do agree it's feature-blocker bad for both "self-hosted" and daemonsets->scheduler. Does anyone have cycles to investigate?
@timothysc Totally agree. Won't affect the release since the behavior has existed for a couple of releases already.
I might have some time, let's see how things turn out
@yujuhong Where should we have this conversation?
@luxas @yujuhong - there's work already started in #42406 that I didn't have time to properly drive. The first step of cleaning up all this mess was gathering the data I shared in the spreadsheet. The next step is to decide what semantics we want to have (and which behavior we want to break), which I planned to do after the freeze (and after I dig myself out of the stuff that accumulates during it).
cc @davidopp
Any update on this? I just ran into this issue again.
There's an alpha feature in 1.9 (TaintNodesByCondition), please try it :).
I think we can close this as it's implemented in alpha; the feature tracking state is here: kubernetes/enhancements#382
What was the final result of this? I could not find documentation that said, "here is how you schedule a ...". I think the net result is that you:
But I am unsure. Is that correct? How do I enable them, and what is the correct toleration? Is there an example?
@deitch I am also interested, as I would like to deploy Calico via a Helm chart, which would require this functionality.
@swade1987 It's already been implemented; the taint key is:
Last I looked, running Tiller as a StatefulSet/Deployment with hostNetwork and the toleration still didn't let it schedule. Maybe that has changed, though?
For the record, this has worked for me with today's master (pre-1.11):
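The manifest itself is not preserved in this thread; below is a minimal sketch of the kind of spec being described, written as Go structs rather than the commenter's original YAML, and assuming the per-condition taint keys node.kubernetes.io/network-unavailable and node.kubernetes.io/not-ready added by TaintNodesByCondition:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "network-bootstrap", Namespace: "kube-system"},
		Spec: corev1.PodSpec{
			HostNetwork: true,
			Tolerations: []corev1.Toleration{
				// An empty Effect matches all effects for the given key.
				{Key: "node.kubernetes.io/network-unavailable", Operator: corev1.TolerationOpExists},
				{Key: "node.kubernetes.io/not-ready", Operator: corev1.TolerationOpExists},
			},
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "example.com/app:latest", // placeholder image
			}},
		},
	}
	fmt.Printf("tolerations: %+v\n", pod.Spec.Tolerations)
}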
Documentation could definitely be clearer; it's not possible to find this without knowing the name of the taint.
This has come up a few times; right now the kubelet goes NotReady if the CNI network isn't set up, which makes the scheduler not consider that node at all, not even for hostNetwork Pods.
I think it's in the works to mark Node problems with Taints; where are we with that effort, @kubernetes/sig-scheduling-misc?
We should make it possible to schedule workloads on nodes that are NotReady if the workload tolerates all the required taints for the specific condition.
This is critical for kubeadm self-hosting, for example.
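As a rough sketch of the semantics being asked for, here is a simplified stand-in for the scheduler's taint/toleration predicate (not its actual code): a workload may land on a NotReady node only if every taint derived from the node's conditions is covered by one of its tolerations.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// tolerates is a simplified check: a toleration covers a taint if the keys
// match (an empty toleration key matches any key), the effects match (an
// empty toleration effect matches any effect), and the operator/value agree.
func tolerates(tol corev1.Toleration, taint corev1.Taint) bool {
	if tol.Key != "" && tol.Key != taint.Key {
		return false
	}
	if tol.Effect != "" && tol.Effect != taint.Effect {
		return false
	}
	if tol.Operator == corev1.TolerationOpExists {
		return true
	}
	// The default operator is Equal.
	return tol.Value == taint.Value
}

// schedulable reports whether every node taint is tolerated by the workload.
func schedulable(tolerations []corev1.Toleration, taints []corev1.Taint) bool {
	for _, taint := range taints {
		covered := false
		for _, tol := range tolerations {
			if tolerates(tol, taint) {
				covered = true
				break
			}
		}
		if !covered {
			return false
		}
	}
	return true
}

func main() {
	taints := []corev1.Taint{{Key: "node.kubernetes.io/network-unavailable", Effect: corev1.TaintEffectNoSchedule}}
	tols := []corev1.Toleration{{Key: "node.kubernetes.io/network-unavailable", Operator: corev1.TolerationOpExists}}
	// true: the workload tolerates the per-condition taint and can be scheduled.
	fmt.Println(schedulable(tols, taints))
}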
I will add more context on this later, but at least starting the discussion here... I didn't find any other issue, although one may exist.
@kubernetes/sig-node-feature-requests @kubernetes/sig-scheduling-feature-requests @kubernetes/sig-network-feature-requests