Add networkpolicies by karampok · Pull Request #2773 · metallb/metallb · GitHub
Add networkpolicies #2773

Open · karampok wants to merge 5 commits into main from the networkpolicies branch

Conversation

karampok (Collaborator) commented May 26, 2025

/kind feature

What this PR does / why we need it:

Special notes for your reviewer:
Adds network policy support as an opt-in feature.
E2E tests are executed with network policies enabled.

Release note:

 Add network policy support to Helm charts and manifests

    - Network policies are disabled by default. Users can enable them in Helm
      by setting the value `networkpolicies.enabled` to true. When deploying
      using manifests, the user should modify config/native/kustomization.yaml
      to include the network policies.

    - Users can install a default deny network policy rule if needed: in Helm
      by setting the chart value `networkpolicies.defaultDeny` to true; in
      manifests by modifying config/networkpolicies/kustomization.yaml
      to include or exclude the deny_all policy.
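Assuming the chart value names from the release note above, enabling the feature in Helm would look like this sketch:

```yaml
# values.yaml (sketch; keys taken from the release note above)
networkpolicies:
  enabled: true       # render the MetalLB network policies
  defaultDeny: false  # optionally also install the deny_all policy
```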

@karampok karampok force-pushed the networkpolicies branch 6 times, most recently from 8e246aa to b518762 Compare June 2, 2025 07:33
@karampok karampok changed the title Add networkpolicies in the helm chart Add networkpolicies Jun 2, 2025
@karampok karampok marked this pull request as ready for review June 2, 2025 11:18
@karampok karampok requested review from fedepaol and oribon as code owners June 2, 2025 11:18
{{- include "metallb.selectorLabels" . | nindent 6 }}
app.kubernetes.io/component: controller
egress:
- ports:
Member

can you add to the commit message why this exact ports (and same for ingress)?

Collaborator Author

Updated the commit message

app: metallb
component: controller
egress:
- ports:
Member

why we don't need 443 here?

Collaborator Author

Because:

```
oc get svc kubernetes -o yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2025-06-10T08:42:56Z"
  labels:
    component: apiserver
    provider: kubernetes
  name: kubernetes
  namespace: default
  resourceVersion: "195"
  uid: 8809a655-eee9-4e1d-863a-3147dbe44424
spec:
  clusterIP: 10.96.0.1
  clusterIPs:
  - 10.96.0.1
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: 6443
  sessionAffinity: None
  type: ClusterIP
```

The way the controller speaks to the apiserver is through the service, and the network policies seem to operate on the target port (at least in a kind cluster and in OCP). I cleaned up the egress port 443 from the helm chart.

Having said that, I will dig deeper to see whether there is standardized documentation on whether the CNI should match on the targetPort or not.
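In other words, the policy is enforced against the pod the traffic actually reaches, so the egress rule has to allow the apiserver's targetPort (6443 here) rather than the service port (443). A minimal sketch of such a rule:

```yaml
# Sketch of the egress rule discussed above: traffic to the apiserver
# goes through the `kubernetes` Service (port 443), but the policy is
# evaluated on the backend's targetPort, 6443.
egress:
- ports:
  - protocol: TCP
    port: 6443
```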

@oribon (Member) Jun 8, 2025

I don't think this is necessary (nor visible/informative enough), I think we should just drop it. (if our networkpolicies caused an error then it is a bug, and if custom policies were used we can't shield users from a misconfig)

Member

agree, it's not worth the extra complexity

Collaborator Author

okay, commit dropped

ingress:
- ports:
{{- if .Values.controller.webhookMode }}
- protocol: TCP
Member

I think it's gonna be more future-proof if we use named ports. IIUC the spec allows it https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.33/#networkpolicyport-v1-networking-k8s-io

If we don't have these ports under a name, we can add them.
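A named NetworkPolicyPort is resolved per-pod against the selected containers' named ports, so the rule keeps working even if the numeric port changes. A sketch, assuming the container declares a port named `webhook` (the port number below is illustrative):

```yaml
# Assumes the controller container declares:
#   ports:
#   - name: webhook
#     containerPort: 9443   # hypothetical value
ingress:
- ports:
  - protocol: TCP
    port: webhook   # matched against the pod's named container port
```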

Collaborator Author

done
(nice tip)

@@ -68,6 +68,7 @@ Kubernetes: `>= 1.19.0-0`
| imagePullSecrets | list | `[]` | |
| loadBalancerClass | string | `""` | |
| nameOverride | string | `""` | |
| networkpolicies.defaultDeny | bool | `false` | |
Member

what's the use case for this vs disabling / enabling netpols at all?

Collaborator Author

For example, if you install metallb in a different namespace where other things might exist, you want your policies but not a default-deny that would break those other things. In OCP, the plan is for an external component to eventually install the default rules, rather than having them added by the components. If you like we can exclude it from the helm API (but in that case the metallb-operator will have to add it somehow)
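For reference, the opt-in default deny discussed here is typically an empty-selector policy along these lines (names are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all           # illustrative name
  namespace: metallb-system
spec:
  podSelector: {}          # selects every pod in the namespace
  policyTypes:
  - Ingress
  - Egress
```

Because the empty `podSelector` matches every pod in the namespace, installing it in a shared namespace would also cut off unrelated workloads, which is exactly the concern raised above.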

Member

makes sense, thanks!


egress:
- ports:
- protocol: TCP
port: 6443
Member

these should be available to override, as we don't have control over them. Also, the speaker reaches out to the apiserver too, but I guess we are not adding it as it's host networked. Can you add an explicit comment about it?

@fedepaol (Member) commented Jun 9, 2025

Also, please enrich the release notes a bit saying what we are going to block. Also also, the network policies are enabled if you deploy with manifests / kustomize

@karampok karampok force-pushed the networkpolicies branch 3 times, most recently from a373369 to 2fb7825 Compare June 10, 2025 14:29
@karampok (Collaborator Author)

@fedepaol @oribon I think I have addressed all your comments, ptal thank you

@fedepaol (Member)

Looks good! Would it make sense to add a test ensuring that the network policies are effectively working?
I am thinking about curling to / from a different pod.

@karampok (Collaborator Author) commented Jun 12, 2025

Looks good! Would it make sense to add a test ensuring that the network policies are effectively working? I am thinking about curling to / from a different pod.

That is a bit tricky.
There is a default-deny; if our network policies are not correct then all tests fail because the operator never becomes healthy. If the kind cluster did not support netpol (which it does), we could have a metallb-unrelated test to verify it (but I think it is not very meaningful).

  1. We could have one other pod (kubectl run pod) in the namespace try to reach the metrics port, to verify that ingress traffic is blocked (but that means we should add a check if netpol is enabled and default deny exists, and we need logic to grab the pod IP and such)
  2. I cannot think of a good way to test the egress; it would need something like kubectl debug pod to get a shell in the controller network namespace

Is that something you have in mind?

(update: maybe we could have that test in the operator)

IMO it's not worth the extra complexity but we can do it

@fedepaol (Member)

Looks good! Would it make sense to add a test ensuring that the network policies are effectively working? I am thinking about curling to / from a different pod.

That is a bit tricky. There is a default-deny; if our network policies are not correct then all tests fail because the operator never becomes healthy. If the kind cluster did not support netpol (which it does), we could have a metallb-unrelated test to verify it (but I think it is not very meaningful).

Not sure I understood this part.

1. We could have one other pod (`kubectl run pod`) in the namespace try to reach the metrics port, to verify that ingress traffic is blocked (but that means we should add a check `if netpol is enabled and default deny exists`, and we need logic to grab the pod IP and such)

2. I cannot think of a good way to test the egress; it would need something like `kubectl debug pod` to get a shell in the controller network namespace

Is that something you have in mind?

(update: maybe we could have that test in the operator)

IMO it's not worth the extra complexity but we can do it

I am not comfortable in saying "it works" if we don't have a way to verify it.
What I was suggesting was to spin up a pod that we try to reach from the controller and that tries to curl the controller, to ensure that the netpol we are configuring are really effective.

@karampok (Collaborator Author) commented Jun 12, 2025

we try to reach from the controller

how do you suggest to do that, given we have a distroless container without a shell?

@fedepaol (Member)

we try to reach from the controller

how do you suggest to do that, given we have a distroless container without a shell?

https://metallb.io/troubleshooting/#how-to-debug-the-speaker-and-the-controller-containers

@karampok karampok force-pushed the networkpolicies branch 9 times, most recently from 42963bb to 956e4e8 Compare June 17, 2025 12:54
Member

not sure I understand the rationale of this change (is it to be able to run the same debug twice?)
if that's the reason, it seems a bit overcomplex (vs what we had before); is it possible (if reasonable) to just randomize the name of the ephemeral container instead when Execing?

Collaborator Author

A debug container is a container attached to the pod, and from there we can run
normal kubectl exec commands. That way we avoid passing any
-it argument and avoid adding more than one debug container per target
container.

Having the go client instead of kubectl is more flexible and is required
to catch errors as well as to wait for the debug pod to be ready

(from the commit message).
I think that is the more recommended implementation and not so complex imo.

- Having an interactive shell (-it) for a non-interactive call is wrong (if I remove it, it fails)
- The return value of the command is more straightforward
- Speed is also a reason (exec is faster) if we want to extend the number of debug commands
- Debug containers cannot be removed, so with dynamic names we would leak as many containers as calls
- Having a go-client implementation is better (we still use kubectl but we can start moving away from it)
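The approach described here amounts to patching the pod's `ephemeralContainers` subresource once with a long-lived debug container, then `kubectl exec`-ing into it for every command. A rough sketch (the image and command are illustrative):

```yaml
# Added via the pod's ephemeralcontainers subresource (sketch).
ephemeralContainers:
- name: controller-debugger            # fixed name, reused across execs
  image: registry.k8s.io/e2e-test-images/agnhost:2.47
  targetContainerName: controller      # share the controller's namespaces
  command: ["sleep", "infinity"]       # keep it alive for later execs
```

Subsequent commands then run through a plain `kubectl exec <pod> -c controller-debugger -- …`, so no TTY is involved and exit codes propagate normally.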

Member

what I mean is why specifically we need this change, what are we doing now that we couldn't do before? Is it the "if we want to extend the number of debug commands" part?

Collaborator Author

So if I revert and have in the test code

```go
ctrlExec := executor.ForPodDebug(controller.Namespace, controller.Name, "controller", agnhostImage)
out, err = ctrlExec.Exec("./agnhost", "connect", net.JoinHostPort(probeIP, "8080"), "--timeout", "5s")
Expect(err).NotTo(HaveOccurred()) // this error is nil, which in the refactored case is non-nil as expected
Expect(out).To(ContainSubstring("TIMEOUT")) // If no netpol then REFUSED
```

then the test fails:

```
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
------------------------------
Networkpolicies only allowed traffic
/home/kka/workspace/github.com/metallb/wts/up-np-metallb/e2etest/netpoltests/networkpolicies.go:89
  STEP: checking ingress - probe pod to any port on the controller other than webhook should be timeout @ 06/19/25 12:56:10.11
  STEP: checking egress - controller pod to any port on the probe other port than API should be timeout @ 06/19/25 12:56:15.324
  [FAILED] in [It] - /home/kka/workspace/github.com/metallb/wts/up-np-metallb/e2etest/netpoltests/networkpolicies.go:104 @ 06/19/25 12:56:15.88
• [FAILED] [19.888 seconds]
Networkpolicies [It] only allowed traffic
/home/kka/workspace/github.com/metallb/wts/up-np-metallb/e2etest/netpoltests/networkpolicies.go:89

  [FAILED] Expected
      <string>: Targeting container "controller". If you don't see processes from this container it may be because the container runtime doesn't support this feature.
      Unable to use a TTY - input is not a terminal or the right kind of file
      If you don't see a command prompt, try pressing enter.
      warning: couldn't attach to pod/controller-765bbcb6fc-dzcv4, falling back to streaming logs: error stream protocol error: unknown error

  to contain substring
      <string>: TIMEOUT
  In [It] at: /home/kka/workspace/github.com/metallb/wts/up-np-metallb/e2etest/netpoltests/networkpolicies.go:104 @ 06/19/25 12:56:15.88
------------------------------
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
```

Collaborator Author

which hides the real stdout from the process; while trying to fix it I refactored, and did not dig deeper (as the refactor seemed the right approach)

Collaborator Author

```
/home/kka/workspace/github.com/metallb/wts/up-np-metallb/build/kubectl debug -n metallb-system -c controller-debugger-x --target=controller --image=registry.k8s.io/e2e-test-images/agnhost:2.47 controller-765bbcb6fc-dzcv4 -- ./agnhost connect 10.244.1.7:8080 --timeout 5s
Targeting container "controller". If you don't see processes from this container it may be because the container runtime doesn't support this feature.
```

just running the command I do not see the stdout

@karampok (Collaborator Author) Jun 19, 2025

with -it (which we should not have in theory) I can see it, but there is some waiting (the go code kind of does not wait and returns early)

```
/home/kka/workspace/github.com/metallb/wts/up-np-metallb/build/kubectl debug -it -n metallb-system -c controller-debugger-xxx --target=controller --image=registry.k8s.io/e2e-test-images/agnhost:2.47 controller-765bbcb6fc-dzcv4 -- ./agnhost connect 10.244.1.7:8080 --timeout 5s
Targeting container "controller". If you don't see processes from this container it may be because the container runtime doesn't support this feature.
If you don't see a command prompt, try pressing enter.
 --- here is hanging for a bit ---

TIMEOUT
Session ended, the ephemeral container will not be restarted but may be reattached using 'kubectl attach controller-765bbcb6fc-dzcv4 -c controller-debugger-xxx -i -t' if it is still running
```

Collaborator Author

Re-worded the commit description with our chat findings

Member

only thing I still don't understand from the discussion is why the current test using it is failing. If the command was never returning the right value, I am not able to justify the mac assertion here

selectorMac, err := mac.GetIfaceMac(NodeNics[0], executor.ForPodDebug(speakerPod.Namespace, speakerPod.Name, "speaker", agnostImage))

You probably chatted about this, but the conclusion in the commit message seems either not accurate, or the current test is volkswagening:

    therefore it returns neither the correct output (= only the output from the command we want to run), nor the error (if the command fails, kubectl debug actually returns ok).
Collaborator Author

the observation is that for commands that return instantly (whoami/ifconfig), it works just fine. It does not work fine for commands that will timeout/exit. I re-wrote the commit so it is clear that the above sentence is only valid for that specific type of command (i.e. agnhost connect --timeout).

Comment on lines 153 to 159
```go
func CreatePod(cs clientset.Interface, p *corev1.Pod) (*corev1.Pod, error) {
	pod, err := cs.CoreV1().Pods("default").Create(context.TODO(), p, metav1.CreateOptions{})
	if err != nil {
		return nil, err
	}
	return WaitForPodReady(cs, pod.Namespace, pod.Name)
}
```
Member

this seems a bit opinionated (the "default" part at least, I guess it should be p.Namespace), I think it makes sense to omit this function and the next, creating/deleting directly where needed

@karampok (Collaborator Author) Jun 19, 2025

yes, I fixed it, it was a mistake. I could skip the function, but then I should also skip the delete function and the main code will look more packed (I think I also got feedback in the past to hide those under functions)

Member

fair enough (though I guess we can now omit the Delete part since deleting the namespace takes care of that)

tpod := &corev1.Pod{
ObjectMeta: metav1.ObjectMeta{
Name: "netpol-test",
Namespace: "default", // probe pod must not be in the metallb-system namespace
Member

can we create a new ns for this test (and remove it in the defer)?

Collaborator Author

Sure, I can move it into a random new namespace; I see no argument for/against though

Collaborator Author

I updated to use test-namespace

Member

I see no argument for/against though

we should avoid polluting namespaces (even if not very probable)

Collaborator Author

done

@karampok karampok force-pushed the networkpolicies branch 3 times, most recently from 04d7aa8 to 1143368 Compare June 19, 2025 12:12
@oribon (Member) left a comment

few nits, lgtm otherwise

}

func ForPodDebug(namespace, name, container, image string) Executor {
return &podDebugExecutor{
func ForPodDebug(cs clientset.Interface, namespace, podName, targetContainer string) (Executor, error) {
Member

can we pass the image as input like before instead of hardcoding agnhost?

Collaborator Author

done

cs: cs,
}

ephemeralName := "debugger-" + targetContainer
Member

nit: drop this and use ret.ephemeral (or drop references to ret.ephemeral, put this in ret and always use it)

Collaborator Author

done

}

return ret, wait.PollUntilContextTimeout(context.Background(), 1*time.Second, 20*time.Second, false, func(context.Context) (bool, error) {
Member

nit: can the wait be before the return, and the usual if err != nil after?

Collaborator Author

done

})
})

func expectedPolicies() ([]networkingv1.NetworkPolicy, error) {
Member

I think this is unnecessary (checking that the objs are there), e.g for the same reasons we don't check the webhook objects are there before running the webhook e2es. I guess we can keep it if you see there's value in this

Collaborator Author

I have mixed feelings, I will wait for @fedepaol's feedback

Member

Agree on this being redundant, we want to ensure the behavior, not the configuration that leads to the behavior.

Collaborator Author

removed

probe, err = metallb.CreatePod(cs, tpod)
Expect(err).NotTo(HaveOccurred())
ginkgo.DeferCleanup(func() {
err = metallb.DeletePod(cs, probe)
Member

nit: this is redundant (taken care of by deleting the namespace)

Collaborator Author

done

- protocol: TCP
port: 6443
ingress:
- ports:
Member

can we also have here monitoring?

Collaborator Author

why? I added it as a comment; in any case this is not enabled by default, the user must actively change the manifest (and we should not recommend using the non-https port)

- Network policies are disabled by default. Users can enable them in Helm
  by setting the value `networkpolicies.enabled` to true. When deploying
  using manifests, the user should modify config/native/kustomization.yaml
  to include the network policies.

- Users can install a default deny network policy rule if needed: in Helm
  by setting the chart value `networkpolicies.defaultDeny` to true; in
  manifests by modifying config/networkpolicies/kustomization.yaml
  to include or exclude the deny_all policy.

E2E tests add the network policies and the default deny rule
deny_all as part of the setup in the tasks.py file.

The egress network policy port for the k8s API is 6443 because
the connection takes place through the k8s service (as shown below).
This should be safe, but in the Helm charts there is a value to override it
if needed.

```
kubectl get svc kubernetes -o yaml
apiVersion: v1
kind: Service
metadata:
  name: kubernetes
  namespace: default
spec:
...
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: 6443
...
```

Signed-off-by: Konstantinos Karampogias <karampok@gmail.com>
@@ -25,7 +25,7 @@ import (
"go.universe.tf/e2etest/pkg/status"
)

const agnostImage = "registry.k8s.io/e2e-test-images/agnhost:2.45"
const agnhostImage = "registry.k8s.io/e2e-test-images/agnhost:2.47"
Member

nit: do we need this change? In any case, it does not belong to this commit

Collaborator Author

removed it from the commit

}

if !containerExists {
pod.Spec.EphemeralContainers = append(pod.Spec.EphemeralContainers, ephemeralContainer)
Member

I'd move this to a "addEphemeralContainerToPod" function

Collaborator Author

Done (if I understood your ask correctly)


@@ -150,3 +149,26 @@ func FRRK8SPods(cs clientset.Interface, namespace string) ([]*corev1.Pod, error)
}
return res, nil
}

func CreatePod(cs clientset.Interface, p *corev1.Pod) (*corev1.Pod, error) {
Member

Let's move these under pkg/k8s

Collaborator Author

Done


karampok added 4 commits June 23, 2025 09:46
I observed that the existing implementation works correctly for commands
that return instantly, like whoami or ifconfig, but not for commands that
take time to return; those fail.

e.g.
```
kubectl debug -it -n metallb-system -c controller-debugger-xxx
--target=controller --image=registry.k8s.io/e2e-test-images/agnhost:2.47
controller-765bbcb6fc-dzcv4 -- ./agnhost connect 10.244.1.7:8080 --timeout 5s

Targeting container "controller". If you don't see processes from this
container it may be because the container runtime doesn't support this
feature.
If you don't see a command prompt, try pressing enter.

 ---  hanging for a bit ---

TIMEOUT
Session ended, the ephemeral container will not be restarted but may be
reattached using 'kubectl attach controller-765bbcb6fc-dzcv4 -c
controller-debugger-xxx -i -t' if it is still running
```

For commands that take time to return, the current implementation does not
wait and returns before the actual command finishes (probably due to
how TTY works), and therefore it returns neither the correct output (=
only the output from the command we want to run) nor the error (if
the command fails, kubectl debug actually returns ok).

So this commit refactors the code to follow the logic where the debug
container is a container attached to the pod, from which we can run
normal `kubectl exec` commands without the `-it` argument; that
enables the e2e test for networkpolicies.

https://kubernetes.io/docs/tasks/debug/debug-application/debug-running-pod/#ephemeral-container

Signed-off-by: Konstantinos Karampogias <karampok@gmail.com>
Signed-off-by: Konstantinos Karampogias <karampok@gmail.com>
Signed-off-by: Konstantinos Karampogias <karampok@gmail.com>
Signed-off-by: Konstantinos Karampogias <karampok@gmail.com>
@karampok (Collaborator Author) left a comment

@fedepaol I updated according to your comments, let me know if I missed something
