Description
What happened:
I was investigating kubernetes-retired/kubefed#1024 and I stumbled across an issue which I believe might be a bug in Kubernetes.
I have successfully recreated this issue using some test configuration, so you don't need to deploy kubefed to reproduce it.
```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: "codeclou/docker-nginx-self-signed-ssl:latest"
        imagePullPolicy: Always
        ports:
        - containerPort: 4443
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  selector:
    app: nginx
  ports:
  - port: 443
    targetPort: 4443
---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  creationTimestamp: null
  name: testconfigs.example.io
spec:
  group: example.io
  version: v1
  versions:
  - name: v1
    storage: true
    served: true
  names:
    kind: TestConfig
    plural: testconfigs
  scope: Namespaced
  validation:
    openAPIV3Schema:
      type: object
      properties:
        spec:
          type: object
          properties:
            TestString:
              description: This is a test string
              type: string
---
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  name: test-webhook
webhooks:
- name: testconfigs.example.io
  clientConfig:
    service:
      namespace: default
      name: nginx
  rules:
  - operations:
    - CREATE
    - UPDATE
    apiGroups:
    - example.io
    apiVersions:
    - v1
    resources:
    - testconfigs
  failurePolicy: Fail
```
After applying the YAML, you can see that a pod gets created:
```
❯ kubectl get pods -o wide
NAME                                READY   STATUS    RESTARTS   AGE     IP             NODE                                           NOMINATED NODE   READINESS GATES
nginx-deployment-5cf877cd99-n4n9l   1/1     Running   0          9m52s   10.156.17.45   gke-simons-cluster-preemptible-899b51b7-m0zk   <none>           <none>
```
As well as a service:
```
❯ kubectl get services -o wide
NAME    TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)   AGE   SELECTOR
nginx   ClusterIP   10.156.32.6   <none>        443/TCP   10m   app=nginx
```
The other resources to be deployed are:
- a simple generic CRD of `kind: TestConfig`, and
- an admission validation webhook, `test-webhook`.
The webhook is invoked when a new resource of `kind: TestConfig` is created or updated. I would expect it to work as follows: an HTTPS request is made to the service `nginx`, which validates the request, and the object creation succeeds.
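For context, the nginx container here is only a TLS stand-in; a real validating webhook would have to answer the API server's `AdmissionReview` POST. A minimal sketch of what such a handler would return (field names follow the `admission.k8s.io/v1beta1` schema; the `TestString` check is just an illustration, not something the test container does):

```python
# Sketch of an admission.k8s.io/v1beta1 AdmissionReview response, as a real
# validating webhook behind the service would produce it. Illustrative only.

def build_admission_response(request_uid, allowed, message=""):
    """Build the AdmissionReview response body the API server expects."""
    response = {"uid": request_uid, "allowed": allowed}
    if not allowed:
        # 'status.message' is surfaced back to the kubectl user on denial.
        response["status"] = {"message": message}
    return {
        "apiVersion": "admission.k8s.io/v1beta1",
        "kind": "AdmissionReview",
        "response": response,
    }

def validate_testconfig(review):
    """Example policy: deny a TestConfig whose spec.TestString is empty."""
    uid = review["request"]["uid"]
    obj = review["request"]["object"]
    if not obj.get("spec", {}).get("TestString"):
        return build_admission_response(uid, False, "spec.TestString must be set")
    return build_admission_response(uid, True)
```

With `failurePolicy: Fail`, any failure to reach this handler (as seen below) blocks the object creation outright.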
Let's attempt to create a TestConfig object:
```yaml
---
apiVersion: example.io/v1
kind: TestConfig
metadata:
  name: test-object
  namespace: default
spec:
  TestString: "This is my test string"
```
I observe that the validation webhook request times out, causing the resource creation to fail.
```
❯ kubectl apply -f test-resource.yaml
Error from server (Timeout): error when creating "test-resource.yaml": Timeout: request did not complete within requested timeout 30s
```
After some investigation, I realised that the admission control webhook attempts to hit the pod's IP address (10.156.17.45), rather than the service's (10.156.32.6), on the port specified by the service's targetPort
(4443). This packet is intercepted by my GCE VPC firewall and gets denied.
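This is consistent with the API server resolving the webhook service to its endpoints rather than to the ClusterIP: it picks a backing pod IP and dials it directly on the service's `targetPort`. A rough illustration of that resolution using the values above (my own sketch, not actual kube-apiserver code):

```python
# Rough illustration of the webhook service resolution observed here: the
# API server bypasses the ClusterIP and dials an endpoint (pod) address on
# the service's targetPort. NOT actual kube-apiserver code.

def resolve_webhook_target(service, endpoint_ips, webhook_port=443):
    """Map a (Service spec, endpoint IPs) pair to the address dialed."""
    # Find the service port the webhook nominally uses (443 by default)...
    for p in service["ports"]:
        if p["port"] == webhook_port:
            target_port = p["targetPort"]
            break
    else:
        raise ValueError("service does not expose the webhook port")
    # ...then connect to a backing pod directly on the targetPort.
    return endpoint_ips[0], target_port

service = {"ports": [{"port": 443, "targetPort": 4443}]}
endpoints = ["10.156.17.45"]  # pod IP from `kubectl get pods -o wide` above
print(resolve_webhook_target(service, endpoints))  # → ('10.156.17.45', 4443)
```

Since the GKE master sits outside the node network, that pod-IP connection is exactly what the VPC firewall sees.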
```
{
  insertId: "a05tpmfdwoniw"
  jsonPayload: {
    connection: {
      dest_ip: "10.156.17.45"
      dest_port: 4443
      protocol: 6
      src_ip: "10.172.0.3"
      src_port: 57446
    }
    disposition: "DENIED"
    ...
```
GKE operates the control plane on a separate VPC network in a separate, Google-managed project. A firewall rule is deployed automatically during cluster creation to allow traffic between the master and the node pools on ports 443 and 10250 only.
As soon as I add port 4443 to this firewall rule, the request to the pod succeeds: the admission webhook makes a correct request to the service name on the correct port and the validation webhook is reached (in my specific case it fails with a certificate mismatch, since no webhook CA was configured and I used a test container).
```
❯ kubectl apply -f test-resource.yaml
Error from server (InternalError): error when creating "test-resource.yaml": Internal error occurred: failed calling webhook "testconfigs.example.io": Post https://nginx.default.svc:443/?timeout=30s: x509: certificate is valid for local.codeclou.io, not nginx.default.svc
```
When I remove port 4443 from the above-mentioned GCP firewall rule, the timeout issue presents itself again.
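For reference, the workaround amounts to adding the webhook's `targetPort` to the GKE-generated master-to-nodes firewall rule, along these lines (the rule name below is a placeholder from my cluster; yours will differ):

```shell
# Locate the GKE-generated master->nodes rule (named like gke-<cluster>-<hash>-master).
gcloud compute firewall-rules list --filter="name~^gke-.*-master$"

# Allow the webhook's targetPort in addition to the default 443 and 10250.
# NOTE: the rule name is a placeholder; substitute your own cluster's rule.
gcloud compute firewall-rules update gke-simons-cluster-899b51b7-master \
  --allow tcp:443,tcp:10250,tcp:4443
```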
What you expected to happen:
I didn't expect any communication to happen between the admission controller webhook (on the master network) and a specific pod's IP address.
The only communication I expect to happen should be between the webhook and the service.
How to reproduce it (as minimally and precisely as possible):
Please see above to reproduce the issue, or attempt to deploy the latest kubefed on a Kubernetes cluster where traffic between the master and node networks is restricted to ports 443 and 10250 only.
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`):
```
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-20T04:49:16Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.7-gke.8", Git
```
- Cloud provider or hardware configuration: GKE v1.13.7-gke.8 (latest at the time of writing)
- OS (e.g: `cat /etc/os-release`): Container-Optimized OS
- Kernel (e.g. `uname -a`): Google managed
- Install tools: Google managed
- Network plugin and version (if this is a network-related bug): Google managed
- Others:
/sig api-machinery