Description
Describe the bug:
cert-manager caches the result of which zone is responsible for which domain. If this information changes, cert-manager will not learn about it until it restarts.
When a challenge is presented, cert-manager attempts to find the associated zone (util.FindZoneByFqdn) and then caches the result. However, if a delegated zone changes (for whatever reason, in my case an initial misconfiguration), the wrong zone can end up cached. This cache is not cleared until cert-manager restarts.
So if the zone associated with a domain changes (e.g. because a subdomain is delegated to other name servers), the change will not be detected until the service restarts.
Expected behaviour:
If the zone responsible for a domain changes, cert-manager should (eventually) pick this up without a restart.
Possible ways to achieve this:
- cache should be cleared after some time;
- unsuccessful challenges have their related cache items invalidated; or
- on deletion of ingress / cert, associated cache items are also invalidated
Steps to reproduce the bug:
(I assume that cert-manager is running in k8s.)
- Have 2 different DNS providers: one that manages example.com (Provider 1) and one that manages sub.example.com (Provider 2).
- Set up an issuer with a DNS01 solver for Provider 2.
- Set up the DNS records like this:
# Records on Provider 1
NS example.com ns.provider-1.example
# Records on Provider 2
NS sub.example.com ns.provider-2.example
A my-website.sub.example.com ns.provider-2.example
Note: The above records are purposefully misconfigured. Provider 1 is missing an NS sub.example.com ns.provider-2.example record. This will be added later.
- Issue a challenge for my-website.sub.example.com.
- Note that the resolved zone will be example.com and the challenge will fail because the configured issuer does not have that zone. This is expected and correct behavior.
- Now correct the misconfiguration in Provider 1 by adding the NS record delegating the subdomain:
# Record to add on Provider 1
NS sub.example.com ns.provider-2.example
- Now issue the same challenge again.
- Note that the resolved zone is still example.com, even though it should now be sub.example.com. This is wrong. The challenge fails again.
- Restart the cert-manager deployment.
- Note that the challenge now resolves the zone correctly and the certificate gets issued as expected.
Anything else we need to know?:
In the file pkg/issuer/acme/dns/util/wait.go, there is a function FindZoneByFqdn that sets the cache. This is the cache that is never cleared.
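To make the failure mode concrete, here is a simplified sketch of the pattern as I understand it. It is not copied from wait.go: `zoneByFQDN`, `findZone` and the injected `resolve` function are hypothetical stand-ins. The point is only that a process-lifetime map with no expiry and no delete path keeps returning the first answer it saw.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical stand-ins for a process-wide zone cache; names are illustrative only.
var zoneMu sync.RWMutex
var zoneByFQDN = map[string]string{}

// findZone returns the cached zone if present; otherwise it calls resolve
// (a stand-in for the real DNS lookup) and stores the answer with no expiry.
func findZone(fqdn string, resolve func(string) string) string {
	zoneMu.RLock()
	zone, ok := zoneByFQDN[fqdn]
	zoneMu.RUnlock()
	if ok {
		return zone
	}
	zone = resolve(fqdn)
	zoneMu.Lock()
	zoneByFQDN[fqdn] = zone // nothing ever deletes or refreshes this entry
	zoneMu.Unlock()
	return zone
}

func main() {
	current := "example.com." // answer before the NS delegation exists
	resolve := func(string) string { return current }
	fqdn := "my-website.sub.example.com."

	fmt.Println(findZone(fqdn, resolve)) // example.com.

	current = "sub.example.com." // delegation fixed in DNS
	fmt.Println(findZone(fqdn, resolve)) // example.com. (stale until the process restarts)
}
```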
Environment details:
- Kubernetes version: 1.31.1
- Cloud-provider/provisioner: syseleven
- cert-manager version: v1.12.7
- Install method: helm
/kind bug