Dangerous behavior when reconciling LoadBalancer type Services · Issue #1650 · Altinity/clickhouse-operator · GitHub
More Web Proxy on the site http://driver.im/
You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
There is a potentially dangerous behavior in the current implementation of reconcileService function. When a Service update fails (which could happen due to temporary webhook connection issues or other transient errors), the code attempts to delete and recreate the Service:
iferr!=nil {
ifapiErrors.IsNotFound(err) {
// The Service is either not found or not updated. Try to recreate itw.a.V(1).M(cr).F().Info("Service: %s not found. err: %v", util.NamespaceNameString(service), err)
} else {
// The Service is either not found or not updated. Try to recreate itw.a.WithEvent(cr, a.EventActionUpdate, a.EventReasonUpdateFailed).
WithAction(cr).
WithError(cr).
M(cr).F().
Error("Update Service: %s failed with error: %v", util.NamespaceNameString(service), err)
}
_=w.c.deleteServiceIfExists(ctx, service.GetNamespace(), service.GetName())
err=w.createService(ctx, cr, service)
}
This approach is particularly dangerous for Services of type LoadBalancer. When such a Service is deleted and recreated:
The previously allocated external IP address is released
A new LoadBalancer is provisioned with a different IP address
All client connections to the previous IP address are broken
DNS records need to be updated, causing potential service disruption
Applications relying on stable endpoints will fail
Impact
This can cause significant production outages in environments where:
Applications depend on stable LoadBalancer IP addresses
External DNS records point to these LoadBalancer IPs
Client applications have cached the IP addresses
Proposed Solution
The code should handle LoadBalancer type Services differently from other Service types. Instead of deleting and recreating them on update failure, it should:
Check the Service type before attempting recreation
For LoadBalancer Services, log the error but preserve the existing Service
Only perform delete-and-recreate for other Service types (ClusterIP, NodePort)
Example implementation:
iferr!=nil {
// The Service is either not found or not updated.// For LoadBalancer type services, we should be careful not to delete and recreate// as this would cause the LoadBalancer IP to changeifservice.Spec.Type==core.ServiceTypeLoadBalancer {
w.a.WithEvent(chi, eventActionReconcile, eventReasonReconcileFailed).
WithStatusAction(chi).
WithStatusError(chi).
M(chi).F().
Error("FAILED to update LoadBalancer Service: %s CHI: %s. NOT deleting to preserve external IP", service.Name, chi.Name)
returnerr
}
// For other service types, we can try to recreate it...
}
This change would prevent unexpected IP address changes for LoadBalancer Services while still allowing recreation of other Service types.
The text was updated successfully, but these errors were encountered:
Description
There is a potentially dangerous behavior in the current implementation of
reconcileService
function. When a Service update fails (which could happen due to temporary webhook connection issues or other transient errors), the code attempts to delete and recreate the Service:This approach is particularly dangerous for Services of type
LoadBalancer
. When such a Service is deleted and recreated:Impact
This can cause significant production outages in environments where:
Proposed Solution
The code should handle LoadBalancer type Services differently from other Service types. Instead of deleting and recreating them on update failure, it should:
Example implementation:
This change would prevent unexpected IP address changes for LoadBalancer Services while still allowing recreation of other Service types.
The text was updated successfully, but these errors were encountered: