8000 Dangerous behavior when reconciling LoadBalancer type Services · Issue #1650 · Altinity/clickhouse-operator · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Dangerous behavior when reconciling LoadBalancer type Services #1650

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Cluas opened this issue Mar 4, 2025 · 0 comments
Open

Dangerous behavior when reconciling LoadBalancer type Services #1650

Cluas opened this issue Mar 4, 2025 · 0 comments

Comments

@Cluas
Copy link
Cluas commented Mar 4, 2025

Description

There is a potentially dangerous behavior in the current implementation of reconcileService function. When a Service update fails (which could happen due to temporary webhook connection issues or other transient errors), the code attempts to delete and recreate the Service:

if err != nil {
	if apiErrors.IsNotFound(err) {
		// The Service is either not found or not updated. Try to recreate it
		w.a.V(1).M(cr).F().Info("Service: %s not found. err: %v", util.NamespaceNameString(service), err)
	} else {
		// The Service is either not found or not updated. Try to recreate it
		w.a.WithEvent(cr, a.EventActionUpdate, a.EventReasonUpdateFailed).
			WithAction(cr).
			WithError(cr).
			M(cr).F().
			Error("Update Service: %s failed with error: %v", util.NamespaceNameString(service), err)
	}

	_ = w.c.deleteServiceIfExists(ctx, service.GetNamespace(), service.GetName())
	err = w.createService(ctx, cr, service)
}

This approach is particularly dangerous for Services of type LoadBalancer. When such a Service is deleted and recreated:

  1. The previously allocated external IP address is released
  2. A new LoadBalancer is provisioned with a different IP address
  3. All client connections to the previous IP address are broken
  4. DNS records need to be updated, causing potential service disruption
  5. Applications relying on stable endpoints will fail

Impact

This can cause significant production outages in environments where:

  • Applications depend on stable LoadBalancer IP addresses
  • External DNS records point to these LoadBalancer IPs
  • Client applications have cached the IP addresses

Proposed Solution

The code should handle LoadBalancer type Services differently from other Service types. Instead of deleting and recreating them on update failure, it should:

  1. Check the Service type before attempting recreation
  2. For LoadBalancer Services, log the error but preserve the existing Service
  3. Only perform delete-and-recreate for other Service types (ClusterIP, NodePort)

Example implementation:

if err != nil {
    // The Service is either not found or not updated.
    // For LoadBalancer type services, we should be careful not to delete and recreate
    // as this would cause the LoadBalancer IP to change
    if service.Spec.Type == core.ServiceTypeLoadBalancer {
        w.a.WithEvent(chi, eventActionReconcile, eventReasonReconcileFailed).
            WithStatusAction(chi).
            WithStatusError(chi).
            M(chi).F().
            Error("FAILED to update LoadBalancer Service: %s CHI: %s. NOT deleting to preserve external IP", service.Name, chi.Name)
        return err
    }

    // For other service types, we can try to recreate it
    ...
}

This change would prevent unexpected IP address changes for LoadBalancer Services while still allowing recreation of other Service types.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant
0