Description
What version of Knative?
Knative v1.17
Go 1.23.x
Expected Behavior
Requests can time out without causing the activator to crash.
Actual Behavior
The activator may panic and crash when a request runs into a timeout.
Investigation
This seems to be due to a race condition in the activator when requests run into a timeout.
The last log line in this case may be, for example, "error reverse proxying request", printed from the Knative error handler that is passed to the reverse proxy handler.
This is then followed by a panic and a crashing activator container due to "fatal error: concurrent map read and map write". The map in question is the header map of the HTTP response writer.
Stack trace from the crashing goroutine in Go's net/http:
fatal error: concurrent map read and map write
goroutine 28502346 [running]:
net/http.(*chunkWriter).writeHeader(0xc041e3a2e0, {0xc0350a2000, 0x19, 0x800})
net/http/server.go:1493 +0x8f3
net/http.(*chunkWriter).Write(0xc041e3a2e0, {0xc0350a2000, 0x19, 0x800})
net/http/server.go:376 +0x37
bufio.(*Writer).Flush(0xc01bd772c0)
bufio/bufio.go:639 +0x55
net/http.(*response).finishRequest(0xc041e3a2a0)
net/http/server.go:1715 +0x45
net/http.(*conn).serve(0xc028e0d5f0, {0x24d3d58, 0xc01e1ece70})
net/http/server.go:2098 +0x615
created by net/http.(*Server).Serve in goroutine 1591
net/http/server.go:3360 +0x485
On timeout, the timeout handler in pkg/http/handler/timeout.go:
- races with the inner reverse proxy handler to write an error to the HTTP response
- cancels the context of the inner reverse proxy handler
- returns
After the timeout handler returns, the inner reverse proxy handler is still processing the canceled context and will continue to write headers of the HTTP response.
The headers are of type http.Header, i.e. map[string][]string, and are read and written without any synchronization.
The timeout handler uses a lock to synchronize writes with the inner handler, but that lock does not cover the header map.
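To illustrate why a lock around writes does not protect the headers, here is a small sketch. lockedWriter is a hypothetical wrapper, not the type used in pkg/http/handler/timeout.go; it assumes only that Write and WriteHeader are guarded by a mutex while Header returns the underlying map.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// lockedWriter is a hypothetical wrapper that serializes Write and
// WriteHeader with a mutex, similar in spirit to what is described above.
type lockedWriter struct {
	mu sync.Mutex
	w  http.ResponseWriter
}

// Compile-time check that lockedWriter satisfies http.ResponseWriter.
var _ http.ResponseWriter = (*lockedWriter)(nil)

func (l *lockedWriter) Write(p []byte) (int, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.w.Write(p)
}

func (l *lockedWriter) WriteHeader(code int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.w.WriteHeader(code)
}

// Header hands out the underlying map itself. Callers mutate that map
// directly (Set, Del, Add), so the mutex above is never involved.
func (l *lockedWriter) Header() http.Header {
	return l.w.Header()
}

// sharesHeaderMap reports whether a header set through the wrapper is
// visible on the wrapped writer, i.e. whether both use the same map.
func sharesHeaderMap() bool {
	rec := httptest.NewRecorder()
	lw := &lockedWriter{w: rec}
	lw.Header().Set("X-Sketch", "1")
	return rec.Header().Get("X-Sketch") == "1"
}

func main() {
	fmt.Println(sharesHeaderMap()) // prints "true"
}
```

Because the map is shared, any goroutine holding the writer can modify it while the server reads it. For comparison, the standard library's http.TimeoutHandler avoids sharing by giving the inner handler its own header map and copying it to the real writer only when the handler completes in time; whether that approach fits the activator is a separate question.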
In the error case above, the inner reverse proxy handler received the cancellation and then called the Knative error handler, because the context was canceled (https://github.com/knative/pkg/blob/a877090f011ffdff7227c436d9553d7ca4699bc1/network/error_handler.go#L30). This called https://github.com/golang/go/blob/7b263895f7dbe81ddd7c0fc399e6a9ae6fe2f5bf/src/net/http/server.go#L2336 which does the actual header map write.
The error handler deletes elements from and adds elements to the header map of the HTTP response while the timeout handler itself is returning to https://github.com/golang/go/blob/7b263895f7dbe81ddd7c0fc399e6a9ae6fe2f5bf/src/net/http/server.go#L1699, from which it was spawned upon receiving the request and which then tries to read the header map when finishing the request. This may trigger a panic and a subsequent crash of the activator.
If the activator receives a response at the same time as the timeout occurs, the inner reverse proxy handler may also still be performing concurrent operations on the header map, e.g. https://github.com/golang/go/blob/7b263895f7dbe81ddd7c0fc399e6a9ae6fe2f5bf/src/net/http/httputil/reverseproxy.go#L518
This is crashing our activators once every few days.
Steps to Reproduce the Problem
Difficult since it's a race condition. In theory, the race condition can likely occur any time a request times out.