Activator sometimes crashing when requests time out · Issue #15850 · knative/serving · GitHub
Activator sometimes crashing when requests time out #15850
Open
@norman465

Description


What version of Knative?

Knative v1.17
Go 1.23.x

Expected Behavior

Requests can time out without causing the activator to crash.

Actual Behavior

The activator may panic and crash when a request runs into a timeout.

Investigation

This seems to be due to a race condition in the activator when requests run into a timeout.

The last log line in this case may, e.g., be "error reverse proxying request", printed from the Knative error handler that is passed to the reverse proxy handler.
This is then followed by a panic and a crashing activator container due to: fatal error: concurrent map read and map write. The map in question is the header map of the HTTP response writer.
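
For reference, Go aborts the whole process when it detects unsynchronized map access: it is a fatal runtime error, not a recoverable panic, which is why a single racy request takes down the entire activator rather than just one goroutine. A minimal standalone demo of the failure mode (not Knative code, purely illustrative):

// Unsynchronized concurrent access to a map terminates the process with
// "fatal error: concurrent map read and map write" (or a close variant).
// Unlike a panic, this cannot be recovered.
package main

func main() {
	h := map[string][]string{} // same shape as http.Header
	go func() {
		for {
			h["X-Writer"] = []string{"v"} // concurrent writer
		}
	}()
	for {
		_ = h["Content-Type"] // concurrent reader on the main goroutine
	}
}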

Stack trace from the crashing goroutine in Go's net/http:

fatal error: concurrent map read and map write

goroutine 28502346 [running]:
net/http.(*chunkWriter).writeHeader(0xc041e3a2e0, {0xc0350a2000, 0x19, 0x800})
	net/http/server.go:1493 +0x8f3
net/http.(*chunkWriter).Write(0xc041e3a2e0, {0xc0350a2000, 0x19, 0x800})
	net/http/server.go:376 +0x37
bufio.(*Writer).Flush(0xc01bd772c0)
	bufio/bufio.go:639 +0x55
net/http.(*response).finishRequest(0xc041e3a2a0)
	net/http/server.go:1715 +0x45
net/http.(*conn).serve(0xc028e0d5f0, {0x24d3d58, 0xc01e1ece70})
	net/http/server.go:2098 +0x615
created by net/http.(*Server).Serve in goroutine 1591
	net/http/server.go:3360 +0x485

On timeout, the timeout handler in pkg/http/handler/timeout.go:

  1. races with the inner reverse proxy handler to write an error to the HTTP response
  2. cancels the context of the inner reverse proxy handler
  3. returns

After the timeout handler returns, the inner reverse proxy handler is still processing the canceled context and will continue to write headers of the HTTP response.
The headers live in a plain Go map (http.Header, i.e. map[string][]string) that is read from and written to without any synchronization.
The timeout handler uses a lock to synchronize writes with the inner handler, but that lock does not cover the header map.
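
Schematically, the interaction has the shape of the sketch below. This is not the actual Knative timeout handler, just a simplified, hypothetical stand-in (timeoutHandler is a made-up name) showing how the ResponseWriter ends up shared between the outer handler, the still-running inner handler, and net/http's connection goroutine:

package sketch

import (
	"context"
	"net/http"
	"time"
)

// Simplified shape of the race, not the real pkg/http/handler/timeout.go.
func timeoutHandler(inner http.Handler, timeout time.Duration) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), timeout)
		defer cancel()

		done := make(chan struct{})
		go func() {
			defer close(done)
			// May keep running after the outer handler has returned, and may
			// keep mutating w.Header() -- a plain map -- e.g. via the reverse
			// proxy's error handler.
			inner.ServeHTTP(w, r.WithContext(ctx))
		}()

		select {
		case <-done:
		case <-ctx.Done():
			// A lock can serialize this write with the inner handler's body
			// writes, but it does not cover later w.Header() map mutations.
			http.Error(w, "activator request timeout", http.StatusGatewayTimeout)
			// Returning hands the ResponseWriter back to net/http, which reads
			// the header map in finishRequest while the inner goroutine may
			// still be adding or deleting keys.
		}
	})
}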

In our error case above, the inner reverse proxy handler received the cancel and then called the Knative error handler because the context was canceled (https://github.com/knative/pkg/blob/a877090f011ffdff7227c436d9553d7ca4699bc1/network/error_handler.go#L30). That in turn called https://github.com/golang/go/blob/7b263895f7dbe81ddd7c0fc399e6a9ae6fe2f5bf/src/net/http/server.go#L2336, which does the actual header map write.
The error handler deletes elements from and adds elements to the header map of the HTTP response. Meanwhile the timeout handler is returning to https://github.com/golang/go/blob/7b263895f7dbe81ddd7c0fc399e6a9ae6fe2f5bf/src/net/http/server.go#L1699, from which it was spawned when the request came in, and net/http then reads the header map while finishing the request. This may trigger the panic and the subsequent crash of the activator.

If the activator received a response at the same time as the timeout occurred, the inner reverse proxy handler may also still be doing concurrent operations on the header map, e.g. https://github.com/golang/go/blob/7b263895f7dbe81ddd7c0fc399e6a9ae6fe2f5bf/src/net/http/httputil/reverseproxy.go#L518

This is crashing our activators once every few days.
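
For comparison, Go's own net/http.TimeoutHandler sidesteps this class of race by never letting the inner handler touch the real ResponseWriter: the inner handler writes into a buffered body and a private header map, and the buffered response is only copied to the real writer, under a mutex, if the inner handler finishes before the deadline. A rough sketch of that idea (names such as bufferedWriter are made up here; this is not a proposed patch, just the pattern):

package sketch

import (
	"net/http"
	"sync"
)

// bufferedWriter is handed to the inner handler instead of the real
// ResponseWriter, so late header writes never reach the map that net/http
// reads when finishing the request.
type bufferedWriter struct {
	mu       sync.Mutex
	real     http.ResponseWriter // only touched by the winning side
	header   http.Header         // private header map for the inner handler
	body     []byte
	code     int
	timedOut bool
}

func (bw *bufferedWriter) Header() http.Header { return bw.header }

func (bw *bufferedWriter) WriteHeader(code int) {
	bw.mu.Lock()
	defer bw.mu.Unlock()
	if !bw.timedOut && bw.code == 0 {
		bw.code = code
	}
}

func (bw *bufferedWriter) Write(p []byte) (int, error) {
	bw.mu.Lock()
	defer bw.mu.Unlock()
	if bw.timedOut {
		return 0, http.ErrHandlerTimeout // late writes are dropped
	}
	bw.body = append(bw.body, p...)
	return len(p), nil
}

// markTimedOut is called by the outer handler when the deadline fires; from
// then on the outer handler alone talks to the real writer.
func (bw *bufferedWriter) markTimedOut() {
	bw.mu.Lock()
	defer bw.mu.Unlock()
	bw.timedOut = true
}

// flush copies the buffered response to the real writer; called by the outer
// handler only when the inner handler finished before the deadline.
func (bw *bufferedWriter) flush() {
	bw.mu.Lock()
	defer bw.mu.Unlock()
	dst := bw.real.Header()
	for k, vv := range bw.header {
		dst[k] = vv
	}
	if bw.code == 0 {
		bw.code = http.StatusOK
	}
	bw.real.WriteHeader(bw.code)
	bw.real.Write(bw.body)
}

Something along these lines in the activator's timeout handler might keep the header map out of reach of the losing side.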

Steps to Reproduce the Problem

Difficult, since it's a race condition. In theory it can occur any time a request times out.
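
One way to make the race visible without waiting for a crash is Go's race detector. The sketch below reuses the hypothetical timeoutHandler from the sketch above: the deadline is set just below the inner handler's latency, so most requests time out right before the inner handler writes its headers. Run with go test -race; timing permitting, the detector should report the concurrent header map access long before the runtime's fatal error would appear:

package sketch

import (
	"net/http"
	"net/http/httptest"
	"testing"
	"time"
)

func TestHeaderRaceOnTimeout(t *testing.T) {
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(2 * time.Millisecond) // finish just after the deadline
		w.Header().Set("X-Late", "true") // late header map write
		w.WriteHeader(http.StatusOK)
	})
	srv := httptest.NewServer(timeoutHandler(slow, time.Millisecond))
	defer srv.Close()

	for i := 0; i < 1000; i++ {
		resp, err := http.Get(srv.URL)
		if err != nil {
			t.Fatal(err)
		}
		resp.Body.Close()
	}
	time.Sleep(10 * time.Millisecond) // let straggling inner handlers finish
}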

Labels

kind/bug, triage/accepted
