Description
I'm periodically seeing connection failures when trying to connect to an Aurora Serverless instance. Most connection attempts are successful, but occasionally we get errors, making me think that there's some underlying network / database issue causing the problems. The error messages look something like this:
failed to connect to `host=xyz user=xyz database=xyz`: dial error (dial tcp x.x.x.x:5432: i/o timeout)
We're using v1.7.0 of pgconn and v4.9.0 of pgx. I know these aren't the latest versions, so we can definitely look at updating if there's anything that's likely to help with this issue.
The connection attempt times out after 60 seconds, which makes sense because of this line, and the error message is coming from here.
While investigating this, I noticed there's connection retry logic in the Go sql package, for example here. It automatically retries connecting if driver.ErrBadConn is returned. I guess what I'm wondering is: would it make sense to return ErrBadConn when a dial timeout happens? Obviously this doesn't solve the underlying issue, but it might mitigate the problem, assuming it's transient.
I'm happy to experiment with this, but I just wanted to ask first since I'm not mega familiar with Go SQL drivers.
Thanks in advance!