-
Notifications
You must be signed in to change notification settings - Fork 951
Set socket backlog to default size of 50 #7087
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set socket backlog to default size of 50 #7087
Conversation
6c9a29b
to
6611ccf
Compare
Thanks for this! sbt/sbt uses its own variant of scripted plugin that uses a freshly built instance and runs it as a function. I'm guessing that the communication would still work the same way to repro the issue? As per backlog is concerned it kind of makes sense that it was set to 1 because unlike today's sbt server, scripted was designed to be 1 on 1 protocol, and it wanted to refuse any second client connecting in. |
I think it should work, I will try and see if I can replicate this properly using a scripted test.
If we wan't to keep the backlog of being size val backlog =
if (sys.props("os.name").toLowerCase(java.util.Locale.ROOT).contains("windows"))
2
else
1 Might be enough to solve the problem, but to be clear the |
Yea. You're right about that. If it fixes this issue, I'd be happy to bump the number up to 50 or whatever the default is. |
I did a bit more work on this, I setup Windows Server 2022 (which is whats used in the |
@eed3si9n Should we finally merge this PR? As I understand, it won't hurt and might actually fix the issue. |
@ptrdom As far as I remember I don't think this fixed anything, and I actually am not sure if it's harmless to allow multiple concurrent connections to the server since typically we're dealing with one scripted client. Suppose there's an issue of Windows client failing to close the client instantly or something, increasing the concurrency may delay the issue, but ultimately it will fail after some time? |
I have a feeling that the issue with slow Windows client is it is just a smidge too slow to close connection and free up the queue, and a subsequent connection request then fails with no retries. My scripted test that fails has multiple lines of Could we at least make the backlog - and maybe port ranges - configurable so that users have options to play around with? |
I wanted to chime in and say that I agree with @eed3si9n in that while I definitely want this to get through, I have an aversion to merging the PR until we are really sure that we know what the root cause is
Making the backlog configurable for me seems entirely appropriate. Regarding port ranges, personally I would just change the code so that it either generates random ports after 65536 ( or w/e the number is), another arguably better solution would be to use There is also the point that @eed3si9n raised before regarding concurrency, while I don't think its a problem because of what I said earlier, if someone configures the backlog to be greater than 1 we may wan't to verify that we only have a single client connecting by using a lock but this may be excessive.
Nice work! This probably explains why I had troubles replicating it as a sbt scripted test, I was trying to minimize the issue but if this is the core issue than the attempt at minimization would have avoided this problem entirely. |
Maybe we could make the default implementation do |
This is news to me, I never have had a case of
This is news to me, I haven't ever experienced such cases but yeah I would just defer to what @eed3si9n suggests |
I started to have doubts about my assumptions, so I added sbt as git submodule to my failing project so I can easily test changes. I will try re-running different changes multiple times, until it hopefully can be confirmed what is up.
|
And here is the error in question - https://github.com/ptrdom/scalajs-esbuild/actions/runs/6161244852/job/16720046629?pr=12#step:10:490. But somehow sbt managed to recover from it and continue execution instantly. |
@eed3si9n Just remembered - backlog is not about concurrent established connections, but concurrent incoming connections, if I understand the docs correctly. This means that these double invocations in quick succession can definitely trip the backlog of 1 and it is not a concurrency mechanism that replaces locking. |
@ptrdom sort of, yes and no. https://docs.oracle.com/javase/8/docs/api/java/net/ServerSocket.html#ServerSocket-int-int-
I don't know if using analogy would help or be more confusing, but we're trying to make a restaurant with one seating, currently instead of counting the number of customers, we have a small door enough to fit exactly one person. Windows keeps failing because for some reason two people are trying to get through the small door. Making 50 doors, and then subsequently telling second customers to go away might be a more technical approach, but I don't know if that would necessarily help the Windows situation. |
@eed3si9n Exactly, very good analogy! I did some testing with these changes - 1.9.x...ptrdom:sbt:scripted-fix - on my failing project - https://github.com/ptrdom/scalajs-esbuild/actions/runs/6163002707/job/16744233698 - and I can confidently say that increasing backlog helped - before it used to fail very consistently, now it never fails. If you need more proof I can think of ways to provide it, but I am very sure that backlog of 1 is the problem. I am guessing it is simply tripping up slow CI machines. One interesting thing I observed that clamping port range to 8000-9000 did not prevent connection errors from happening, but execution recovered and continued ahead from that error, where what would happen in high 10000+ ports was basically a deadlock - connection error would be thrown and CI instance would be locked up for hours. I suggest we re-target this PR to current main branch ( |
sounds good |
@mdedetrich Could you re-target this PR to |
Looks like 1.8.x and 1.9.x was close enough to switch automatically. |
@eed3si9n Thanks for merging this! One thing that i have a concern about is that I don't think the Relying on my memory, but I think its a leftover of when I was trying to make a replicating test and I can't remember if I succeeded or not. |
True, that test is not really testing what it says it tests. We probably should just remove it. |
@ptrdom So at least on my end, I don't think this changed actually solved the underlying issue (see sbt/sbt-github-actions#163 and https://github.com/sbt/sbt-github-actions/actions/runs/6181923346/job/16780712199?pr=163) |
@mdedetrich I see your scripted tests are running on sbt |
Ah right, always forget to set that. Gimme a sec |
Yes this was indeed it, but also need to set the sbt version in |
That should not be necessary. You can check the project setup here - https://github.com/ptrdom/scalajs-esbuild, there is only one place where I am setting the sbt version. AFAIK, setting |
So without setting Let me diagnose a bit more |
@ptrdom So I did a lot of testing and for some reason I am forced to set Interestingly the CI is only failing on graalvm (see https://github.com/sbt/sbt-github-actions/actions/runs/6184038952/job/16786950966) but this may just be a case of the non graalvm github action passing in the rare chance it does. |
I think as side effect of this change is this error - https://github.com/ptrdom/scalajs-esbuild/actions/runs/6668934885/job/18125613525?pr=43#step:9:544. It happens more rarely, but does happen. Probably what happens is that sbt is a bit slow to shutdown before next test starts. Would be neat to find a reliable way to investigate this, somehow emulating a behavior of a slow CI machine. |
Since I just had it happen again and the logs of my previously reported job have expired, I will post the error instead:
|
And I should probably post all this on a new issue rather than an already merged PR 😅 |
My current up to date theory as to what is causing the issues with sbt scripted tests failing in Windows is actually not the method for finding ports (although that should also be improved) but rather using the non conventional
backlog
size of 1. The reason why I don't think it has anything to do with the port selection method is that its the sbt client that is having its connection refused, not the server when binding the port (in other words the server is not having any issues opening a port even if its calculated by random number generation). This is also ontop of the fact that I have been unable at all to reproduce thewindows-latest
CI image failing with the proper way of finding a free port.The reasoning behind my theory of it being related to
backlog
is that its behaviour is highly dependant on both the OS and how the OS is configured. Since the backlog is a queueing mechanism along with the fact some OS's can process requests faster than others, you can get differing behaviour (i.e. if something is processed faster than the next request). Normally with a high enough value (i.e. the default which is50
) this is a non issue in practice however with it currently being set to1
essentially this is locking the server/client to only have one connection at a time, if there happens to be an extra connection made and the current one is not resolved then you get thejava.net.ConnectException: Connection refused: connect
from the client.One of the differences between OS's which I have been able to deterministically confirm is that for a
backlog
value ofx
, Windows actually only is able to processx
- 1 connections at once where as *nix systems can processx
number of connections but ONLY for very low values ofx
(i.e.1
which is what it is currently). This alone doesn't explain the scripted test failures because on my Windows desktop the sbt-github-actions scripted tests still pass however if we account for the fact that thewindows-latest
CI machines are much slower and/or they are configured differently plus the recursivecreateServer
call filling up the backlog this can then explain why scripted tests are failing. So with all of this combined this is my currently most plausible explanation as to why the sbt scripted tests are just failing onwindows-latest
CI image, you can see the result of these findings at #7084.Unfortunately testing this conclusively is very hard, from what I can tell there aren't any scripted tests in this sbt project and making some sbt-plugin use a modified version of sbt to test in the
windows-latest
CI image is quite painful/annoying. Hence my current thinking is to set thebacklog
to the default value (which is50
) as its a very safe change, furthermore I checked the git log and I can't see any evidence of there being a rational reason as to why it was set to1
in the first place (i.e. it was originally1
and has never changed since so I think it was just an oversight). The easiest way to confirm this theory would be to make thebacklog
value configurable so that it can be configured in a project such as https://github.com/sbt/sbt-github-actions making the testing/verification easier however I presume that making thebacklog
value configurable can be regarded as excessive/overkill?@eed3si9n Let me know what you think.
References: #7082