Set socket backlog to default size of 50 #7087

mdedetrich · 2022-12-08T10:01:07Z

My current theory about what is causing the sbt scripted tests to fail on Windows is that the culprit is not the method for finding ports (although that should also be improved) but rather the unconventional backlog size of 1. The reason I don't think it has anything to do with the port selection method is that it's the sbt client that is having its connection refused, not the server when binding the port (in other words, the server has no trouble opening a port even if it's calculated by random number generation). This is also on top of the fact that I have been unable to reproduce the windows-latest CI image failing with the proper way of finding a free port.

The reasoning behind my theory that it's related to the backlog is that the backlog's behaviour is highly dependent on both the OS and how the OS is configured. Since the backlog is a queueing mechanism, and some OSes can process requests faster than others, you can get differing behaviour (i.e. depending on whether one request is processed before the next arrives). Normally, with a high enough value (i.e. the default, which is 50), this is a non-issue in practice. However, with it currently set to 1, the server/client is essentially locked to having only one connection at a time: if an extra connection is made while the current one is not yet resolved, the client gets java.net.ConnectException: Connection refused: connect.

One of the differences between OSes that I have been able to deterministically confirm is that for a backlog value of x, Windows is only able to process x - 1 connections at once, whereas *nix systems can process x connections, but ONLY for very low values of x (i.e. 1, which is what it currently is). This alone doesn't explain the scripted test failures, because on my Windows desktop the sbt-github-actions scripted tests still pass. However, if we account for the fact that the windows-latest CI machines are much slower and/or configured differently, plus the recursive createServer call filling up the backlog, this can explain why the scripted tests are failing. All of this combined is currently my most plausible explanation for why the sbt scripted tests fail only on the windows-latest CI image. You can see the result of these findings at #7084.
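
A minimal sketch of how the queueing behaviour described above could be probed, assuming a server socket that never calls accept() so that pending connections stay in the backlog. The object name BacklogProbe and its parameters are hypothetical and are not part of sbt; the number it returns is expected to differ between operating systems and configurations, which is exactly what such a probe would be used to compare.

```scala
import java.net.{InetAddress, InetSocketAddress, ServerSocket, Socket}
import scala.collection.mutable.ListBuffer
import scala.util.Try

// Hypothetical helper, not part of sbt.
object BacklogProbe {

  /** Opens a listening socket with the given backlog, never calls accept(),
    * and counts how many client connections complete before one fails
    * (refused or timed out) or `maxAttempts` is reached.
    */
  def countQueuedConnections(backlog: Int, maxAttempts: Int = 10, timeoutMs: Int = 500): Int = {
    val loopback = InetAddress.getLoopbackAddress
    val server   = new ServerSocket(0, backlog, loopback) // port 0: let the OS pick
    val clients  = ListBuffer.empty[Socket]
    try {
      val target = new InetSocketAddress(loopback, server.getLocalPort)
      var failed = false
      while (!failed && clients.size < maxAttempts) {
        val client = new Socket()
        if (Try(client.connect(target, timeoutMs)).isSuccess) clients += client
        else {
          client.close()
          failed = true
        }
      }
      clients.size
    } finally {
      // Always release the sockets, even if the probe throws.
      clients.foreach(c => Try(c.close()))
      server.close()
    }
  }
}
```

Calling BacklogProbe.countQueuedConnections(backlog = 1) on different machines would show how many pending connections each one tolerates before a client is turned away.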

Unfortunately, testing this conclusively is very hard. From what I can tell there aren't any scripted tests in this sbt project, and making some sbt plugin use a modified version of sbt to test in the windows-latest CI image is quite painful/annoying. Hence my current thinking is to set the backlog to the default value (which is 50), as it's a very safe change. Furthermore, I checked the git log and I can't see any evidence of a rationale for why it was set to 1 in the first place (i.e. it was originally 1 and has never changed since, so I think it was just an oversight). The easiest way to confirm this theory would be to make the backlog value configurable so that it can be set in a project such as https://github.com/sbt/sbt-github-actions, making the testing/verification easier. However, I presume that making the backlog value configurable could be regarded as excessive/overkill?
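
As a rough illustration of what a configurable backlog might look like, here is a sketch that reads the value from a JVM system property and falls back to 50. The property name sbt.scripted.backlog and the object BacklogSetting are assumptions made up for this example; the thread does not define any such setting.

```scala
import scala.util.Try

// Hypothetical helper; neither the object nor the property name exists in sbt.
object BacklogSetting {
  // 50 is the default backlog used by java.net.ServerSocket.
  val DefaultBacklog = 50

  // Assumed property name, chosen only for illustration.
  val PropertyName = "sbt.scripted.backlog"

  /** Parses a raw value, falling back to the default when it is missing,
    * not a number, or not positive.
    */
  def parse(raw: Option[String]): Int =
    raw
      .flatMap(s => Try(s.trim.toInt).toOption)
      .filter(_ > 0)
      .getOrElse(DefaultBacklog)

  /** Reads the backlog from the JVM system properties. */
  def current(): Int = parse(sys.props.get(PropertyName))
}
```

With something like this in place, a test project could pass -Dsbt.scripted.backlog=1 or -Dsbt.scripted.backlog=50 on CI to compare the two behaviours without rebuilding sbt.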

@eed3si9n Let me know what you think.

References: #7082

eed3si9n · 2022-12-08T16:47:21Z

Thanks for this!

sbt/sbt uses its own variant of scripted plugin that uses a freshly built instance and runs it as a function. I'm guessing that the communication would still work the same way to repro the issue?
The tests are under sbt-app/src/sbt-test

As far as the backlog is concerned, it kind of makes sense that it was set to 1 because, unlike today's sbt server, scripted was designed to be a 1-on-1 protocol, and it wanted to refuse any second client connecting in.

mdedetrich · 2022-12-08T17:55:57Z

sbt/sbt uses its own variant of scripted plugin that uses a freshly built instance and runs it as a function. I'm guessing that the communication would still work the same way to repro the issue?
The tests are under sbt-app/src/sbt-test

I think it should work, I will try and see if I can replicate this properly using a scripted test.

As far as the backlog is concerned, it kind of makes sense that it was set to 1 because, unlike today's sbt server, scripted was designed to be a 1-on-1 protocol, and it wanted to refuse any second client connecting in.

If we want to keep the backlog at size 1, then the workaround in the reproduction at #7084, i.e.

val backlog =
  if (sys.props("os.name").toLowerCase(java.util.Locale.ROOT).contains("windows"))
    2
  else
    1

Might be enough to solve the problem, but to be clear the backlog isn't a concurrency limit; it's actually a queue limit, so it can also apply to many incoming connections from a single client. If we truly want to enforce a single client connecting to a single server, there are probably smarter ways to do this (i.e. a basic lock).
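
To make the "basic lock" idea concrete, here is one possible sketch, assuming the server accepts connections with a generous backlog and then decides in application code whether a client may proceed. SingleClientGate and Gate are hypothetical names introduced for this example and do not appear in sbt.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical helper, not part of sbt.
object SingleClientGate {

  final class Gate {
    private val busy = new AtomicBoolean(false)

    /** Runs `body` if no other client currently holds the gate and returns
      * its result in Some; returns None without running `body` otherwise.
      */
    def withClient[A](body: => A): Option[A] =
      if (busy.compareAndSet(false, true)) {
        try Some(body)
        finally busy.set(false) // release even if `body` throws
      } else None
  }
}
```

A server loop could wrap the handling of each accepted socket in withClient and close the socket straight away when the result is None, so a second client is rejected explicitly rather than by relying on the size of the OS queue.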

eed3si9n · 2022-12-08T18:11:42Z

Might be enough to solve the problem, but to be clear the backlog isn't a concurrency limit

Yea. You're right about that. If it fixes this issue, I'd be happy to bump the number up to 50 or whatever the default is.

mdedetrich · 2022-12-14T09:15:49Z

I did a bit more work on this: I set up Windows Server 2022 (which is what's used in the windows-latest runner) in a VM on my desktop to see if I could replicate the error, without success. For whatever reason, it seems to be very specific to the runner and not to Windows.

ptrdom · 2023-09-10T09:42:02Z

@eed3si9n Should we finally merge this PR? As I understand, it won't hurt and might actually fix the issue.

eed3si9n · 2023-09-10T15:59:37Z

@ptrdom As far as I remember I don't think this fixed anything, and I actually am not sure if it's harmless to allow multiple concurrent connections to the server since typically we're dealing with one scripted client. Suppose there's an issue of Windows client failing to close the client instantly or something, increasing the concurrency may delay the issue, but ultimately it will fail after some time?

ptrdom · 2023-09-11T07:34:28Z

I have a feeling that the issue with the slow Windows client is that it is just a smidge too slow to close the connection and free up the queue, and a subsequent connection request then fails with no retries. My scripted test that fails has multiple lines of sbt commands, I have a feeling if I just concatenated them into one it would start passing, because it always fails further down the line, and I can see it failing just after set commands, which I guess are relatively fast and connection is short-lived.

Could we at least make the backlog - and maybe port ranges - configurable so that users have options to play around with?
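
For the "fails with no retries" observation, a client-side retry could look roughly like the sketch below. It assumes a fixed delay between attempts; ConnectRetry and its parameters are hypothetical and do not describe how the scripted client is actually implemented.

```scala
import java.net.{InetSocketAddress, Socket}
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Hypothetical helper, not part of sbt.
object ConnectRetry {

  /** Tries to connect up to `attempts` times, sleeping `delayMs` between
    * tries. Returns the connected socket, or the last failure.
    */
  @tailrec
  def connect(
      address: InetSocketAddress,
      attempts: Int,
      delayMs: Long,
      timeoutMs: Int = 1000
  ): Try[Socket] = {
    val socket = new Socket()
    Try { socket.connect(address, timeoutMs); socket } match {
      case ok @ Success(_) => ok
      case Failure(e) =>
        Try(socket.close()) // a failed socket cannot be reused
        if (attempts <= 1) Failure(e)
        else {
          Thread.sleep(delayMs)
          connect(address, attempts - 1, delayMs, timeoutMs)
        }
    }
  }
}
```

A retry of this kind would only paper over a queue that is briefly full; it would not help if the previous connection is never released.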

mdedetrich · 2023-09-11T07:39:41Z

I wanted to chime in and say that I agree with @eed3si9n: while I definitely want this to get through, I have an aversion to merging the PR until we are really sure that we know what the root cause is.

Could we at least make the backlog - and maybe port ranges - configurable so that users have options to play around with?

Making the backlog configurable seems entirely appropriate to me. Regarding port ranges, personally I would just change the code so that it generates random ports after 65536 (or w/e the number is); another, arguably better, solution would be to use new ServerSocket(0).getLocalPort() to get a free port (which should handle any OS correctly). The reasoning is that I don't think port range generation should be a configurable knob, i.e. it should either work or not on every platform, and adding such a config means removing it later is problematic.

There is also the point that @eed3si9n raised before regarding concurrency. While I don't think it's a problem because of what I said earlier, if someone configures the backlog to be greater than 1 we may want to verify that we only have a single client connecting by using a lock, but this may be excessive.

My scripted test that fails has multiple lines of sbt commands, I have a feeling if I just concatenated them into one it would start passing, because it always fails further down the line, and I can see it failing just after set commands, which I guess are relatively fast and connection is short-lived.

Nice work! This probably explains why I had trouble replicating it as an sbt scripted test: I was trying to minimize the issue, but if this is the core issue then the attempt at minimization would have avoided this problem entirely.

ptrdom · 2023-09-11T08:31:40Z

Regarding port ranges, personally I would just change the code so that it generates random ports after 65536 (or w/e the number is); another, arguably better, solution would be to use new ServerSocket(0).getLocalPort() to get a free port (which should handle any OS correctly). The reasoning is that I don't think port range generation should be a configurable knob, i.e. it should either work or not on every platform, and adding such a config means removing it later is problematic.

Maybe we could make the default implementation do new ServerSocket(0), but still have configurable ranges available as an alternative? Because I do have anecdotal experience of new ServerSocket(0) not working correctly in certain edge cases - very vague memories, I do not have proof available now. Would be neat to make the implementation as flexible as possible.
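
For reference, the new ServerSocket(0) approach mentioned above can be sketched as follows. The FreePort object is a hypothetical wrapper written for this example; only ServerSocket and getLocalPort come from the JDK.

```scala
import java.net.ServerSocket

// Hypothetical helper, not part of sbt.
object FreePort {

  /** Asks the OS for a currently free port by binding to port 0,
    * reading the assigned port, and releasing the socket again.
    */
  def find(): Int = {
    val socket = new ServerSocket(0)
    try socket.getLocalPort
    finally socket.close()
  }
}
```

One caveat with this pattern is that the port is only known to be free at the moment of the call: another process can claim it between close() and the later bind, so keeping the ServerSocket open and handing it to the server avoids that window.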

mdedetrich · 2023-09-11T08:46:39Z

but still have configurable ranges available as an alternative? Because I do have anecdotal experience of new ServerSocket(0) not working correctly in certain edge cases

This is news to me, I haven't ever experienced such cases but yeah I would just defer to what @eed3si9n suggests

ptrdom · 2023-09-12T15:15:02Z

I started to have doubts about my assumptions, so I added sbt as git submodule to my failing project so I can easily test changes. I will try re-running different changes multiple times, until it hopefully can be confirmed what is up.
Weird thing I can observe now with testing clamped port range is that client creation keeps getting executed twice in quick succession - https://github.com/ptrdom/scalajs-esbuild/actions/runs/6161244852/job/16720046629?pr=12.
You can see double log lines like this all over the scripted execution:

Tue, 12 Sep 2023 15:13:33 GMT
[info] creating client on port [8032]
Tue, 12 Sep 2023 15:13:33 GMT
[info] creating client on port [8032]

ptrdom · 2023-09-12T15:20:42Z

And here is the error in question - https://github.com/ptrdom/scalajs-esbuild/actions/runs/6161244852/job/16720046629?pr=12#step:10:490. But somehow sbt managed to recover from it and continue execution instantly.

ptrdom · 2023-09-12T16:11:05Z

@ptrdom As far as I remember I don't think this fixed anything, and I actually am not sure if it's harmless to allow multiple concurrent connections to the server since typically we're dealing with one scripted client. Suppose there's an issue of Windows client failing to close the client instantly or something, increasing the concurrency may delay the issue, but ultimately it will fail after some time?

@eed3si9n Just remembered - backlog is not about concurrent established connections, but concurrent incoming connections, if I understand the docs correctly. This means that these double invocations in quick succession can definitely trip the backlog of 1 and it is not a concurrency mechanism that replaces locking.

eed3si9n · 2023-09-12T19:21:05Z

@ptrdom sort of, yes and no.

https://docs.oracle.com/javase/8/docs/api/java/net/ServerSocket.html#ServerSocket-int-int-

The maximum queue length for incoming connection indications (a request to connect) is set to the backlog parameter. If a connection indication arrives when the queue is full, the connection is refused.

I don't know if using analogy would help or be more confusing, but we're trying to make a restaurant with one seating, currently instead of counting the number of customers, we have a small door enough to fit exactly one person. Windows keeps failing because for some reason two people are trying to get through the small door. Making 50 doors, and then subsequently telling second customers to go away might be a more technical approach, but I don't know if that would necessarily help the Windows situation.

ptrdom · 2023-09-13T08:39:15Z

@eed3si9n Exactly, very good analogy!

I did some testing with these changes - 1.9.x...ptrdom:sbt:scripted-fix - on my failing project - https://github.com/ptrdom/scalajs-esbuild/actions/runs/6163002707/job/16744233698 - and I can confidently say that increasing backlog helped - before it used to fail very consistently, now it never fails. If you need more proof I can think of ways to provide it, but I am very sure that backlog of 1 is the problem. I am guessing it is simply tripping up slow CI machines.

One interesting thing I observed is that clamping the port range to 8000-9000 did not prevent connection errors from happening, but execution recovered and continued past the error, whereas with high 10000+ ports the result was basically a deadlock: a connection error would be thrown and the CI instance would be locked up for hours.

I suggest we re-target this PR to current main branch (1.9.x ?) and merge it. Is that okay?

eed3si9n · 2023-09-13T13:39:36Z

I suggest we re-target this PR to current main branch (1.9.x ?) and merge it. Is that okay?

sounds good

ptrdom · 2023-09-13T13:48:25Z

@mdedetrich Could you re-target this PR to 1.9.x?

eed3si9n · 2023-09-13T15:08:11Z

Looks like 1.8.x and 1.9.x were close enough to switch automatically.

mdedetrich · 2023-09-13T15:57:01Z

@eed3si9n Thanks for merging this! One thing that I have a concern about is that I don't think the IPCSpec test that was added in this PR is actually testing the use case properly, i.e. it doesn't test the regression on Windows Server if the backlog is set to 1.

I'm relying on my memory, but I think it's a leftover from when I was trying to make a replicating test, and I can't remember if I succeeded or not.

ptrdom · 2023-09-13T18:13:24Z

True, that test is not really testing what it says it tests. We probably should just remove it.

mdedetrich · 2023-09-14T07:06:05Z

@ptrdom So at least on my end, I don't think this change actually solved the underlying issue (see sbt/sbt-github-actions#163 and https://github.com/sbt/sbt-github-actions/actions/runs/6181923346/job/16780712199?pr=163)

ptrdom · 2023-09-14T07:25:21Z

@mdedetrich I see your scripted tests are running on sbt 1.5.5 - https://github.com/sbt/sbt-github-actions/actions/runs/6181923346/job/16780712199?pr=163#step:8:188, this is where it is set https://github.com/sbt/sbt-github-actions/blob/0bf2ef000df2d2d275b002d79617ed1083cadb18/build.sbt#L54.

mdedetrich · 2023-09-14T07:26:05Z

Ah right, always forget to set that. Gimme a sec

mdedetrich · 2023-09-14T08:28:42Z

@mdedetrich I see your scripted tests are running on sbt 1.5.5 - https://github.com/sbt/sbt-github-actions/actions/runs/6181923346/job/16780712199?pr=163#step:8:188, this is where it is set https://github.com/sbt/sbt-github-actions/blob/0bf2ef000df2d2d275b002d79617ed1083cadb18/build.sbt#L54.

Yes this was indeed it, but also need to set the sbt version in project/build.properties in the various sbt scripted tests

ptrdom · 2023-09-14T08:33:48Z

Yes this was indeed it, but also need to set the sbt version in project/build.properties in the various sbt scripted tests

That should not be necessary. You can check the project setup here - https://github.com/ptrdom/scalajs-esbuild, there is only one place where I am setting the sbt version. AFAIK, setting pluginCrossBuild / sbtVersion only makes sense if you are cross-building between sbt 0.x and 1.x.

mdedetrich · 2023-09-14T08:36:45Z

AFAIK, setting pluginCrossBuild / sbtVersion only makes sense if you are cross-building between sbt 0.x and 1.x.

So without setting pluginCrossBuild / sbtVersion in build.sbt and only setting the sbt version in project/build.properties for the various scripted tests (along with the sbt version of the project itself, i.e. root build.sbt) it still fails. I think the issue is that you need to do both: pluginCrossBuild / sbtVersion is needed so that the host sbt is running on 1.9.5, and project/build.properties is needed for the various scripted tests.

Let me diagnose a bit more

mdedetrich · 2023-09-14T10:46:04Z

@ptrdom So I did a lot of testing and for some reason I am forced to set pluginCrossBuild / sbtVersion to 1.9.5. I did forget to put a build.properties in project/project/build.properties (we have a nested sbt project there) but this alone didn't entirely solve the problem.

Interestingly the CI is only failing on graalvm (see https://github.com/sbt/sbt-github-actions/actions/runs/6184038952/job/16786950966) but this may just be a case of the non graalvm github action passing in the rare chance it does.
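
The two version settings discussed in this exchange would look roughly like the fragment below. It is a sketch of the configuration as described in the thread, assuming a plugin project with scripted tests, and is not copied from the sbt-github-actions build.

```scala
// build.sbt of the plugin project: the sbt version the plugin is built
// against and that the scripted tests are launched with, per the thread.
pluginCrossBuild / sbtVersion := "1.9.5"

// project/build.properties inside each scripted test directory is a
// properties file rather than Scala, so it is shown here as a comment:
//   sbt.version=1.9.5
```

Whether the second setting is strictly required is the open question in the comments above, so it may be worth testing each one separately.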

ptrdom · 2023-10-27T15:29:20Z

I think a side effect of this change is this error - https://github.com/ptrdom/scalajs-esbuild/actions/runs/6668934885/job/18125613525?pr=43#step:9:544. It happens more rarely, but does happen. Probably what happens is that sbt is a bit slow to shut down before the next test starts. Would be neat to find a reliable way to investigate this, somehow emulating the behavior of a slow CI machine.

ptrdom · 2024-02-27T19:40:09Z

I think a side effect of this change is this error - https://github.com/ptrdom/scalajs-esbuild/actions/runs/6668934885/job/18125613525?pr=43#step:9:544. It happens more rarely, but does happen. Probably what happens is that sbt is a bit slow to shut down before the next test starts. Would be neat to find a reliable way to investigate this, somehow emulating the behavior of a slow CI machine.

Since I just had it happen again and the logs of my previously reported job have expired, I will post the error instead:

[info] sbt thinks that server is already booting because of this exception:
Error:  sbt.internal.ServerAlreadyBootingException: java.io.IOException: Could not create lock for \\.\pipe\sbt-load7969190881241306384_lock, error 1336
Error:  	at sbt.internal.BootServerSocket.newSocket(BootServerSocket.java:357)
Error:  	at sbt.internal.BootServerSocket.<init>(BootServerSocket.java:296)
Error:  	at sbt.xMain$.getSocketOrExit(Main.scala:152)
Error:  	at sbt.xMain$.bootServerSocket$lzycompute$1(Main.scala:78)
Error:  	at sbt.xMain$.bootServerSocket$1(Main.scala:78)
Error:  	at sbt.xMain$.withStreams$1(Main.scala:86)
Error:  	at sbt.xMain$.run(Main.scala:123)
Error:  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Error:  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
Error:  	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Error:  	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
Error:  	at sbt.internal.XMainConfiguration.run(XMainConfiguration.java:59)
Error:  	at sbt.xMain.run(Main.scala:47)
Error:  	at xsbt.boot.Launch$.$anonfun$run$1(Launch.scala:149)
Error:  	at xsbt.boot.Launch$.withContextLoader(Launch.scala:176)
Error:  	at xsbt.boot.Launch$.run(Launch.scala:149)
Error:  	at xsbt.boot.Launch$.$anonfun$apply$1(Launch.scala:44)
Error:  	at xsbt.boot.Launch$.launch(Launch.scala:159)
Error:  	at xsbt.boot.Launch$.apply(Launch.scala:44)
Error:  	at xsbt.boot.Launch$.apply(Launch.scala:21)
Error:  	at xsbt.boot.Boot$.runImpl(Boot.scala:78)
Error:  	at xsbt.boot.Boot$.run(Boot.scala:73)
Error:  	at xsbt.boot.Boot$.main(Boot.scala:21)
Error:  	at xsbt.boot.Boot.main(Boot.scala)
Error:  Caused by: java.io.IOException: Could not create lock for \\.\pipe\sbt-load7969190881241306384_lock, error 1336
Error:  	at org.scalasbt.ipcsocket.Win32NamedPipeServerSocket.<init>(Win32NamedPipeServerSocket.java:129)
Error:  	at org.scalasbt.ipcsocket.Win32NamedPipeServerSocket.<init>(Win32NamedPipeServerSocket.java:48)
Error:  	at sbt.internal.BootServerSocket.newSocket(BootServerSocket.java:351)
Error:  	... 23 more

ptrdom · 2024-02-27T19:42:38Z

And I should probably post all this on a new issue rather than an already merged PR 😅

Set socket backlog to default size of 50

6611ccf

mdedetrich force-pushed the increase-socket-backlog-for-server-client branch from 6c9a29b to 6611ccf Compare December 8, 2022 10:21

mdedetrich mentioned this pull request Dec 8, 2022

Use better method to find free port #7083

Draft

healchow approved these changes Dec 29, 2022

eed3si9n changed the base branch from 1.8.x to 1.9.x September 13, 2023 15:06

eed3si9n approved these changes Sep 13, 2023

eed3si9n merged commit f7025ef into sbt:1.9.x Sep 13, 2023

eed3si9n modified the milestones: 1.10.0, 1.9.5 Sep 13, 2023

mdedetrich deleted the increase-socket-backlog-for-server-client branch September 13, 2023 15:54

This was referenced Sep 13, 2023

Remove invalid test #7378

Merged

Update sbt and add windows back to CI sbt/sbt-github-actions#163

Merged
