socket-connect for CCL causing subsequent "Bad File Descriptor during Read" on large files · Issue #103 · usocket/usocket · GitHub

socket-connect for CCL causing subsequent "Bad File Descriptor during Read" on large files #103


Open
gendl opened this issue Apr 26, 2023 · 15 comments

gendl commented Apr 26, 2023

Hi,

After upgrading from 0.8.3 to 0.8.6, I started getting errors on CCL in my test suite, at a point where it makes several http client calls (with the zaserve client or drakma) and then immediately opens and tries to read a large file.

The problem is isolated to function socket-connect from backend/clozure.lisp.

First, download and place the huge-boxes-sequence.data file in a known location (or see below for a way to run from a container launched from a script in a clone of the Gendl repository, which avoids the need to download this file manually).

Here is where you can download the huge-boxes-sequence file I've been testing with.

Now replicate as follows:

(load-quicklisp) ;; if needed; this works in the container described below. 
;; otherwise use whatever is your standard way to bootstrap quicklisp

(ql:quickload :drakma)

(defun replicate ()
  (make-web-request) (make-web-request) (make-web-request) (make-web-request)
  ;; (sleep 1)  ;; if this sleep is put here, the error does not happen.
  (read-large-file)
  nil)

(defun make-web-request ()
 (drakma:http-request "http://google.com"))

(defun read-large-file ()
 (with-open-file (in ".../huge-boxes-sequence.data")
   (read in)))

Then call

(replicate)

If your host does not have ipv6 connectivity, you will likely get the "Bad File Descriptor during Read" error.

I think it may be leaving a dangling openmcl socket when it tries to call openmcl-socket:make-socket and fails when trying with ipv6 (as it now does explicitly all the time).

As noted in the code above, the error only happens if a large file read is attempted immediately after the http client call. It only happens on hosts where the resolver will give an ipv6 address for e.g. "google.com" but there is no ipv6 connectivity to that host. See below for a quick way to test whether this is the case in your setup.

I wonder if there is something the CCL socket code is not cleaning up when a call to openmcl-socket:make-socket is tried and fails with an error. Or does it have to do with the :deadline, :nodelay, or :connect-timeout arguments?

Anyway, I can look into this further, but it's late here now, so I thought I'd post the simple example in case anyone cares to try replicating.

Thanks,

Dave

Author
gendl commented Apr 26, 2023

Note that to replicate successfully, you probably need to not have an active ipv6 net connection. The resolver for e.g. google.com will return at least one ipv6 address, but then if you try to open an active socket on that address but you don't have a net interface to support ipv6, the make-socket will fail and that's when you can replicate the bug.

@binghe binghe self-assigned this Apr 28, 2023
easye commented Apr 28, 2023

@gendl I can't seem to replicate, but not sure about two points:

  1. What is a large file? Is 1 MiB enough?

  2. I don't seem to get a "bad file descriptor during read", but rather I get a

Unexpected end of file on #<BASIC-FILE-CHARACTER-INPUT-STREAM ("/home/mevenson/tmp/xx"/4 UTF-8) #x30200233F7DD>, near position 1048576

For a 1 MiB file of zeros created via dd if=/dev/zero of=~/tmp/xx bs=1M count=1, this makes sense as an error.

Author
gendl commented Apr 28, 2023

@easye Here is where you can download the actual file I've been testing with (Edited the Issue Description above to include this info also)

Did you disable ipv6 support in your networking before trying to replicate?

easye commented Apr 30, 2023

@easye Here is where you can download the actual file I've been testing with.

A more correct link, for downloading straight to a file, is via

wget "https://gitlab.common-lisp.net/gendl/gendl/-/raw/release/1598/regression/data/huge-boxes-sequence.data?inline=false" -O huge-boxes-sequence.data

https://gitlab.common-lisp.net/gendl/gendl/-/raw/release/1598/regression/data/huge-boxes-sequence.data?inline=false

Did you disable ipv6 support in your networking before trying to replicate?

easye commented Apr 30, 2023

Did you disable ipv6 support in your networking before trying to replicate?

No. Does this have to be done at the link level, or is removing something from *features* enough?

Author
gendl commented Apr 30, 2023

@easye ipv6 would have to be disabled at the link level. It needs to be present in *features* but not working at the network level so that the ipv6 connection will be attempted and will fail, thus triggering the bug. What it's meant to do in that case is "fall back" to ipv4, which it does - and we want to retain that behavior. What we need to fix is the apparent side-effect from attempting the ipv6 connection, which appears to result in the "Bad File Descriptor" error downstream.
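For readers unfamiliar with the fallback being described, here is a minimal sketch of the "try ipv6, then fall back to ipv4" pattern on CCL. This is not the code in backend/clozure.lisp; the function name connect-with-fallback is hypothetical, and it uses only the openmcl-socket:resolve-address and openmcl-socket:make-socket calls that appear elsewhere in this thread.

```lisp
;; CCL only. A sketch, not usocket's implementation.
(defun connect-with-fallback (host port)
  "Hypothetical helper: try an ipv6 connection first, then ipv4.
Returns a connected socket, or NIL if both attempts fail."
  (dolist (family '(:internet6 :internet))
    (handler-case
        (let ((remote (openmcl-socket:resolve-address
                       :host host
                       :port port
                       :socket-type :stream
                       :address-family family)))
          ;; RETURN exits the DOLIST with the first socket that connects.
          (return (openmcl-socket:make-socket
                   :type :stream
                   :address-family family
                   :remote-address remote)))
      (error (e)
        ;; Report the failure and move on to the next address family.
        (format t "~&~S attempt to ~A failed: ~A~%" family host e)))))
```

On a host matching the conditions in this issue, one would expect the :internet6 attempt to print a failure and the :internet attempt to return a socket, which the caller should close when done.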

Author
gendl commented Apr 30, 2023

To be more precise - ipv6 doesn't necessarily have to be disabled globally in your whole OS - it's enough that the particular hostname you are doing the drakma call to meets two conditions: 1. It has an active AAAA record such that a dns lookup with :address-family :internet6 will return an ipv6 address, and 2. Trying to connect to said ipv6 address fails (with a "no route to host" or similar, but I'm not sure whether the reason for failure matters).

easye commented May 2, 2023

So please don't pull out your hair trying to replicate this..

Standing down from trying to replicate. @gendl find me on IRC to chat about next steps?

Author
gendl commented May 2, 2023

I am now replicating reliably on any Linux CCL 1.12 I try, including WSL Ubuntu and Docker. To try it on Docker you can try the prebuilt Gendl/CCL image from dockerhub. In order to run the same version of the prebuilt Gendl image please follow these steps:

  1. git clone git@gitlab.common-lisp.net:gendl/gendl
  2. cd gendl
  3. git checkout release/1598
  4. cd docker
  5. ./run

This should start Swank listening on port 5200, to which you can attach from an emacs with a reasonably recent Slime loaded by doing M-x slime-connect [RET] and specify localhost with port 5200.

Note that the ./run script will bind-mount the gendl/ directory (i.e. the parent of the docker directory where that run script lives) to /home/gendl-user/gendl/ in the container, so you can put files in there if you want to access them from the container.

Edit: Actually you can access the huge-boxes-sequence test data directly if you cloned the gendl repo - in the container, it will show up automatically as /home/gendl-user/gendl/regression/data/huge-boxes-sequence.data.

So the read-large-file function can be defined as:

(defun read-large-file ()
 (with-open-file (in "/home/gendl-user/gendl/regression/data/huge-boxes-sequence.data")
   (read in)))

Then load the code above (adjusting the path in the defun read-large-file if needed to match where you save the huge-boxes-sequence file) and run (replicate), and you should see the error.

I cannot replicate on macOS, but maybe that's because ipv6 actually works there. I will experiment and report here. @easye I'll look for you on IRC also.

easye commented May 2, 2023

I am now replicating reliably on any Linux CCL 1.12 I try, including WSL Ubuntu and Docker.

I cannot replicate under a Linux vultr host. I'll triage the exact details, but once I do, I will chase @gendl down on IRC.

Author
gendl commented May 2, 2023

I have established a quick way to predict whether this error will replicate on your platform. Basically the error will happen if your host or network is not configured for ipv6, yet the resolver returns ipv6 addresses if you ask explicitly for them. Here is how to determine if this is the case on your testing platform:

;; Ask the resolver explicitly for an ipv6 address.
(setq remote (openmcl-socket:resolve-address :host "google.com"
                                             :port 80
                                             :socket-type :stream
                                             :address-family :internet6))

Assuming the above returns an ipv6 address, now do:

;; Try to open an active ipv6 stream socket to that address.
(openmcl-socket:make-socket :type :stream
                            :address-family :internet6
                            :remote-address remote)

If that returns a socket without errors, then your setup is configured for ipv6 and has ipv6 connectivity. If it signals an error instead, your setup lacks ipv6 connectivity, and you will likely be able to replicate the error described in this issue.
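The two steps can also be wrapped into one check. The following is a sketch only; the function name ipv6-status and its keyword return values are made up for illustration, and it relies on nothing beyond the two openmcl-socket calls shown above plus the standard close.

```lisp
;; CCL only. Hypothetical helper combining the resolve and connect steps.
(defun ipv6-status (host port)
  "Return :no-ipv6-address, :ipv6-works, or :ipv6-unreachable for HOST."
  (let ((remote (ignore-errors
                 (openmcl-socket:resolve-address
                  :host host
                  :port port
                  :socket-type :stream
                  :address-family :internet6))))
    (if (null remote)
        ;; The resolver gave no ipv6 address, so no ipv6 connect is attempted.
        :no-ipv6-address
        (handler-case
            (let ((socket (openmcl-socket:make-socket
                           :type :stream
                           :address-family :internet6
                           :remote-address remote)))
              ;; Close the probe socket so the check itself leaks nothing.
              (close socket)
              :ipv6-works)
          (error ()
            ;; Resolved but not connectable: the condition described above.
            :ipv6-unreachable)))))
```

For example, (ipv6-status "google.com" 80) returning :ipv6-unreachable would indicate a setup where the error is likely to replicate.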

Author
gendl commented May 2, 2023

Note that if you have cloned the gendl repository, you can also just run the pipeline tests directly from a shell with


cd gendl
./docker/pipeline-tests

easye commented May 4, 2023

I have (finally) managed to replicate the "Bad File Descriptor during Read" issue on native Linux without reference to gendl.

I am now rummaging through the SLIME inspector at the point where the condition is signalled after having read 491520 of 5259205 bytes. More information when I have it…


For completeness, the "bad" patch in usocket-0.8.5 was commit 353b781.

"bad" only because it causes the error to surface: it could well be a problem somehow in ccl as far as I can tell at this point.

Author
gendl commented May 18, 2023

@easye could you describe how you managed to replicate? Maybe @binghe could have a go at replicating (I only presume he might want to try that since he assigned himself here)? It might be telling that doing e.g. a (sleep 0.5) after the http client call and before the big file read, makes it so the problem doesn't show up...
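To make the effect of the delay easier to probe, here is a sketch of a variant of replicate that takes the sleep duration as a parameter and reports the outcome instead of entering the debugger. The name replicate-with-delay is hypothetical, and it assumes make-web-request and read-large-file are already defined as in the issue description.

```lisp
;; Assumes MAKE-WEB-REQUEST and READ-LARGE-FILE from the issue description
;; are loaded. REPLICATE-WITH-DELAY is a hypothetical name.
(defun replicate-with-delay (&optional (delay 0))
  "Make four web requests, sleep DELAY seconds, then read the large file.
Returns :ok if the read succeeds and :failed if it signals an error."
  (dotimes (i 4)
    (make-web-request))
  (when (plusp delay)
    (sleep delay))
  (handler-case
      (progn (read-large-file) :ok)
    (error (e)
      ;; Print the condition so the failure mode is visible at the REPL.
      (format t "~&Read failed after a ~A second delay: ~A~%" delay e)
      :failed)))
```

Calling (replicate-with-delay 0) and then (replicate-with-delay 0.5) on an affected host would show whether the delay alone changes the result.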

easye commented May 19, 2023 via email
