[DO NOT MERGE] daemon.ContainerLogs(): fix --follow #37656

kolyshkin · 2018-08-16T13:27:05Z

This is a continuation of #37576, much simplified and improved, and also easier to review.

This commit tries to kill three birds with one stone.

I. Goroutine leak on docker logs --follow (#37391).

When daemon.ContainerLogs() is called with options.follow=true
(as in "docker logs --follow"), the "loggerutils.followLogs()"
function never returns (even when the logs consumer is gone).
As a result, all the resources associated with it (including
an opened file descriptor for the log file being read, two FDs
for a pipe, and two FDs for inotify watch) are never released.

If this is repeated (such as by running "docker logs --follow"
and pressing Ctrl-C a few times), this results in DoS caused by
either hitting the limit of inotify watches, or the limit of
opened files. The only cure is daemon restart.

Apparently, what happens is:

1. The logs producer (a container) is gone, calling (*LogWatcher).Close()
   for all its readers (daemon/logger/jsonfilelog/jsonfilelog.go:175).
2. WatchClose() is properly handled by a dedicated goroutine in
   followLogs(), cancelling the context.
3. Upon receiving the ctx.Done(), the code in followLogs()
   (daemon/logger/loggerutils/logfile.go#L626-L638) keeps sending
   messages synchronously (which is OK for now).
4. The logs consumer is gone (Ctrl-C is pressed on a terminal running
   "docker logs --follow"). Method (*LogWatcher).Close() is properly
   called (see daemon/logs.go:114). Since it was called before, and
   due to once.Do(), nothing happens (which is kinda good, as
   otherwise it would panic on closing a closed channel).
5. A goroutine (see item 3 above) keeps sending log messages
   synchronously to the logWatcher.Msg channel. Since the
   channel reader is gone, the channel send operation blocks forever,
   and the resource cleanup set up in defer statements at the beginning
   of followLogs() never happens.

II. Premature exit of docker logs --follow (#37630).

docker logs --follow should mimic tail -f, which only exits once
the file it watches is gone. Same here: even if a container is
stopped, docker logs --follow should not exit since the container
can be restarted and continue producing logs. The two ways to
exit "docker logs --follow" should be:

1. Kill it (as in Ctrl-C or kill).
2. Remove the container.

III. Complicated logic of following logs.

This is not an issue per se, but followLogs() is "pretty gnarly"
(C) @cpuguy83. While we're at it, let's try to improve things a bit.

Now onto the fix.

1. It appears that LogWatcher should not be bothered with the
   log producer (i.e. container) being gone, so remove calls to
   (*LogWatcher).Close() from the log drivers, as well as the
   accompanying data structures. This helps to solve issues II and III.
2. To clarify what (*LogWatcher).Close() actually is, rename it
   to ConsumerGone() (note that before this patch it used to mean both
   ProducerGone() (which is eliminated in 1 above) and ConsumerGone()).
   Similarly, WatchClose() is now WatchConsumerGone(), etc.
   This is not required, but improves code readability a bit.
3. followLogs() is modified to:

   - remove the context and the associated goroutine, which is no longer
     needed (helps to fix II and III);
   - remove blocking msg send (fixes I);
   - watch for Chmod event (which, if the file being watched is no more,
     means that the (opened) file is removed), exiting if the log file
     is removed (helps to fix II);
   - exit once ConsumerGone() is received, freeing all the
     resources (fixes I).

4. Some existing test cases are modified due to II being fixed (i.e. now the container
   should be removed if we want docker logs -f to finish by itself).
5. A test case TestLogsFollowGoroutineLeak is added to check if I is fixed.
6. ContainerLogs() and ReadLogs() are modified to honor the "follow" flag
   even if the container is not running.
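
To make the ConsumerGone() / WatchConsumerGone() idea concrete, here is a rough sketch of a watcher whose only "close" signal comes from the consumer. The watcher type, newWatcher, and follow below are hypothetical and much simpler than the real LogWatcher and followLogs(); only the two method names are taken from the description above.

```go
package main

import (
	"fmt"
	"sync"
)

// watcher is a hypothetical, stripped-down stand-in for a log watcher.
type watcher struct {
	Msg          chan string
	consumerGone chan struct{}
	once         sync.Once
}

func newWatcher() *watcher {
	return &watcher{
		Msg:          make(chan string),
		consumerGone: make(chan struct{}),
	}
}

// ConsumerGone signals that the reader went away. sync.Once makes repeated
// calls safe, so the channel is never closed twice.
func (w *watcher) ConsumerGone() {
	w.once.Do(func() { close(w.consumerGone) })
}

// WatchConsumerGone returns a channel that is closed once the consumer is gone.
func (w *watcher) WatchConsumerGone() <-chan struct{} {
	return w.consumerGone
}

// follow delivers lines until they run out or the consumer is gone,
// and returns the number of lines delivered.
func follow(w *watcher, lines []string) int {
	sent := 0
	for _, l := range lines {
		select {
		case w.Msg <- l:
			sent++
		case <-w.WatchConsumerGone():
			return sent
		}
	}
	return sent
}

func main() {
	w := newWatcher()
	result := make(chan int)
	go func() { result <- follow(w, []string{"a", "b", "c"}) }()

	fmt.Println(<-w.Msg) // the consumer reads one line...
	w.ConsumerGone()     // ...then goes away
	w.ConsumerGone()     // a second call is a no-op, not a panic
	fmt.Println("lines delivered:", <-result)
}
```

Because every send is paired with the consumer-gone channel in a select, the follower returns (and can run its deferred cleanup) as soon as the reader disappears.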

thaJeztah · 2018-08-16T13:38:32Z

vendor.conf

@@ -95,7 +95,7 @@ github.com/philhofer/fwd 98c11a7a6ec829d672b03833c3d69a7fae1ca972
 github.com/tinylib/msgp 3b556c64540842d4f82967be066a7f7fffc3adad

 # fsnotify
-github.com/fsnotify/fsnotify v1.4.7
+github.com/fsnotify/fsnotify c9e9bfb647855178ec5f3947c02e6bd47a379eb9 https://github.com/kolyshkin/fsnotify/


Per fsnotify/fsnotify#183 (comment) and fsnotify/fsnotify#245 (comment) the project is looking for new maintainers; perhaps we can assist in that

I'll start with a few PRs :)

codecov · 2018-08-16T14:01:34Z

Codecov Report

❗ No coverage uploaded for pull request base (master@0d9d861).
The diff coverage is 23.33%.

@@            Coverage Diff            @@
##             master   #37656   +/-   ##
=========================================
  Coverage          ?   36.04%           
=========================================
  Files             ?      609           
  Lines             ?    45018           
  Branches          ?        0           
=========================================
  Hits              ?    16225           
  Misses            ?    26565           
  Partials          ?     2228

kolyshkin · 2018-08-17T11:25:32Z

Windows CI failure looks legit:

02:07:42.875 FAIL: check_test.go:107: DockerSuite.TearDownTest
02:07:42.875
02:07:42.875 assertion failed: error is not nil: Error response from daemon: unable to remove filesystem for 8f40ac43ca11be91cd139bf4b75db1164ed0639140757d3c9589b3bdda5e1ba2: remove D:\CI\CI-0f0bc06c5\daemon\containers\8f40ac43ca11be91cd139bf4b75db1164ed0639140757d3c9589b3bdda5e1ba2\8f40ac43ca11be91cd139bf4b75db1164ed0639140757d3c9589b3bdda5e1ba2-json.log: The process cannot access the file because it is being used by another process.: failed to remove 8f40ac43ca11be91cd139bf4b75db1164ed0639140757d3c9589b3bdda5e1ba2

kolyshkin · 2018-08-17T17:27:25Z

docker.py test case hang is also legit; made a couple of fixes to tests: docker/docker-py#2121; will retest.

kolyshkin · 2018-08-17T19:16:25Z

z ci needs to be restarted

anusha-ragunathan · 2018-08-20T15:16:51Z

Merge conflict needs to be resolved

kolyshkin · 2018-08-23T03:56:39Z

Rebased; local log driver patched as well (to remove readers).

This code has many return statements; for some of them the "end logs" or "end stream" message was not printed, giving the impression that this "for" loop never ended. Make sure that "begin logs" is to be followed by "end logs". Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

This is a test case for issue moby#37391. The idea is to start a container that produces lots of output, then run "docker logs --follow" on the above container, then kill "docker logs" and check whether the number of

- daemon goroutines,
- daemon opened file descriptors

are back to whatever they were.

Currently, this test reliably detects the leak.

PS here's what it takes to run the test case against the local daemon:

    for i in busybox busybox:glibc hello-world debian:jessie; do docker pull $i; done
    cd integration
    ln -s ../Dockerfile .
    cd container
    go test -v -run TestLogsFollowGoroutineLeak

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

To include fsnotify/fsnotify#260 (fix for fsnotify/fsnotify#194). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

@cpuguy83

This commit kills three birds with one stone.

I. Goroutine leak on docker logs --follow (moby#37391).

When daemon.ContainerLogs() is called with options.follow=true (as in "docker logs --follow"), the "loggerutils.followLogs()" function never returns (even when the logs consumer is gone). As a result, all the resources associated with it (including an opened file descriptor for the log file being read, two FDs for a pipe, and two FDs for inotify watch) are never released.

If this is repeated (such as by running "docker logs --follow" and pressing Ctrl-C a few times), this results in DoS caused by either hitting the limit of inotify watches, or the limit of opened files. The only cure is daemon restart.

Apparently, what happens is:

1. The logs producer (a container) is gone, calling (*LogWatcher).Close() for all its readers (daemon/logger/jsonfilelog/jsonfilelog.go:175).
2. WatchClose() is properly handled by a dedicated goroutine in followLogs(), cancelling the context.
3. Upon receiving the ctx.Done(), the code in followLogs() (daemon/logger/loggerutils/logfile.go#L626-L638) keeps sending messages _synchronously_ (which is OK for now).
4. The logs consumer is gone (Ctrl-C is pressed on a terminal running "docker logs --follow"). Method (*LogWatcher).Close() is properly called (see daemon/logs.go:114). Since it was called before, and due to once.Do(), nothing happens (which is kinda good, as otherwise it would panic on closing a closed channel).
5. A goroutine (see item 3 above) keeps sending log messages synchronously to the logWatcher.Msg channel. Since the channel reader is gone, the channel send operation blocks forever, and the resource cleanup set up in defer statements at the beginning of followLogs() never happens.

II. Premature exit of docker logs --follow (moby#37630).

docker logs --follow should mimic tail -f, which only exits once the file it watches is gone. Same here: even if a container is stopped, docker logs --follow should not exit since the container can be restarted and continue producing logs. The two ways to exit "docker logs --follow" should be:

1. Kill it (as in Ctrl-C or kill).
2. Remove the container.

III. Complicated logic of following logs.

This is not an issue per se, but followLogs() is "pretty gnarly" (C) @cpuguy83. While we're at it, let's try to improve things a bit.

Now onto the fix.

1. It appears that LogWatcher should not be bothered with the log producer (i.e. container) being gone, so remove calls to `(*LogWatcher).Close()` from the log drivers, as well as the accompanying data structures. This helps to solve issues II and III.
2. To clarify what `(*LogWatcher).Close()` actually is, rename it to `ConsumerGone()` (note that before this patch it used to mean both `ProducerGone()` (which is eliminated in 1 above) and `ConsumerGone()`). Similarly, WatchClose() is now WatchConsumerGone(), etc. This is not required, but improves code readability a bit.
3. followLogs() is modified to

   - remove the context and the associated goroutine, which is no longer needed (fixes II and III);
   - remove blocking msg send (fixes I);
   - watch for Chmod event (which, if the file being watched is no more, means that the (opened) file is removed), exiting if the log file is removed (fixes II);
   - exit once ConsumerGone() is received, freeing all the resources (fixes I).

4. Test cases are modified due to II being fixed (i.e. now the container should be removed if we want "docker logs -f" to finish by itself).

Should fix moby#37391 and moby#37630.

[v2: fix conflicts after local driver merge, patch local driver]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

TestLogsFollowNonStop is a test case for moby#3763. It checks that ContainerLogs(opts.Follow=true):

- won't stop even if the container is stopped;
- keeps reading logs once the container is restarted;
- only stops when the container is removed.

TestLogsFollowStopped checks that ContainerLogs(opts.Follow=true) works fine (i.e. waits for more logs) for a container that is currently stopped.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

This fixes "docker logs --follow" to not exit (but wait for more logs) in case the container is currently not running.

1. ContainerLogs() does not modify the value of "follow" option.
2. ReadLogs() calls followLogs() even if the writer is closed (so that "docker logs --follow" will not exit on stopped container).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

to include docker/docker-py#2121 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

kolyshkin · 2018-08-23T18:28:42Z

rebased; should help CI

kolyshkin · 2018-08-24T00:50:13Z

As much as I like this PR (it is relatively simple and elegant, and removes a lot of code), I am afraid it won't work, for the following reasons:

1. This PR makes a backward-incompatible change. It seems that existing software relies on the current behavior of ContainerLogs(follow=true) finishing once a container is stopped (in other words, #37630, "docker logs -f exits whenever container stops", is not a bug but a feature). For one example of how this is backward-incompatible, see docker/docker-py#2121 ("Fixes to logs --follow").
2. This PR relies on a remove event being sent for an opened file, a feature that fsnotify currently does not support (fsnotify/fsnotify#260, "Do not suppress Chmod on non-existent file"; fsnotify/fsnotify#194, "Missing Remove Event when file handle open before for Linux").
3. The above feature can't possibly work on Windows (as far as I know, one can't remove an opened file in Windows).
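
For context on why an event for a removed-but-still-open file matters, the sketch below shows the underlying behavior using only the Go standard library: after the path is removed, a reader holding the file open can still read it and gets no error on its own. The helper name readAfterRemove and the file name are made up for this illustration, and the behavior shown is what I'd expect on Linux; on Windows the os.Remove call is expected to fail, which is the limitation mentioned above.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// readAfterRemove writes content to a temporary file, opens it, removes it by
// path, and then reads it through the still-open handle. It returns the data
// read and whether the path was reported as gone.
func readAfterRemove(content string) (string, bool, error) {
	dir, err := os.MkdirTemp("", "logs-follow")
	if err != nil {
		return "", false, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "example-json.log")
	if err := os.WriteFile(path, []byte(content), 0o600); err != nil {
		return "", false, err
	}

	f, err := os.Open(path)
	if err != nil {
		return "", false, err
	}
	defer f.Close()

	// Remove the file while it is still open (expected to fail on Windows).
	if err := os.Remove(path); err != nil {
		return "", false, err
	}
	_, statErr := os.Stat(path)

	// The open handle still works; nothing here tells the reader that the
	// file is gone.
	data, err := io.ReadAll(f)
	if err != nil {
		return "", false, err
	}
	return string(data), os.IsNotExist(statErr), nil
}

func main() {
	data, gone, err := readAfterRemove("last log line\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("path gone: %v, still readable: %q\n", gone, data)
}
```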

With that said, I see no other way but to return to #37576 -- which is complicated and ugly, but backward-compatible and solves the initial issue (#37391).

kolyshkin requested a review from vdemeester as a code owner August 16, 2018 13:27

GordonTheTurtle added the status/0-triage label Aug 16, 2018

This was referenced Aug 16, 2018

[do not merge] [test] another attempt to get rid of ProducerGone() #37655

Closed

daemon.ContainerLogs(): fix resource leak on follow #37576

Merged

kolyshkin force-pushed the logs-f-leak-2 branch from e2df134 to 4acbb67 Compare August 16, 2018 13:35

thaJeztah reviewed Aug 16, 2018

kolyshkin force-pushed the logs-f-leak-2 branch from 4acbb67 to b05dacb Compare August 16, 2018 14:00

kolyshkin force-pushed the logs-f-leak-2 branch 2 times, most recently from cb3c07d to 63244e8 Compare August 16, 2018 16:05

kolyshkin mentioned this pull request Aug 17, 2018

Fixes to logs --follow docker/docker-py#2121

Closed

kolyshkin force-pushed the logs-f-leak-2 branch from 63244e8 to 57af395 Compare August 17, 2018 17:31

kolyshkin requested a review from tianon as a code owner August 17, 2018 17:31

kolyshkin force-pushed the logs-f-leak-2 branch from 57af395 to 1bad926 Compare August 23, 2018 03:56

kolyshkin added 7 commits August 23, 2018 11:23

vendor: bump fsnotify

0d483e9

To include fsnotify/fsnotify#260 (fix for fsnotify/fsnotify#194). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

Temporary bump docker-py

865104b

to include docker/docker-py#2121 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

kolyshkin force-pushed the logs-f-leak-2 branch from 1bad926 to 865104b Compare August 23, 2018 18:28

kolyshkin closed this Aug 24, 2018
