8000 stage1: enable host IPC namespace by alban · Pull Request #3787 · rkt/rkt · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content
This repository was archived by the owner on Feb 24, 2020. It is now read-only.

stage1: enable host IPC namespace #3787

Merged
merged 6 commits into from
Sep 12, 2017
Merged

stage1: enable host IPC namespace #3787

merged 6 commits into from
Sep 12, 2017

Conversation

alban
Copy link
Member
@alban alban commented Sep 6, 2017

rkt normally creates a new IPC namespace for the pod. In order to stay in the host IPC namespace, a new option is added:

      --ipc=[auto|private|parent]

This can be tested with:

$ ls -l /proc/self/ns/ipc
$ sudo rkt run --insecure-options=image kinvolk.io/aci/busybox --exec /bin/sh -- -c 'ls -l /proc/self/ns/ipc'
$ sudo rkt run --insecure-options=image --ipc=parent kinvolk.io/aci/busybox --exec /bin/sh -- -c 'ls -l /proc/self/ns/ipc'
$ sudo rkt run --insecure-options=image --ipc=private kinvolk.io/aci/busybox --exec /bin/sh -- -c 'ls -l /proc/self/ns/ipc'

We can see that the ipc namespace id is the same as the host when using --ipc=parent but different otherwise.

In the kvm flavor, using the host IPC namespace is not allowed and rkt will error out.

The implementation relies on the systemd-nspawn new environment variable SYSTEMD_NSPAWN_SHARE_NS_IPC=true added in systemd v232, see systemd/systemd#4023.

Since the default "coreos" stage1 currently uses systemd v231, I needed to upgrade to a more recent systemd version, following the "Update coreos flavor stage1" documentation.

Although systemd-v234 is already released, I am only upgrading from v231 to v233 to avoid a regression bug when the 4 year old v204 is used on the host. v204 is still used on SemaphoreCI. The symptoms were:

Press ^] three times within 1s to kill container.
Failed to allocate scope: No such method 'StartTransientUnit'

This is due to a change in systemd-nspawn-v234: see: https://github.com/systemd/systemd/pull/6261/files#diff-a8646f412b928a4679f2ec702b55c12bR338. With this changes, it now calls 'StartTransientUnit' synchronously. Before, it was called asynchronously without checking for errors. This 'StartTransientUnit' method was added in systemd-v205 (via systemd/systemd@c2756a684011) so it is not yet available in SemaphoreCI.

I contacted SemaphoreCI and they hopefully will have a new testing platform with Ubuntu 16 in the next few weeks. Once that platform is available, we could upgrade stage1 again without interruption in the tests. I don't think we care about support for systemd v204 if not for the tests.

This PR also includes a regression fix that was caused by upgrading systemd in stage1. We ensure that the new systemd option MountAPIVFS stays disabled by disabling ProtectKernelTunables.

Finally, it includes a compilation fix in the kvm flavor with Linux < 4.11 and with gcc >= 7.

Fixes #3291

/cc @bcg62 @iaguis @lucab


TODO

@alban alban requested review from lucab and iaguis September 6, 2017 12:01
@@ -468,6 +469,10 @@ func getArgsEnv(p *stage1commontypes.Pod, flavor string, canMachinedRegister boo
env = append(env, "SYSTEMD_LOG_LEVEL=err") // silence log_warning too
}

if hostIPC {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer this to be called parentIPC

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I will update this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@lucab
Copy link
Member
lucab commented Sep 6, 2017

Looks quite in line with my initial investigation from #3291.
I guess we may want to split the IPCMode validation/parsing code and the actual flag setting, as the latter varies in compatibility & default across flavors. I'm fine with leaving stage1-fly our of this PR for the moment and leaving a followup ticket, if you prefer.

Although systemd-v234 is already released, I am only upgrading from v231 to v233 to avoid a regression bug when the 4 year old v204 is used on the host.

I don't want to put too much focus on legacy systems, but should we detect <205 and skip registration on that? I seem to understand the misbehavior is on systemd-nspawn registration. This should allow to sync to latest.

the journal seems broken

more on this?

@alban
Copy link
Member Author
alban commented Sep 6, 2017

the journal seems broken

more on this?

journalctl -M still works fine but journalctl -m _MACHINE_ID= as documented in commands.md does not work anymore. I see that inside the pod, the journal files are in /run/log/journal/$id instead of /var/log/journal/$id where it used to be. It breaks the systemd-nspawn option --link-journal=try-guest that rkt is using. Logs are not persisted after the pod is stopped. I am trying to find out what changed...

@alban
Copy link
Member Author
alban commented Sep 8, 2017

The journal issue is fixed, I added a commit about it ("stage1: add systemd-journal-flush.service") with an explanation in the commit message. It works for me both on the "coreos" and the "host" flavor.

Btw, giving --ipc in the fly flavor gives a warning at the moment:

stage0: warning: --ipc option is not supported by stage1

I'll file a follow-up ticket and continue with the other TODO items.

@alban alban changed the title [WIP] stage1: enable host IPC namespace stage1: enable host IPC namespace Sep 8, 2017
Copy link
Member
@iaguis iaguis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments.

@@ -1,5 +1,7 @@
lib64/tmpfiles.d/journal-nocow.conf
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? And if we do, can we add it for the host flavor and ARM too?

The docs in the file say it only matters for btrfs filesystems and it improves performance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, that's a left-over from an experiment. I'll remove it from this PR.

0,
},
{
"--ipc=supercalifragilisticexpialidocious",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol

default:
log.Fatalf("unknown value for --ipc parameter: %v", p.IPCMode)
}
if parentIPC && flavor == "kvm" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do this for the "fly" flavor? That is, if we pass --ipc=private

Right now we just display a warning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we won't get here because stage0 will prevent us from passing --ipc= before, maybe it's good to keep it for extra safety?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iaguis "fly" is implemented in the stage1_fly directory and it does not link with stage1/init/init.go.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, what I mean is if we should do something similar in stage1_fly (that is, explicitly refuse ipc=private)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh... could it be left for the follow-up issue #3789 ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good!

@glevand
Copy link
Contributor
glevand commented Sep 11, 2017

I tested the coreos flavor with these three patches on arm64:

  • stage1: upgrade to systemd-v233 from CoreOS 1478.0.0
  • stage1: add systemd-journal-flush.service
  • stage1: ensure systemd parameter MountAPIVFS is disabled

I found I needed theses changes to the `systemd.manifest': https://gist.github.com/glevand/2e9e92568e3b5fa29f431c475b984b70

Also, it would be good to keep the manifest files sorted. I made a small script I planed to submit to do that here: https://github.com/glevand/coreos--rkt/blob/for-merge-sort/scripts/sort-stage1-manifests.sh.

Symptoms:

kernel/built-in.o: In function `update_wall_time':
(.text+0x5acc6): undefined reference to `____ilog2_NaN'

This commit includes the workaround patch from the upstream Linux
kernel. Ideally, we would switch to a newer Linux version for the kvm
flavor. But this is quicker this way.

See:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72785
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=474c90156c8dcc2fa815e6716cc9394d7930cb9c
systemd-v234 is already released. However I am only upgrading from
v231 to v233 to avoid a regression bug when the 4 year old v204 is used
on the host. v204 is still used on SemaphoreCI. The symptoms were:

> Press ^] three times within 1s to kill container.
> Failed to allocate scope: No such method 'StartTransientUnit'

This is due to a change in systemd-nspawn-v234: see:
https://github.com/systemd/systemd/pull/6261/files#diff-a8646f412b928a4679f2ec702b55c12bR338

It now calls 'StartTransientUnit' synchronously. Before, it was called
asynchronously without checking for errors. This 'StartTransientUnit'
method was added in systemd-v205 (via
systemd/systemd@c2756a684011) so it is not yet
available in SemaphoreCI.
systemd-v233 introduced a change in journald: the logs will not be
written in /var/log/journal until it is asked to. This is done by the
new unit systemd-journal-flush.service but rkt didn't include that unit.

rkt uses the systemd-nspawn option '--link-journal=try-guest', which
means the logs from the pod's /var/log/journal directory will be bind-mounted
to the host. So it is important to use the correct directory to have
"journalctl -M" working properly.

See: systemd/systemd@f78273c

This patch adds this new unit systemd-journal-flush.service in stage1,
both on the host flavor (taking the unit from the host if rkt detects systemd
>= v233) and on the default flavor (taking the unit from the coreos
image).
systemd implicitely set MountAPIVFS to true whenever we have
ProtectKernelTunables=true.

It was causing a regression on test TestVolumeSysfs with systemd >= v233

Symptoms:

> rkt_tests.go:410: didn't receive expected output "/sys/class: mode: Dcrw-rw-rw-": image: using image from local store for image name coreos.com/rkt/stage1-coreos:1.28.1+gitc779c926
>
> [ 4385.848746] rkt-inspect[5]: /sys/class: mode: drwxr-xr-x
rkt normally creates a new IPC namespace for the pod. In order to stay
in the host IPC namespace, a new option is added:

  --ipc=[auto|private|parent]
@alban
Copy link
Member Author
alban commented Sep 12, 2017

@glevand thanks for testing! I updated the PR with your changes for arm64 and used your script to sort the manifest files.

@alban
Copy link
Member Author
alban commented Sep 12, 2017

When testing, TestAttachSmoke failed a couple of times but it is unrelated to this PR since it also failed on master. I filed a new bug to track this: #3793

I am triggering the CI again...

@iaguis
Copy link
Member
iaguis commented Sep 12, 2017

@glevand Thanks a lot for testing!

LGTM

@alban alban merged commit 3e4a290 into rkt:master Sep 12, 2017
iaguis added a commit to kinvolk/rkt that referenced this pull request Oct 12, 2017
This reverts commit 3e4a290, reversing
changes made to b429dc4.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

stage1: implement IPC namespace sharing
4 participants
0