Engine support for multiprocess orchestration #1021
Conversation
Switch to allocating HugeTLB pages as mmapped files on hugetlbfs instead of POSIX shared memory. The advantage of POSIX shared memory was that we did not need to rely on having a hugetlbfs filesystem mounted -- but now we auto-mount one at /var/run/snabb/hugetlbfs. This design is a bit simpler and should also make it easier for one Snabb process to map a page that was allocated by a different process. The main motivation for rewriting memory.c is to make it easier to maintain. Lua code is more in keeping with the rest of Snabb and it doesn't require people to be expert C programmers and debate the merits of POSIX compliance and using GCC extensions and so on. (It also puts a bit more distance between Snabb and monstrosities like autoconf, which cannot be a bad thing.)
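For illustration, a minimal sketch of the new allocation path using ljsyscall: create a file on the auto-mounted hugetlbfs and mmap() it. The file name, size, and error handling here are placeholders; the real code also has to record physical addresses and handle NUMA placement.

```lua
-- Sketch only: allocate DMA-able memory as a file on hugetlbfs.
local S = require("syscall")

local function map_hugetlb_file (path, size)
   local fd = assert(S.open(path, "creat, rdwr", "rwxu"))
   assert(S.ftruncate(fd, size))          -- reserve one or more huge pages
   local ptr = assert(S.mmap(nil, size, "read, write", "shared", fd, 0))
   fd:close()                             -- the mapping keeps the memory alive
   return ptr
end

-- e.g. map_hugetlb_file("/var/run/snabb/hugetlbfs/example.dma", 2*1024*1024)
```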
New functions: start(options), stop(), status() -> 'unstarted' | 'running' | 'stopped'
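A hypothetical usage sketch of the lifecycle, assuming these functions land on the engine module (conventionally core.app in Snabb) and that options can be empty:

```lua
local engine = require("core.app")

assert(engine.status() == 'unstarted')
engine.start({})     -- begin executing the current configuration in the background
assert(engine.status() == 'running')
engine.stop()
assert(engine.status() == 'stopped')
```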
Just thinking aloud... Child processes will need an efficient way to check for an app network update and then a simple way to load it. The check could be done by polling a serial number in a shm object. This may involve (re)introducing some shm path syntax for accessing objects from the parent process. The children may also need to be able to access each other's DMA memory, e.g. to make it easy for one app to allocate a DMA ring for a NIC and then another app to attach to the queue by using that memory.
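A rough sketch of the polling idea; the shm object name and this use of the counter API are assumptions, not a settled convention:

```lua
-- Children poll a serial number that the parent bumps on every
-- configuration change; a cheap check once per engine breath.
local counter = require("core.counter")

local serial = counter.open("group/config.serial")   -- hypothetical path
local seen   = counter.read(serial)

local function config_changed ()
   local now = counter.read(serial)
   if now == seen then return false end
   seen = now
   return true       -- caller then reloads the app network configuration
end
```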
My first knee-jerk response is "I'll be glad to have a standardized sharing and error handling policy." :-)
👍 The configuration serial is already stored in engine/configs. The parent PID could be stored in a similar shm object. The SnabbBot failure seems spurious (we really have to do something about those).
Added an API for creating alias names for shm objects. The aliases are implemented as symbolic links on the filesystem.
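Hypothetical usage, assuming the new call is shm.alias(name, target) (the function name, signature, and paths here are my assumptions for illustration):

```lua
local shm = require("core.shm")

-- Create "group/dma/0x500000000.dma" as a symbolic link pointing at an
-- existing shm object, so other processes can find it under the group path.
shm.alias("group/dma/0x500000000.dma", "/1234/dma/0x500000000.dma")
```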
This is a work in progress. The new command is snabb worker NAME PARENTPID CORE, and the effect is to spawn a new Snabb process that simply executes an app network and reacts to configuration changes. The worker process looks in a "group/" shm folder of the PARENTPID to keep track of its current configuration. This is intended as an internal interface that the engine can use to parallelize an app network by delegating parts to different child processes. The processes would be managed directly by the engine, and end users need not be aware of the "snabb worker" interface.
I have pushed the next step now: a snabb worker command. This is intended as an internal interface that a parent Snabb process can use to spawn a child process that will simply execute an app network on a dedicated core and react to configuration updates. (On balance I think that fork() without exec() is too messy and risks inheriting a lot of state that we don't want the child processes to have.) The processes cooperate using a shared shm folder called "group/". The shm objects that I anticipate having in the group folder include the configuration and the DMA memory objects discussed below.
Here is how this would work in NFV with 100G (ConnectX):
Whaddayareckon? Next steps...
(Could be overkill to make the worker into a complete program, though.)
Having a complete program for the worker process is overkill. This will be replaced by a core.worker module.
This module adds a simple API for spawning "worker" child processes that each run an app network on a dedicated core. The app network is provided and updated by the parent process.
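A hypothetical usage sketch of core.worker as described here. start(name, core) matches the diff quoted below; the configure() call and its signature are my assumption based on the code comments:

```lua
local worker = require("core.worker")
local config = require("core.config")

worker.start("io1", 2)          -- spawn a worker pinned to CPU core 2

local c = config.new()
-- config.app(c, ...) / config.link(c, ...) as usual

worker.configure("io1", c)      -- hand the worker its app network (assumed API)
```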
I pushed a refactoring where a simple core.worker module replaces the standalone worker program. I have a small obstacle to overcome and I'm curious about @eugeneia's and @wingo's points of view, because this is a precursor of sorts to the YANG work in #987. The problem is: how do I serialize a configuration so that the parent can hand it to a worker? Today we already have functions for storing and loading configurations as Lua expressions, but a config refers to app classes, which are live Lua objects. So how should we serialize a config whose apps are given as classes? Could also be that there are related challenges, e.g. places where we use app configuration tables that include non-serializable objects, though I cannot immediately think of any. Just now I have punted on the problem by calling non-existent serialization functions.
Obsoleted by core.worker module. This reverts commit a85017b.
Resolved conflict in memory.lua which now has a lock on nr_hugepages to prevent a race when provisioning new ones.
src/core/worker.lua
-- Start a worker process with affinity to a specific CPU core.
-- The child will execute an app network when provided with configure().
function start (name, core)
   local pid = S.fork()
Pedantic: should be assert(S.fork())
@lukego Hah, sorry for removing that one. I think the YANG effort will require a way to serialize configs as well, so it seems sensible to expect to have such a facility eventually. I like the code in this PR very much! 👍
Previously there was an implicit dirname() that skipped the last path component. Removed this restriction to make the subroutine more useful.
HugeTLB memory allocated for DMA is now kept accessible in file-backed shared memory. This makes it possible for processes to map each other's memory. The new function shutdown(pid) removes the file backing for a process that is terminating (in order to free memory for the OS). More specifically, allocating a hugetlb page with address xxxxx creates:

/var/run/snabb/hugetlb/xxxxx.dma (file-backed shared memory)
/var/run/snabb/$pid/group/dma/xxxxx.dma (symlink to the above)

and the shutdown(pid) function will follow the symlinks to remove the file backing. (The reason for symlinks is that these directories are on separate filesystems, one hugetlbfs and one tmpfs.)
The new function shutdown(pid) will disable PCI bus mastering (i.e. DMA) for all PCI devices that were in use by the process (and its process group). This is intended as a safety mechanism to prevent "dangling" DMA requests on a device after a Snabb process terminates and the DMA memory is returned to the kernel. To keep track of which devices are in use we now create shm files with names like "group/dma/pci/01:00.0" for each device that we have enabled bus mastering for. These are the devices that are reverted by shutdown().
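A simplified sketch of what shutdown() does under these conventions. shm.children() and pci.set_bus_master() are existing Snabb calls, but this particular composition is my reconstruction, not the actual implementation:

```lua
local shm = require("core.shm")
local pci = require("lib.hardware.pci")

-- Disable bus mastering (DMA) for every device recorded by the group.
local function shutdown (pid)
   for _, dev in ipairs(shm.children("/"..pid.."/group/dma/pci")) do
      pci.set_bus_master(dev, false)   -- stop in-flight DMA before freeing memory
   end
end
```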
Call pci.shutdown() and memory.shutdown() when terminating the main Snabb process (not its worker children). The pci.shutdown() call will disable DMA for all PCI devices that were in use. The memory.shutdown() call will then delete the DMA memory backing files so that the memory is returned to the kernel (after all processes that have mapped the memory terminate).
I have pushed some more code. The overall intention is for a group of Snabb processes (the main one + the workers) to be able to automatically share DMA memory between themselves, and to implement clean shutdown semantics where DMA is disabled for all PCI devices before any DMA memory is freed. Here are the specific invariants that this code aims to create:
The next step is to make DMA memory pointers globally valid for all Snabb processes in a group. This will be achieved with a SIGSEGV handler that detects access to unmapped DMA memory and automatically maps it in (if the address belongs to DMA memory allocated by a process in the group). I will use this immediately in the Mellanox driver so that the main process can create descriptor rings and then worker processes can easily access them.
Added a SIGSEGV handler that attempts to automatically map addresses corresponding to DMA memory. The mapping succeeds when the DMA memory was allocated by any Snabb process in the process group. If the mapping does not succeed then the standard behavior of SIGSEGV is triggered. Includes a simple unit test in core.memory to allocate DMA memory, unmap it, and then access it anyway. Confirms that the signal handler runs the expected number of times.
I pushed the SIGSEGV handler now. This adds another invariant:
This means that any address returned by the DMA allocator is now valid in every Snabb process in the group. The immediate benefit is that the Mellanox driver will be able to allocate all of the descriptor rings in the parent process and then make the addresses available to worker processes via shm objects.
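The handler's decision logic might look roughly like this; dma_backing_file() and huge_page_size are placeholders, and the real handler must of course be invoked from signal context rather than called directly:

```lua
local ffi = require("ffi")
local S = require("syscall")

-- Sketch: given the faulting address, return true if it belonged to group
-- DMA memory and was mapped in; false to let default SIGSEGV behavior run.
local function map_on_fault (addr, dma_backing_file, huge_page_size)
   local path = dma_backing_file(addr)   -- e.g. ".../hugetlb/xxxxx.dma"
   local fd = S.open(path, "rdwr")
   if not fd then return false end       -- not group DMA: genuine segfault
   -- Map at the exact faulting address so existing pointers stay valid.
   local ptr = S.mmap(ffi.cast("void*", addr), huge_page_size,
                      "read, write", "shared, fixed", fd, 0)
   fd:close()
   return ptr ~= nil
end
```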
Overall I feel like this branch represents a pretty reasonable approach to multiprocessing. The tricky bits we need to review are potential race conditions in the interactions between processes, e.g. are configurations created atomically, are notifications (e.g. counter increments) made after all data is available, are all combinations of signals and termination orders handled equivalently, etc. I suppose we also need a clear description of this new "process group with workers" concept in the manual. This concept has evolved a little during implementation.
The new environment variable SNABB_PROGRAM_LUACODE can be used to force a new Snabb process to execute a string of Lua code instead of running the standard program-selection logic. This is intended to support Snabb processes that want to spin off helper processes that need to execute specific code and be independent of the usual command-line syntax (e.g. sensitivity to argv[0] etc).
Worker processes now use execve() to create a totally new process image instead of reusing the one inherited from fork(). This is intended to simplify the relationship between parent and worker, i.e. every process is an independently initialized Snabb instance. Uses the SNABB_PROGRAM_LUACODE hook to tell the worker process how to start.
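Putting the two commits together, the spawn path could look something like this sketch (the executable path and argv are illustrative, not taken from the branch):

```lua
local S = require("syscall")

-- Fork, then replace the child with a freshly initialized Snabb that
-- executes `luacode` via the SNABB_PROGRAM_LUACODE hook described above.
local function spawn_worker (luacode)
   local pid = assert(S.fork())
   if pid == 0 then
      local env = { "SNABB_PROGRAM_LUACODE="..luacode }
      S.execve("/proc/self/exe", {"snabb"}, env)
      S.exit(1)   -- only reached if execve() fails
   end
   return pid
end
```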
This is currently very basic because not all of the worker machinery is working yet.
I pushed an update so that the worker processes use execve(), and added an initial selftest function too. The main thing lacking now is a working way to serialize a configuration object so that the parent can actually provide configurations to the children. Overall I feel like the design here is quite reasonable and the code is short, but the exact formulation has some rough edges. For example, I have somewhat awkwardly broken the shm abstraction in order to operate directly on the underlying directories (e.g. to create symlinks to HugeTLB files that live in a directory that is not addressable by shm paths).
So this is the simplest way I could think of to “serialize” app classes: lukego/snabb@multiprocess-engine...eugeneia:multiprocess-engine (I feel that I proposed this exact syntax before, but on which occasion?) It extends the configuration syntax so that app classes can be referenced by name. Another issue is that this change conflicts with #1019. If I come up with something better I will let it be known. :-)
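The gist of the approach, as I read it: app classes are written out as dotted module paths and resolved with require() on the other side. A hedged sketch (the helper name is illustrative, not from the linked diff):

```lua
-- "apps.basic.basic_apps.Source" -> the Source class from that module.
local function name_to_class (name)
   local module, symbol = name:match("^(.*)%.([^.]*)$")
   return require(module)[symbol]
end

-- A config can then carry apps as plain strings, and the worker
-- reconstructs the live classes after deserializing:
local Source = name_to_class("apps.basic.basic_apps.Source")
```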
Fix a regression where the memory module would not retry hugetlb allocations. pcall() is needed to trap allocation errors instead of letting them propagate. The old allocation primitive was written in C and always returned NULL on error, but the new one is written in Lua and uses assertions.
How about now with the fix in 987dd77? I believe the problematic case was when Snabb needs to ask the kernel to provision a new huge page after an allocation failure. This would tend to trigger if you have never run a Snabb process since boot (or have otherwise released the provisioned huge pages back to the kernel).
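For reference, the retry shape this describes, as a sketch (allocate_hugetlb_chunk and provision_huge_page are placeholder names):

```lua
-- Trap allocation failure with pcall() and ask the kernel to provision
-- another huge page (via /proc/sys/vm/nr_hugepages) before retrying,
-- instead of propagating the error to the user.
local function dma_alloc_with_retry (size, attempts)
   for i = 1, attempts or 3 do
      local ok, ptr = pcall(allocate_hugetlb_chunk, size)
      if ok then return ptr end
      provision_huge_page()
   end
   error("failed to allocate " .. size .. " bytes of DMA memory")
end
```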
Several hundred test runs later, that's looking a lot like a fix, thank you. What's the basic idea behind retrying these allocations, by the way?
Good question. This is a DWIM feature to make Snabb easier to get started with. (Could be that it should be disabled for programs that are more concerned about correctness than convenience of installation.) Just now you can install and run Snabb on a new machine very easily:
and if your machine was not already perfectly prepared then you may see some messages like this:
This is Snabb attempting to DWIM instead of exiting with error messages like:
which feels like creating a lot of schlep work for the user. (You could also say that it is a workaround for the Linux kernel lacking a suitable mechanism for dynamically allocating a hugetlb page, at least that I am aware of.)
(Could add an option to disable this behavior for programs that want strict failure semantics.)
This is a downstream cherry-pick of this change: justincormack/ljsyscall#205. Required to prevent a double-close problem in memory.lua.
@kbara Please pull this branch again to get commit 7adc708, which fixes a file descriptor double-close bug in memory.lua.
I pushed API documentation for the core.worker module now.
src/core/worker.lua
-- Run the engine with continuous configuration updates
local current_config
local child_path = "group/config/..name"
@petebristow notes that this looks like the wrong thing (and if it isn't it needs a comment I think): #1133 (review)
WDYT @lukego?
ahem! That whole init() function is actually dead code. Removed in 876addc.

That's from an older version where core.worker imposed specific behavior on the worker process, i.e. running an app network provided by the parent. These days the worker simply links itself into the process tree and runs the provided Lua code.
You will need Igalia#809 at least if this is to work with intel_mp. Do you want to make a separate merge branch that is based here, then merges in master, then PRs back to master?
Probably you want the worker.lua part of 3e32819 as well.
You might also need 876addc |
Add limit-finding loadtester.
This is a work-in-progress effort to support multiprocess app networks.

The basic idea:

- engine.start() to enable automatically running the current engine configuration in the background. (Like an asynchronous version of engine.main().)
- engine.configure() to optionally take a table of configurations that will each execute with a dedicated process and CPU core. Example: engine.configure({worker1 = conf1, worker2 = conf2, ...}).
- engine.configure() delta-updates work across all processes. That is, the parent process will diff the configuration to start/stop worker processes as needed, and then each worker process will do a local diff to update its own app network (the same logic that already exists today).

I would imagine this to be a new API alongside the synchronous/single-process engine.main(). Hopefully this would be suitable for all applications to use in the future. In the short term we may need to invent some new communication mechanisms (e.g. shm conventions) to support use cases where application code is inspecting the app network directly (won't be possible when the apps are running in a child process).

The immediate use case here is to update the NFV application to have a single 100G NIC shared between many worker processes that are each serving a distinct set of VMs. So the NFV application would take many port configuration files (one per process/core) but only one NIC PCI address.

Thoughts? cc @wingo @eugeneia @kbara @plajjan @alexandergall @dpino @petebristow and well everybody else too...
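To make the proposal concrete, a sketch of how a program might use the API (conf1 and conf2 stand for ordinary core.config app networks; none of this is implemented yet):

```lua
local config = require("core.config")
local engine = require("core.app")

local conf1, conf2 = config.new(), config.new()
-- config.app()/config.link() calls populate each worker's app network.

engine.start()                                        -- asynchronous engine.main()
engine.configure({worker1 = conf1, worker2 = conf2})  -- one process+core each

-- Later: a changed table is diffed; workers are started/stopped as needed
-- and each surviving worker applies a local diff to its own app network.
engine.configure({worker1 = conf1})                   -- stops worker2
```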