Engine support for multiprocess orchestration #1021
Conversation
Switch to allocating HugeTLB pages as mmapped files on hugetlbfs instead of POSIX shared memory. The advantage of POSIX shared memory was that we did not need to rely on having a hugetlbfs filesystem mounted -- but now we auto-mount one at /var/run/snabb/hugetlbfs. This design is a bit simpler and should also make it easier for one Snabb process to map a page that was allocated by a different process. The main motivation for rewriting memory.c is to make it easier to maintain. Lua code is more in keeping with the rest of Snabb and it doesn't require people to be expert C programmers and debate the merits of POSIX compliance and using GCC extensions and so on. (It also puts a bit more distance between Snabb and monstrosities like autoconf, which cannot be a bad thing.)
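For illustration, a minimal sketch of the new allocation path using ljsyscall: create a file on the auto-mounted hugetlbfs and mmap() it. The file name, size, and error handling here are placeholders; the real code also has to record physical addresses and handle NUMA placement.

```lua
-- Sketch only: allocate DMA-able memory as a file on hugetlbfs.
local S = require("syscall")

local function map_hugetlb_file (path, size)
   local fd = assert(S.open(path, "creat, rdwr", "rwxu"))
   assert(S.ftruncate(fd, size))          -- reserve one or more huge pages
   local ptr = assert(S.mmap(nil, size, "read, write", "shared", fd, 0))
   fd:close()                             -- the mapping keeps the memory alive
   return ptr
end

-- e.g. map_hugetlb_file("/var/run/snabb/hugetlbfs/example.dma", 2*1024*1024)
```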
New functions: start(options), stop(), status() -> 'unstarted' | 'running' | 'stopped'
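A hypothetical usage sketch of the lifecycle, assuming these functions land on the engine module (conventionally core.app in Snabb) and that options can be empty:

```lua
local engine = require("core.app")

assert(engine.status() == 'unstarted')
engine.start({})     -- begin executing the current configuration in the background
assert(engine.status() == 'running')
engine.stop()
assert(engine.status() == 'stopped')
```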
Just thinking aloud... Child processes will need an efficient way to check for an app network update and then a simple way to load it. The check could be done by polling a serial number in a shm object. This may involve (re)introducing some shm path syntax for accessing objects from the parent process. The children may also need to be able to access each other's DMA memory, e.g. to make it easy for one app to allocate a DMA ring for a NIC and then another app to attach to the queue by using that memory.
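A rough sketch of the polling idea; the shm object name and this use of the counter API are assumptions, not a settled convention:

```lua
-- Children poll a serial number that the parent bumps on every
-- configuration change; a cheap check once per engine breath.
local counter = require("core.counter")

local serial = counter.open("group/config.serial")   -- hypothetical path
local seen   = counter.read(serial)

local function config_changed ()
   local now = counter.read(serial)
   if now == seen then return false end
   seen = now
   return true       -- caller then reloads the app network configuration
end
```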
My first knee-jerk response is "I'll be glad to have a standardized sharing and error handling policy." :-)
👍 The configuration serial is already stored in engine/configs. The parent PID could be stored in a similar shm object. The SnabbBot failure seems spurious (we really have to do something about those).
Added an API for creating alias names for shm objects. The aliases are implemented as symbolic links on the filesystem.
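Hypothetical usage, assuming the new call is shm.alias(name, target) (the function name, signature, and paths here are my assumptions for illustration):

```lua
local shm = require("core.shm")

-- Create "group/dma/0x500000000.dma" as a symbolic link pointing at an
-- existing shm object, so other processes can find it under the group path.
shm.alias("group/dma/0x500000000.dma", "/1234/dma/0x500000000.dma")
```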
This is a work in progress. The new command is snabb worker NAME PARENTPID CORE, and the effect is to spawn a new Snabb process that simply executes an app network and reacts to configuration changes. The worker process looks in a "group/" shm folder of the PARENTPID to keep track of its current configuration. This is intended as an internal interface that the engine can use to parallelize an app network by delegating parts to different child processes. The processes would be managed directly by the engine, and end users need not be aware of the "snabb worker" interface.
I have pushed the next step now: a snabb worker command. This is intended as an internal interface that a parent Snabb process can use to spawn a child process that will simply execute an app network on a dedicated core and react to configuration updates. (On balance I think that fork() without exec() is too messy and risks inheriting a lot of state that we don't want the child processes to have.) The processes cooperate using a shared shm folder called "group/". The shm objects that I anticipate having in the group folder include the configuration and the DMA memory objects discussed below.
Here is how this would work in NFV with 100G (ConnectX):
Whaddayareckon? Next steps...
(Could be overkill to make the worker into a complete program, though.)
Having a complete program for the worker process is overkill. This will be replaced by a core.worker module.
This module adds a simple API for spawning "worker" child processes that each run an app network on a dedicated core. The app network is provided and updated by the parent process.
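A hypothetical usage sketch of core.worker as described here. start(name, core) matches the diff quoted below; the configure() call and its signature are my assumption based on the code comments:

```lua
local worker = require("core.worker")
local config = require("core.config")

worker.start("io1", 2)          -- spawn a worker pinned to CPU core 2

local c = config.new()
-- config.app(c, ...) / config.link(c, ...) as usual

worker.configure("io1", c)      -- hand the worker its app network (assumed API)
```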
I pushed a refactoring where a simple core.worker module replaces the standalone worker program. I have a small obstacle to overcome and I'm curious about @eugeneia's and @wingo's points of view, because this is a precursor of sorts to the YANG work in #987. The problem is: how do I serialize a configuration so that the parent can hand it to a worker? Today we already have functions for storing and loading configurations as Lua expressions, but a config refers to app classes, which are live Lua objects. So how should we serialize a config whose apps are given as classes? Could also be that there are related challenges, e.g. places where we use app configuration tables that include non-serializable objects, though I cannot immediately think of any. Just now I have punted on the problem by calling non-existent serialization functions.
Obsoleted by core.worker module. This reverts commit a85017b.
Resolved conflict in memory.lua which now has a lock on nr_hugepages to prevent a race when provisioning new ones.
src/core/worker.lua
-- Start a worker process with affinity to a specific CPU core.
-- The child will execute an app network when provided with configure().
function start (name, core)
   local pid = S.fork()
Pedantic: should be assert(S.fork())
@lukego Hah, sorry for removing that one. I think the YANG effort will require a way to serialize configs as well, so it seems sensible to expect to have such a facility eventually. I like the code in this PR very much! 👍
Previously there was an implicit dirname() that skipped the last path component. Removed this restriction to make the subroutine more useful.
HugeTLB memory allocated for DMA is now kept accessible in file-backed shared memory. This makes it possible for processes to map each other's memory. The new function shutdown(pid) removes the file backing for a process that is terminating (in order to free memory for the OS). More specifically, allocating a hugetlb page with address xxxxx creates:

/var/run/snabb/hugetlb/xxxxx.dma (file-backed shared memory)
/var/run/snabb/$pid/group/dma/xxxxx.dma (symlink to the above)

and the shutdown(pid) function will follow the symlinks to remove the file backing. (The reason for symlinks is that these directories are on separate filesystems, one hugetlbfs and one tmpfs.)
The new function shutdown(pid) will disable PCI bus mastering (i.e. DMA) for all PCI devices that were in use by the process (and its process group). This is intended as a safety mechanism to prevent "dangling" DMA requests on a device after a Snabb process terminates and the DMA memory is returned to the kernel. To keep track of which devices are in use we now create shm files with names like "group/dma/pci/01:00.0" for each device that we have enabled bus mastering for. These are the devices that are reverted by shutdown().
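A simplified sketch of what shutdown() does under these conventions. shm.children() and pci.set_bus_master() are existing Snabb calls, but this particular composition is my reconstruction, not the actual implementation:

```lua
local shm = require("core.shm")
local pci = require("lib.hardware.pci")

-- Disable bus mastering (DMA) for every device recorded by the group.
local function shutdown (pid)
   for _, dev in ipairs(shm.children("/"..pid.."/group/dma/pci")) do
      pci.set_bus_master(dev, false)   -- stop in-flight DMA before freeing memory
   end
end
```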
Call pci.shutdown() and memory.shutdown() when terminating the main Snabb process (not its worker children). The pci.shutdown() call will disable DMA for all PCI devices that were in use. The memory.shutdown() call will then delete the DMA memory backing files so that the memory is returned to the kernel (after all processes that have mapped the memory terminate).
I have pushed some more code. The overall intention is for a group of Snabb processes (the main one + the workers) to be able to automatically share DMA memory between themselves, and to implement clean shutdown semantics where DMA is disabled for all PCI devices before any DMA memory is freed. Here are the specific invariants that this code aims to create:
The next step is to make DMA memory pointers globally valid for all Snabb processes in a group. This will be achieved with a SIGSEGV handler that detects access to unmapped DMA memory and automatically maps it in (if the address belongs to DMA memory allocated by a process in the group). I will use this immediately in the Mellanox driver so that the main process can create descriptor rings and then worker processes can easily access them.
Added a SIGSEGV handler that attempts to automatically map addresses corresponding to DMA memory. The mapping succeeds when the DMA memory was allocated by any Snabb process in the process group. If the mapping does not succeed then the standard behavior of SIGSEGV is triggered. Includes a simple unit test in core.memory to allocate DMA memory, unmap it, and then access it anyway. Confirms that the signal handler runs the expected number of times.
I pushed the SIGSEGV handler now. This adds another invariant:
This means that any address returned by the DMA allocator is now valid in every Snabb process in the group. The immediate benefit is that the Mellanox driver will be able to allocate all of the descriptor rings in the parent process and then make the addresses available to worker processes via shm objects.
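The handler's decision logic might look roughly like this; dma_backing_file() and huge_page_size are placeholders, and the real handler must of course be invoked from signal context rather than called directly:

```lua
local ffi = require("ffi")
local S = require("syscall")

-- Sketch: given the faulting address, return true if it belonged to group
-- DMA memory and was mapped in; false to let default SIGSEGV behavior run.
local function map_on_fault (addr, dma_backing_file, huge_page_size)
   local path = dma_backing_file(addr)   -- e.g. ".../hugetlb/xxxxx.dma"
   local fd = S.open(path, "rdwr")
   if not fd then return false end       -- not group DMA: genuine segfault
   -- Map at the exact faulting address so existing pointers stay valid.
   local ptr = S.mmap(ffi.cast("void*", addr), huge_page_size,
                      "read, write", "shared, fixed", fd, 0)
   fd:close()
   return ptr ~= nil
end
```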
Overall I feel like this branch represents a pretty reasonable approach to multiprocessing. The tricky bits we need to review are potential race conditions in the interactions between processes, e.g. are configurations created atomically, are notifications (e.g. counter increments) made after all data is available, are all combinations of signals and termination orders handled equivalently, etc. I suppose we also need a clear description of this new "process group with workers" concept in the manual. This concept has evolved a little during implementation.
The new environment variable SNABB_PROGRAM_LUACODE can be used to force a new Snabb process to execute a string of Lua code instead of running the standard program-selection logic. This is intended to support Snabb processes that want to spin off helper processes that need to execute specific code and be independent of the usual command-line syntax (e.g. sensitivity to argv[0] etc).
Worker processes now use execve() to create a totally new process image instead of reusing the one inherited from fork(). This is intended to simplify the relationship between parent and worker, i.e. every process is an independently initialized Snabb instance. Uses the SNABB_PROGRAM_LUACODE hook to tell the worker process how to start.
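Putting the two commits together, the spawn path could look something like this sketch (the executable path and argv are illustrative, not taken from the branch):

```lua
local S = require("syscall")

-- Fork, then replace the child with a freshly initialized Snabb that
-- executes `luacode` via the SNABB_PROGRAM_LUACODE hook described above.
local function spawn_worker (luacode)
   local pid = assert(S.fork())
   if pid == 0 then
      local env = { "SNABB_PROGRAM_LUACODE="..luacode }
      S.execve("/proc/self/exe", {"snabb"}, env)
      S.exit(1)   -- only reached if execve() fails
   end
   return pid
end
```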
This is currently very basic because not all of the worker machinery is working yet.
I pushed an update so that the worker processes use execve(), and added an initial selftest function too. The main thing lacking now is a working way to serialize a configuration object so that the parent can actually provide configurations to the children. Overall I feel like the design here is quite reasonable and the code is short, but the exact formulation has some rough edges. For example, I have somewhat awkwardly broken the shm abstraction in order to operate directly on the underlying directories (e.g. to create symlinks to HugeTLB files that live in a directory that is not addressable by shm paths).
So this is the simplest way I could think of to “serialize” app classes: lukego/snabb@multiprocess-engine...eugeneia:multiprocess-engine (I feel that I proposed this exact syntax before, but on which occasion?) It extends the configuration syntax so that app classes can be referenced by name. Another issue is that this change conflicts with #1019. If I come up with something better I will let it be known. :-)
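The gist of the approach, as I read it: app classes are written out as dotted module paths and resolved with require() on the other side. A hedged sketch (the helper name is illustrative, not from the linked diff):

```lua
-- "apps.basic.basic_apps.Source" -> the Source class from that module.
local function name_to_class (name)
   local module, symbol = name:match("^(.*)%.([^.]*)$")
   return require(module)[symbol]
end

-- A config can then carry apps as plain strings, and the worker
-- reconstructs the live classes after deserializing:
local Source = name_to_class("apps.basic.basic_apps.Source")
```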
Fix a regression where the memory module would not retry hugetlb allocations. pcall() is needed to trap allocation errors instead of letting them propagate. The old allocation primitive was written in C and always returned NULL on error, but the new one is written in Lua and uses assertions.
How about now with the fix in 987dd77? I believe the problematic case was when Snabb needs to ask the kernel to provision a new huge page after an allocation failure. This would tend to trigger if you have never run a Snabb process since boot (or have otherwise released the provisioned huge pages back to the kernel).
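For reference, the retry shape this describes, as a sketch (allocate_hugetlb_chunk and provision_huge_page are placeholder names):

```lua
-- Trap allocation failure with pcall() and ask the kernel to provision
-- another huge page (via /proc/sys/vm/nr_hugepages) before retrying,
-- instead of propagating the error to the user.
local function dma_alloc_with_retry (size, attempts)
   for i = 1, attempts or 3 do
      local ok, ptr = pcall(allocate_hugetlb_chunk, size)
      if ok then return ptr end
      provision_huge_page()
   end
   error("failed to allocate " .. size .. " bytes of DMA memory")
end
```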
Several hundred test runs later, that's looking a lot like a fix, thank you. What's the basic idea behind retrying these allocations, by the way?
Good question. This is a DWIM feature to make Snabb easier to get started with. (Could be that it should be disabled for programs that are more concerned about correctness than convenience of installation.) Just now you can install and run Snabb on a new machine very easily:
and if your machine was not already perfectly prepared then you may see some messages like this:
This is Snabb attempting to DWIM instead of exiting with error messages like:
which feels like creating a lot of schlep work for the user. (You could also say that it is a workaround for the Linux kernel lacking a suitable mechanism for dynamically allocating a hugetlb page, at least that I am aware of.)
(Could add an option to disable this behavior for programs that want strict failure semantics.)
This is a downstream cherry-pick of this change: justincormack/ljsyscall#205. Required to prevent a double-close problem in memory.lua.
@kbara Please pull this branch again to get commit 7adc708, which fixes a file descriptor double-close bug in memory.lua.
I pushed API documentation for the core.worker module now.
src/core/worker.lua
-- Run the engine with continuous configuration updates
local current_config
local child_path = "group/config/..name"
@petebristow notes that this looks like the wrong thing (and if it isn't it needs a comment I think): #1133 (review)
WDYT @lukego?
ahem! That whole init() function is actually dead code. Removed in 876addc.

That's from an older version where core.worker imposed specific behavior on the worker process, i.e. running an app network provided by the parent. These days the worker simply links itself into the process tree and runs the provided Lua code.
You will need Igalia#809 at least if this is to work with intel_mp. Do you want to make a separate merge branch that is based here, then merges in master, then PRs back to master?
Probably you want the worker.lua part of 3e32819 as well.
You might also need 876addc |
Add limit-finding loadtester.
This is a work-in-progress effort to support multiprocess app networks.

The basic idea:

- engine.start() to enable automatically running the current engine configuration in the background. (Like an asynchronous version of engine.main().)
- engine.configure() to optionally take a table of configurations that will each execute with a dedicated process and CPU core. Example: engine.configure({worker1 = conf1, worker2 = conf2, ...}).
- engine.configure() delta-updates work across all processes. That is, the parent process will diff the configuration to start/stop worker processes as needed, and then each worker process will do a local diff to update its own app network (the same logic that already exists today).

I would imagine this to be a new API alongside the synchronous/single-process engine.main(). Hopefully this would be suitable for all applications to use in the future. In the short term we may need to invent some new communication mechanisms (e.g. shm conventions) to support use cases where application code is inspecting the app network directly (won't be possible when the apps are running in a child process).

The immediate use case here is to update the NFV application to have a single 100G NIC shared between many worker processes that are each serving a distinct set of VMs. So the NFV application would take many port configuration files (one per process/core) but only one NIC PCI address.

Thoughts? cc @wingo @eugeneia @kbara @plajjan @alexandergall @dpino @petebristow and well everybody else too...
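To make the proposal concrete, a sketch of how a program might use the API (conf1 and conf2 stand for ordinary core.config app networks; none of this is implemented yet):

```lua
local config = require("core.config")
local engine = require("core.app")

local conf1, conf2 = config.new(), config.new()
-- config.app()/config.link() calls populate each worker's app network.

engine.start()                                        -- asynchronous engine.main()
engine.configure({worker1 = conf1, worker2 = conf2})  -- one process+core each

-- Later: a changed table is diffed; workers are started/stopped as needed
-- and each surviving worker applies a local diff to its own app network.
engine.configure({worker1 = conf1})                   -- stops worker2
```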