Engine support for multiprocess orchestration by lukego · Pull Request #1021 · snabbco/snabb · GitHub
Engine support for multiprocess orchestration #1021


Merged
merged 26 commits into snabbco:master on Jun 23, 2017

Conversation

@lukego (Member) commented Sep 19, 2016

This is a work-in-progress effort to support multiprocess app networks.

The basic idea:

  • Add engine.start() to enable automatically running the current engine configuration in the background. (Like an asynchronous version of engine.main()).
  • Extend engine.configure() to optionally take a table of configurations that will each execute with a dedicated process and CPU core. Example: engine.configure({worker1 = conf1, worker2 = conf2, ...}).
  • Make sure engine.configure() delta-updates work across all processes. That is, the parent process will diff the configuration to start/stop worker processes as needed, and then each worker process will do a local diff to update its own app network (the same logic that already exists today).
  • Implement a sharing and error handling policy. For example all DMA memory is globally available between workers and failure of any process terminates everything.

I would imagine this to be a new API alongside the synchronous/single-process engine.main(). Hopefully this would be suitable for all applications to use in the future. In the short term we may need to invent some new communication mechanisms (e.g. shm conventions) to support use cases where application code is inspecting the app network directly (won't be possible when the apps are running in a child process).

The immediate use case here is to update the NFV application to have a single 100G NIC shared between many worker processes that are each serving a distinct set of VMs. So the NFV application would take many port configuration files (one per process/core) but only one NIC PCI address.
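As a strawman, the NFV case above might drive the proposed API like this (a sketch only; load_ports_config and the file names are hypothetical):

    -- Hypothetical sketch of the proposed multiprocess API.
    local configs = {}
    for i, portfile in ipairs({"ports1.conf", "ports2.conf"}) do
       -- hypothetical helper: build one worker config per ports file
       configs["worker"..i] = load_ports_config(portfile, "01:00.0")
    end
    engine.configure(configs) -- parent diffs config: starts/stops/updates workers
    engine.start()            -- run asynchronously in the background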

Thoughts? cc @wingo @eugeneia @kbara @plajjan @alexandergall @dpino @petebristow and well everybody else too...

Switch to allocating HugeTLB pages as mmaped files on hugetlbfs
instead of POSIX shared memory. The advantage of POSIX shared memory
was that we did not need to rely on having a hugetlbfs filesystem
mounted -- but now we auto-mount one at /var/run/snabb/hugetlbfs.

This design is a bit simpler and should also make it easier for one
Snabb process to map a page that was allocated by a different process.

The main motivation for rewriting memory.c is to make it easier to
maintain. Lua code is more in keeping with the rest of Snabb and it
doesn't require people to be expert C programmers and debate the
merits of POSIX compliance and using GCC extensions and so on. (It
also puts a bit more distance between Snabb and monstrosities like
autoconf which cannot be a bad thing.)
New functions:
    start(options)
    stop()
    status() -> 'unstarted' | 'running' | 'stopped'
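A hypothetical call sequence based on those signatures (c is an app network configuration built as usual with core.config):

    engine.configure(c)                    -- install the app network
    assert(engine.status() == 'unstarted')
    engine.start()                         -- asynchronous engine.main()
    assert(engine.status() == 'running')
    engine.stop()
    assert(engine.status() == 'stopped')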
@lukego lukego added the rfc label Sep 19, 2016
@lukego (Member Author) commented Sep 19, 2016

Just thinking aloud...

Child processes will need an efficient way to check for an app network update and then a simple way to load it. The check could be done by polling a serial number in a shm object like //config/serial that the parent bumps after each call to engine.configure(). The configuration could be loaded by reading the overall configuration from a text file //config/configuration.$serial and picking the relevant bit.
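As a sketch, the child's check-and-load step might look like this (read_counter, load_config_file, and worker_name are hypothetical; the shm paths follow the text above):

    local last = 0
    local function check_for_update ()
       local serial = read_counter("//config/serial")    -- cheap poll
       if serial ~= last then
          local conf = load_config_file("//config/configuration."..serial)
          engine.configure(conf[worker_name])            -- pick the relevant bit
          last = serial
       end
    end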

This may involve (re)introducing some shm path syntax for accessing objects from the parent process. The children may also need to be able to access each others' DMA memory e.g. to make it easy for one app to allocate a DMA ring for a NIC and then another app to attach to the queue by using that memory.

@kbara (Contributor) commented Sep 19, 2016

My first knee-jerk response is "I'll be glad to have a standardized sharing and error handling policy." :-)

@eugeneia (Member) commented Sep 19, 2016

👍

The configuration serial is already stored in engine/configs. The parent PID could be stored in parent-pid so that the parent can be accessed using /{parent-pid}/foo.

The SnabbBot failure seems spurious (we really have to do something about those).

Added an API for creating alias names for shm objects. The aliases are
implemented as symbolic links on the filesystem.
This is a work in progress. The new command is:

  snabb worker NAME PARENTPID CORE

and the effect is to spawn a new Snabb process that simply executes an
app network and reacts to configuration changes.

The worker process looks in a "group/" shm folder of the PARENTPID to
keep track of its current configuration.

This is intended as an internal interface that the engine can use to
parallelize an app network by delegating parts to different child
processes. The processes would be managed directly by the engine and end
users need not be aware of the "snabb worker" interface.
@lukego (Member Author) commented Sep 29, 2016

I have pushed the next step now: a snabb worker ... program.

This is intended as an internal interface that a parent Snabb process can use to spawn a child process that will simply execute an app network on a dedicated core & react to configuration updates. (On balance I think that fork() without exec() is too messy and risks inheriting a lot of state that we don't want the child processes to have.)

The processes cooperate using a shared shm folder called group/. This is created by the parent process and then aliased into each child process (as a symlink).

The shm objects that I anticipate having in the group folder are:

group/myworker/config.lua       # current configuration for myworker
group/myworker/configs.counter  # incremented by parent on new config
group/dma/7e42ca00              # DMA HugeTLBs shared by all processes
group/pci/01:00.0/...           # Objects made available by driver

Here is how this would work in NFV with 100G (ConnectX):

  • Master process initializes the NIC, sets up switching, allocates all descriptor rings, polls counters (non-realtime activities).
  • Worker processes each run a separate app network (serving different VMs) provided by the master process (based on the configuration file).
  • Worker process drivers attach to the NIC via descriptor rings (allocated in shared DMA memory & with addresses provided as shm objects).
  • DMA memory would stay allocated until the whole process tree shuts down (cleaned up by the supervisor of the master process).
  • Supervisor would also automatically disable PCIe bus mastering (DMA) for all PCI devices. This is a safeguard against the descriptor ring memory being reused while still referenced by a NIC.

Whaddayareckon?

Next steps...

  • Add a SIGSEGV handler to automatically map DMA memory from other processes (have prototyped this before);
  • Add the engine API for running multiple app networks in subprocesses.

@lukego (Member Author) commented Sep 30, 2016

(Could be overkill to make the worker into a snabb worker program. I'll try factoring it simply into a core.worker module.)

Having a complete program for the worker process is overkill. This will
be replaced by a core.worker module.
This module adds a simple API for spawning "worker" child processes that
each run an app network on a dedicated core. The app network is provided
and updated by the parent process.
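A hypothetical sketch of the module's surface at this stage (the configure step is illustrative and changed later on the branch):

    local worker = require("core.worker")
    worker.start("w1", 2)  -- spawn worker "w1" with affinity to CPU core 2
    -- The parent then provides and updates w1's app network, e.g. via
    -- the group/ shm folder described earlier.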
@lukego (Member Author) commented Sep 30, 2016

I pushed a refactoring where a simple core.worker module implements both the parent process and the child process code. This is currently sketch quality (compiles but won't run). Currently creating the worker with fork() but will switch this to fork()+exec() with some simple entry point (maybe snsh).

I have a small obstacle to overcome and I'm curious about @eugeneia's and @wingo's points of view because this is a precursor of sorts to the YANG work in #987.

The problem is: How do I serialize a config object to pass it from the parent process to the child process?

Today we already have functions like lib.store_conf() and lib.load_conf() that can serialize Lua objects to text files (pretty-print them and then parse them). The catch is that a config object includes Lua objects that cannot be directly serialized, particularly app classes which are typically tables of functions and closures.

So how should we serialize a config object? One idea would be to define a canonical string name for each app and then to use this to indirectly refer to the class object.

Could also be that there are related challenges, e.g. places where we use app configuration tables that include non-serializable objects, though I cannot immediately think of any.

Just now I have punted on the problem by calling non-existent config.save() and config.load() functions.

Obsoleted by core.worker module.

This reverts commit a85017b.
Resolved conflict in memory.lua which now has a lock on nr_hugepages to
prevent a race when provisioning new ones.
-- Start a worker process with affinity to a specific CPU core.
-- The child will execute an app network when provided with configure().
function start (name, core)
   local pid = S.fork()
A member commented on the diff:

Pedantic: should be assert(S.fork())

@eugeneia (Member) commented:

@lukego Hah, sorry for removing alias like a couple weeks ago. ;-)

I think the YANG effort will require a way to serialize configs as well, so it seems sensible to expect to have config.save and config.load.

I like the code in this PR very much! 👍

lukego added 4 commits October 2, 2016 17:22
Previously there was an implicit dirname() that skipped the last path
component. Removed this restriction to make the subroutine more useful.
HugeTLB memory allocated for DMA is now kept accessible in file-backed
shared memory. This makes it possible for processes to map each other's
memory. The new function shutdown(pid) removes the file backing for a
process that is terminating (in order to free memory for the OS).

More specifically:

Allocating a hugetlb page with address xxxxx creates:
  /var/run/snabb/hugetlb/xxxxx.dma         File backed shared memory
  /var/run/snabb/$pid/group/dma/xxxxx.dma  Symlink to above

and the shutdown(pid) function will follow the symlinks to remove the
file backing. (The reason for symlinks is that these directories are on
separate filesystems, one hugetlbfs and one tmpfs.)
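A minimal sketch of that shutdown(pid) walk (S is ljsyscall; list_dir is a hypothetical directory-listing helper):

    local S = require("syscall")
    function shutdown (pid)
       local dir = "/var/run/snabb/"..pid.."/group/dma"
       for _, name in ipairs(list_dir(dir)) do      -- hypothetical helper
          local target = S.readlink(dir.."/"..name) -- hugetlbfs backing file
          S.unlink(target)                          -- return memory to the OS
          S.unlink(dir.."/"..name)                  -- drop the symlink
       end
    end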
The new function shutdown(pid) will disable PCI bus mastering (i.e. DMA)
for all PCI devices that were in use by the process (and its process
group).

This is intended as a safety mechanism to prevent "dangling" DMA
requests on a device after a Snabb process terminates and the DMA memory
is returned to the kernel.

To keep track of which devices are in use we now create shm files with
names like "group/dma/pci/01:00.0" for each device that we have enabled
bus mastering for. These are the devices that are reverted by shutdown().
Call pci.shutdown() and memory.shutdown() when terminating the main
Snabb process (not its worker children).

The pci.shutdown() call will disable DMA for all PCI devices that were
in use. The memory.shutdown() call will then delete the DMA memory backing
files so that the memory will be returned to the kernel (after all
processes that have mapped the memory terminate).
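The ordering is the point: device DMA stops before any backing memory is freed. As a sketch of the main-process exit path (hook placement assumed):

    -- Main-process teardown (not run in worker children):
    local S = require("syscall")
    pci.shutdown(S.getpid())    -- first disable bus mastering on used devices
    memory.shutdown(S.getpid()) -- then unlink the DMA memory backing files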
@lukego (Member Author) commented Oct 2, 2016

I have pushed some more code. The overall intention is for a group of Snabb processes (the main one + the workers) to be able to automatically share DMA memory between themselves, and to implement clean shutdown semantics where DMA is disabled for all PCI devices before any DMA memory is freed.

Here are the specific invariants that this code aims to create:

  • Each HugeTLB allocated by a running group of Snabb processes is accessible to each process via the shm path group/dma/xxx.dma. (xxx is the physical address of the memory.)
  • DMA memory is never freed while any PCI device has bus mastering (DMA) enabled by a process in the group.
  • DMA memory is always freed during shutdown of a Snabb process group i.e. backing files on hugetlbfs do not outlive the processes.

The next step is to make DMA memory pointers globally valid for all Snabb processes in a group. This will be achieved with a SIGSEGV handler that detects access to unmapped DMA memory and automatically maps it in (if the address belongs to DMA memory allocated by a process in the group). I will use this immediately in the Mellanox driver so that the main process can create descriptor rings and then worker processes can easily access them.
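Conceptually the handler would do something like this (a sketch only; the real handler runs at signal level on the raw fault address, and shm_exists/map_dma_file are hypothetical helpers):

    local huge_page_size = 2 * 1024 * 1024  -- assuming 2MB pages
    local function try_remap (addr)
       local base = addr - (addr % huge_page_size)    -- align down to page
       local name = ("group/dma/%x.dma"):format(base) -- per the invariants above
       if shm_exists(name) then
          map_dma_file(name, base) -- map it in and retry the faulting access
          return true
       end
       return false                -- genuine fault: default SIGSEGV behavior
    end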

Added a SIGSEGV handler that attempts to automatically map addresses
corresponding to DMA memory. The mapping succeeds when the DMA memory
was allocated by any Snabb process in the process group. If the mapping
does not succeed then the standard behavior of SIGSEGV is triggered.

Includes a simple unit test in core.memory to allocate DMA memory, unmap
it, and then access it anyway. Confirms that the signal handler runs the
expected number of times.
@lukego (Member Author) commented Oct 3, 2016

I pushed the SIGSEGV handler now. This adds another invariant:

  • DMA memory allocated by any process is automatically usable by every other process in the same group.

This means that any address returned by memory.dma_alloc() can be shared between processes. You have to be a bit careful about this, e.g. if you pass a packet from one process to another then it may end up on the wrong freelist when you are done with it, and you would need to take care of this e.g. as in #601.

The immediate benefit is that the Mellanox driver will be able to allocate all of the descriptor rings in the parent process and then make the addresses available to worker processes via shm objects.

@lukego (Member Author) commented Oct 3, 2016

Overall I feel like this branch represents a pretty reasonable approach to multiprocessing. The tricky bits we need to review are potential race conditions in the interactions between processes e.g. are configurations created atomically, are notifications (e.g. counter increments) made after all data is available, are all combinations of signals and termination orders handled equivalently, etc.

I suppose we also need a clear description of this new "process group with workers" concept in the manual. This concept has evolved a little during implementation.

lukego added 3 commits October 4, 2016 10:11
The new environment variable SNABB_PROGRAM_LUACODE can be used to force
a new Snabb process to execute a string of Lua code instead of running
the standard program-selection logic.

This is intended to support Snabb processes that want to spin off helper
processes that need to execute specific code and be independent of the
usual command-line syntax (e.g. sensitivity to argv[0] etc).
Worker processes now use execve() to create a totally new process image
instead of using the one inherited by fork(). This is intended to
simplify the relationship between parent and worker i.e. every process
is an independently initialized Snabb instance.

Uses the SNABB_PROGRAM_LUACODE hook to tell the worker process how to
start.
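A hypothetical sketch of that spawn path (S is ljsyscall; argv and the Lua string are illustrative):

    local S = require("syscall")
    local function spawn_worker (luacode)
       local pid = assert(S.fork())
       if pid == 0 then -- child: replace the inherited process image entirely
          local env = { "SNABB_PROGRAM_LUACODE="..luacode }
          S.execve("/proc/self/exe", {"snabb"}, env)
          S.exit(1) -- only reached if execve failed
       end
       return pid
    end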
This is currently very basic because not all of the worker machinery is
working yet.
@lukego (Member Author) commented Oct 4, 2016

I pushed an update so that the worker processes use execve(2) to create an entirely new process image. This way the worker processes are all normal Snabb instances that have been initialized from scratch. This is intended to avoid a whole class of errors from inheriting state with fork().

Added an initial selftest function too.

The main thing lacking now is a working way to serialize a configuration object so that the parent can actually provide configurations to the children.

Overall I feel like the design here is quite reasonable and the code is short, but the exact formulation has quite some rough edges. For example I have somewhat awkwardly broken the shm abstraction in order to operate directly on the underlying directories (e.g. to create symlinks to HugeTLB files that live in a directory that is not addressable by shm paths).

@eugeneia (Member) commented Oct 6, 2016

So this is the simplest way I could think of to “serialize” app classes: lukego/snabb@multiprocess-engine...eugeneia:multiprocess-engine (I feel that I proposed this exact syntax before, but on which occasion?)

It extends core.config to accept app classes as string identifiers in the form "<module>[/<var>]" and updates core.app accordingly. The main caveat is that the programmer has to ensure he only uses string identifiers for classes when using core.worker. We could just stop supporting app classes passed as Lua objects, but that seems harsh.
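For concreteness, a resolver for that identifier form could be as small as this (a sketch, not the code on the linked branch):

    local function resolve_class (spec)
       local module, var = spec:match("^([^/]+)/?(.*)$")
       local class = require(module)
       if var ~= "" then class = class[var] end
       return class
    end
    -- e.g. resolve_class("apps.basic.basic_apps/Tee") returns the Tee class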

Another issue is that this change conflicts with #1019, which uses class.config to reject invalid app arguments during config.app(!). I chose to parse app arguments in config because I felt it was a good idea to fail fast, but this could be done in engine.configure as well.

If I come up with something better I will let it be known. :-)

Fix a regression where the memory module would not retry hugetlb
allocations. pcall() was needed to trap allocation errors instead of
letting them propagate.

The old allocation primitive was written in C and always returned NULL on
error, but the new one is written in Lua and uses assertions.
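The retry loop described here has roughly this shape (a sketch; the function names are illustrative):

    local function dma_alloc_with_retry (size)
       for attempt = 1, 3 do
          local ok, page = pcall(allocate_huge_page, size) -- trap assertions
          if ok then return page end
          provision_huge_page() -- e.g. bump sysctl vm.nr_hugepages and retry
       end
       error("failed to allocate a huge page after retries")
    end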
@lukego (Member Author) commented Nov 16, 2016

How about now, with the fix in 987dd77?

I believe the problematic case was when Snabb needs to ask the kernel to provision a new huge page after an allocation failure. This would tend to trigger if you have never run a Snabb process since boot (or after you run sysctl -w vm.nr_hugepages=0).

@kbara (Contributor) commented Nov 16, 2016

Several hundred test runs later, that's looking a lot like a fix, thank you. What's the basic idea behind retrying these allocations, by the way?

@lukego (Member Author) commented Nov 16, 2016

Good question. This is a DWIM feature to make Snabb easier to get started with. (Could be that it should be disabled for programs that are more concerned about correctness than convenience of installation.)

Just now you can install and run Snabb on a new machine very easily:

git clone https://github.com/snabbco/snabb
make -C snabb
snabb/src/snabb packetblaster -S 60 01:00.0

and if your machine was not already perfectly prepared then you may see some messages like this:

[mounting /var/run/snabb/hugetlbfs]
[memory: Provisioned a huge page: sysctl vm.nr_hugepages 0 -> 1]
[memory: Provisioned a huge page: sysctl vm.nr_hugepages 1 -> 2]
[memory: Provisioned a huge page: sysctl vm.nr_hugepages 2 -> 3]
[memory: Provisioned a huge page: sysctl vm.nr_hugepages 3 -> 4]
[memory: Provisioned a huge page: sysctl vm.nr_hugepages 4 -> 5]

This is Snabb attempting to DWIM instead of exiting with error messages like:

Error: Failed to provision a huge page.

Please take the following actions:

1. Reserve huge pages in your grub.conf:
      hugepages=nnn
   (where nnn is the max number of pages you will need for Snabb.)
2. Create directory /mnt/hugetlbfs
3. Add this entry to /etc/fstab:
      hugetlbfs       /mnt/hugetlbfs  hugetlbfs       defaults        0 0
4. Reboot.

which feels like creating a lot of schlep work for the user.

(You could also say that it is a workaround for the Linux kernel lacking a suitable mechanism for dynamically allocating a hugetlb page, at least that I am aware of.)

@lukego (Member Author) commented Nov 16, 2016

(Could add a DWIM configuration option to the engine to control this behavior. Once the YANG support lands perhaps we can model the engine config that way?)

This is a downstream cherry-pick of this change:
justincormack/ljsyscall#205

Required to prevent a double-close problem in memory.lua.
@lukego (Member Author) commented Nov 26, 2016

@kbara Please pull this branch again to get commit 7adc708, which fixes a file descriptor double-close bug in memory.lua. This can have unpredictable effects - it was seen to asynchronously close vhost-user sockets in the NFV application. See also justincormack/ljsyscall#205.

@lukego (Member Author) commented Nov 26, 2016

This branch depends on #1077. Please merge that, @kbara. I did not pull it here because it is based on a different ancestor (a newer version of master) and pulling it too would make this PR much noisier.

@lukego (Member Author) commented May 17, 2017


-- Run the engine with continuous configuration updates
local current_config
local child_path = "group/config/..name"
A contributor commented on the diff:

@petebristow notes that this looks like the wrong thing (and if it isn't it needs a comment I think): #1133 (review)

WDYT @lukego?

@lukego (Member Author) replied:

ahem! That whole init() function is actually dead code. Removed in 876addc.

That's from an older version where core.worker imposed specific behavior on the worker process i.e. running an app network provided by the parent. These days the worker simply links itself into the process tree and runs the provided Lua code.
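In that newer formulation the call site might look like this (a sketch; the program module name is hypothetical):

    local worker = require("core.worker")
    -- Spawn a child that initializes itself and runs the given Lua code:
    worker.start("w1", [[ require("program.myapp.worker").run() ]])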

@wingo (Contributor) commented May 17, 2017

You will need Igalia#809 at least if this is to work with intel_mp. Do you want to make a separate merge branch that is based here, then merges in master, then PRs back to master?

@wingo (Contributor) commented May 17, 2017

Probably you want the worker.lua part of 3e32819 as well.

@wingo (Contributor) commented May 17, 2017

You might also need 876addc

lukego added a commit to lukego/snabb that referenced this pull request May 22, 2017
@lukego lukego changed the title [RFC] [WIP] Engine support for multiprocess orchestration Engine support for multiprocess orchestration Jun 2, 2017
@lukego lukego added merged and removed rfc labels Jun 2, 2017
@eugeneia eugeneia merged commit 6d7e516 into snabbco:master Jun 23, 2017
dpino pushed a commit to dpino/snabb that referenced this pull request Feb 21, 2018