
doc: add TRD: Dynamic Process Loading #4338


Open · wants to merge 11 commits into base: master

Conversation

@bradjc (Contributor) commented Feb 12, 2025

Pull Request Overview

This pull request complements #3941 by adding a TRD for the new dynamic process loading architecture.

The main goal of this document is to describe what exists and how the traits are distinct from the implementation.

Testing Strategy

doc

TODO or Help Wanted

n/a

Documentation Updated

  • Updated the relevant files in /docs, or no updates are required.

Formatting

  • Ran make prepush.

@brghena (Contributor) left a comment:

A couple of comments for you. I think this mostly makes sense to me and seems pretty reasonable.

```rust
trait DynamicProcessLoad {
    /// Request kernel to load the newly stored process.
    fn load(&self) -> Result<(), ErrorCode>;
```
Contributor:

It feels strange to me that load() doesn't have some ID or reference or something to signify which application binary is being loaded and tie it to the prior setup/write work.

Contributor Author:

I think it certainly could, and we have discussed that, but having multiple apps in flight at once is not something we have experimented with. We aren't sure what type of ID to use, and we didn't want to add additional complexity that we don't have a good way to test/motivate.

We decided to restrict to a single in-flight new process binary for now.
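As an illustration of that restriction, a minimal sketch of a single-slot guard; the type, fields, and error choices here are hypothetical, not the PR's implementation:

```rust
use core::cell::Cell;
use kernel::ErrorCode;

/// Hypothetical single-slot guard: with only one new process binary in
/// flight at a time, no identifier is needed to tie setup/write/load together.
pub struct SingleSlotLoader {
    in_flight: Cell<bool>,
}

impl SingleSlotLoader {
    pub fn setup(&self, _length: usize) -> Result<(), ErrorCode> {
        if self.in_flight.get() {
            // A store/load session is already active; the caller must wait
            // until it completes or is aborted.
            return Err(ErrorCode::BUSY);
        }
        self.in_flight.set(true);
        Ok(())
    }

    pub fn load(&self) -> Result<(), ErrorCode> {
        if !self.in_flight.get() {
            // No session to load; setup() was never called.
            return Err(ErrorCode::RESERVE);
        }
        // Hand the stored binary to the kernel; the slot is freed when
        // loading completes (success or failure) via callback.
        Ok(())
    }
}
```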

Member:

One thing I don't fully understand is how the "internal state" of this trait relates to the "internal state" of the DynamicBinaryStore trait. If I understand correctly, a call to load here is only valid after DynamicBinaryStoreClient::finalize_done --> Success (and the most-recently-finalized process is what is loaded)?

I don't like that there is implicit state across traits. The problem was hidden when AppLoaderCapsule exposed just the load() method (which internally would do finalize+load) — but once we separate out finalize and load, now you have the problem where UserspaceLoaderProcessA has the "update-at-midnight" policy, and sets things up a few hours early, then at 10pm UserspaceLoaderProcessB comes along and just does a quick update&load of a less-sensitive app. When midnight happens UserspaceLoaderProcessA has no way to load its new process.

If we want to avoid the questions around identifier selection, I think there needs to be one trait that encompasses binary update and loading, and an "app update session" commences with start and doesn't finish until load (or abort) is called.
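A sketch of what that single combined trait might look like; the trait and method names are illustrative, assuming the SubSliceMut-based write from the TRD:

```rust
use kernel::ErrorCode;
use kernel::utilities::leasable_buffer::SubSliceMut;

/// Hypothetical combined trait: one "app update session" runs from
/// `start()` until `load()` or `abort()`, so no state is implicitly
/// shared across two separate traits.
pub trait DynamicProcessUpdate {
    /// Begin a session, reserving storage for `length` bytes.
    fn start(&self, length: usize) -> Result<(), ErrorCode>;

    /// Write a chunk of the new binary within the active session.
    fn write(&self, buffer: SubSliceMut<'static, u8>, offset: usize) -> Result<(), ErrorCode>;

    /// Finish the session by asking the kernel to load the stored binary.
    fn load(&self) -> Result<(), ErrorCode>;

    /// Finish the session without loading, releasing reserved resources.
    fn abort(&self) -> Result<(), ErrorCode>;
}
```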

Member:

That said, I like the current conceptual split across the two traits; I think it makes sense.

Looking at the app id TRD (which you certainly know better than me...), it's clear that Application Identifier is insufficient.

I might imagine an ID defined by the address in storage the app was loaded at, but that's not fully thought out and hits the 'thorny problems' identified here :/

Contributor Author:

We've been trying to do app loading for years now. Is it ok to just have a version that only supports one app loader and one new app in progress? We can always change things.


One reason there was no finalize() is that I see load() as "I'm done with this, it's in your hands now, kernel". The kernel can do things like check credentials before running the process; calling load() is not a guarantee that the app will actually execute. So, if the kernel supports policies about when a process can start running, it is free to enforce them once load() is called.

So why have load() at all? There is still something nice about separating storing a binary from deciding to try to run new apps. load() could be thought of as "load all", just as main.rs does today.


Most boards just use a function called load_processes() that doesn't take any sort of process identifier and that hasn't been an issue.

Member:

> We've been trying to do app loading for years now. Is it ok to just have a version that only supports one app loader and one new app in progress? We can always change things.

Sure, we should just merge into one mega-trait for the time being then, and we can split out the two concepts when we figure out how to do it.


> So, if the kernel supports policies about when a process can start running it is free to enforce them once load is called.
> So why have load() at all? There is still something nice about separating storing a binary from deciding to try to run new apps.

Hmm... maybe we are talking past each other in our use of language here. There are:

  1. Get the data bits for a process into local storage.
  2. Verify a process is eligible to execute on the platform (check signatures, policy, etc)
  3. Set up relevant data structures to actually execute a process (copy data into ram, etc)
  4. Schedule a process to begin executing 

In current Tock, (1) is always done in advance; now we are adding it as "store".
(2), (3), and (4)* happen to be done monolithically by the current process standard implementation during "loading".

I had been viewing (2) as the last step of something which is responsible for "getting a new process physically onto the board"; i.e., my job as the updater is done when there are bits on disk that are eligible to run on the platform. This is finalize(). Only "static" (i.e., permanent storage) resources are consumed at the completion of this step.

I had been viewing (3) as the request for the kernel to allocate space in SRAM for this process; a slot in the processes array; etc. This is when "dynamic" resources are consumed on behalf of the process — critically, these may be scarcer. This is how I had been viewing load().

In contrast, it sounds like your load() is (2+3).

In the "critical app swaps at midnight" scenario in my head, there are enough storage resources on a platform to hold versions A and B of an app, but not enough resources to simultaneously run both versions A and B. To set up the "swap", it's necessary to have B at a state where you know it is able to be run, i.e., (2) is already done.

We haven't needed this distinction before because historically we just do (1-4) at boot for all processes. Once we have DPL, though, I think each of these is a meaningfully separate step.

*[yes, technically a process is created in Yielded with work in its queue which the kernel loop then switches to Running, but I think "not-(4)" would be creating a process in the Stopped state]
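To make the separation concrete, the four steps could be pictured as explicit states; this enum is illustrative only, not proposed API:

```rust
/// Illustrative only: the four steps above as distinct states a process
/// image moves through, rather than one monolithic "load".
enum ProcessImageState {
    /// (1) Bits are in local storage; nothing is known about them yet.
    Stored,
    /// (2) Signatures/policy checked; eligible to run. Only "static"
    /// (permanent storage) resources are consumed.
    Verified,
    /// (3) SRAM, a slot in the processes array, etc. are allocated;
    /// "dynamic" resources are now consumed. Comparable to `Stopped`.
    Loaded,
    /// (4) Scheduled to begin executing.
    Running,
}
```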


> Most boards just use a function called load_processes() that doesn't take any sort of process identifier and that hasn't been an issue.

Sure, because that probably should've been called load_all_processes(); we've never had the capacity* (or use case) to load just one process before now.

*(publicly at least; load_process does exist in kernel::process_loading, but even the async interface only allows "loading all": it kicks off discovering all processes and then loading all of them, and you only get "per-process control" when you're notified that one of the suite it's working on happens to have finished)

Contributor:

I think this issue is already addressed in the doc. §5 talks about how we utilize the SequentialProcessLoaderMachine's capability to load the process and perform the credential check. The current checker implementation uses the ProcessBinary object which we do not create until we need to load it.

For your vision of splitting things, we would need to separate the credential check from the loading, which affects how the check is performed for all processes, whether loaded at boot or dynamically. This feels like a change that can happen in the future.

Currently finalize() checks if we need to write a leading padding app to preserve the linked list, and then writes one if needed. The current implementation of finalize() acts as an "I am done storing what needs to be stored" checkpoint.

Additionally, the "critical app swaps at midnight" feature cannot currently work. We do not have a mechanism to disable apps dynamically. We want this feature, but it is currently impossible, so worrying about this issue is unnecessary. I hope to add that mechanism eventually.

The dynamic process loader implements a new method load_new_process_binary() which does not discover all processes. It only discovers the process binary located at the slice of flash allocated during setup() and operates on it. If anything, I came up with an implementation where the "identifier" of sorts was the starting address and the length of the app for the load mechanism. You can take a look at #4335.

In this implementation, there is an optional argument for the load() method which lets an application request the kernel to load a binary stored at address X with size Y, or it could request to write an application binary with size Y and then load it. The downside to this approach is that we require the external app to know the position of the new binary, and that was not something we were entirely sure we wanted, so we scrapped the idea.

We separated the traits because while we are using the traits in conjunction in this implementation, they are two separate things: storing and loading. Therefore, it makes more sense to split them. If you look at commit 99423b0, they were a single implementation. After that, a conscious decision was made to split them because in commit 287b2d6, you'll see the optional argument for the load() mechanism. We played around with it for a bit (refer to #4335 again). It is not hard to implement how load() should happen, but it is hard to decide what goes in. Maybe we could opt for a policy-based approach as well. As of writing the PR, we did not have a clear idea, so we decided to make the whole process monolithic.

Contributor Author:

We need to split step 2, like so:

  1. Get the data bits for a process into local storage.
  2a. Verify a process is eligible to execute on the platform (check signatures, policy, etc)
  2b. Verify a process is unique.
  3. Set up relevant data structures to actually execute a process (copy data into ram, etc)
  4. Schedule a process to begin executing 

2a could be done in advance, most likely. 2b really must be done in conjunction with 3, because a process may not be unique by the time 3 happens. It might be safer to make finalize() cover (1), and load() cover (2)-(4).

But I think trait separation is something that is non-obvious and harder to do later. I think if either of us were to expand dpl to more fine grained loading, or add a policy for determining if/when to load a new process, we could do it with separated traits. But if others were to take up the torch, it is not obvious we would want a new trait.

What is the minimum DynamicProcessLoad trait you would be comfortable with? Should load() take an AppId? Conceptually that would work and is probably correct. However, it is difficult to know, in general, what an app's AppId actually is, since that is up to the kernel's policy. Should load() take an opaque identifier, maybe a usize, provided by setup()? I think that could work as we could just use the starting flash address.
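The two candidate signatures being weighed here might look like the following sketch; the `AppId` stand-in and method names are hypothetical:

```rust
use kernel::ErrorCode;

// Stand-in for the application identifier concept from the AppID TRD.
type AppId = u32;

trait DynamicProcessLoadAlternatives {
    /// Variant 1: name the process by AppId. Conceptually correct, but the
    /// caller generally cannot know the AppId, since that is up to the
    /// kernel's policy.
    fn load_by_app_id(&self, id: AppId) -> Result<(), ErrorCode>;

    /// Variant 2: an opaque token returned by `setup()`, e.g. the starting
    /// flash address, identifies the in-flight binary.
    fn load_by_token(&self, token: usize) -> Result<(), ErrorCode>;
}
```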

I think adding an argument to load() is too speculative right now. Really, someone needs to think through the delayed-update case and come up with a full design.

If the only way to get this merged is to remove DynamicProcessLoad we'll do it. Do you want the TRD to remove its existence? Should it be there in the architecture, but not in the implementation? What is the shortest path forward?

Member:

I'm going to preface this with "I am honestly unsure of the best (read: most reasonable/realistic) course of action".

What I'm realizing is that many of my concerns about process loading are not unique to dynamic process loading. Unfortunately, trying to think deeply about a TRD and the right design for dynamic loading raised some thoughts on how we do loading generally which aren't really DPL's problem.

I think the ideal scenario would be to draft a TRD which is about process loading generally, and then one on dynamic loading. Actually, I think dynamic loading is revealing some imperfect assumptions which came from static loading. Maybe the ideal scenario is expanding the scope of this TRD to "process loading", where dynamic is one of the options for how it is done.

Now, that's a much bigger lift, and I don't think any of that should be in the way of the working DPL implementation that we now have (indeed, we have a [static] process loading implementation without any design doc currently). I might propose moving #3941 forward ahead of this TRD. We can continue to hash out with the TRD process what an ideal design might do, and tackle some of the thornier questions like "what is the identifier for an app image". When that process is done, then we can go back and update (all) the process load implementation(s).


For the traits specifically, what makes me uncomfortable is the implicit tight coupling — i.e., it would be hard if not impossible to have the two separate traits not be implemented by the same object (because the two objects would have to synchronize the state of the "process loading session"). However, we don't have the identifier we need to cleanly decouple these right now. I think just noting that limitation in the current implementation, i.e. something like

```rust
// Conceptually, these are two separate operations. However, Tock does not currently
// have a robust identifier for app images in storage, so the current implementation
// couples storing a new binary and loading it into one atomic session. See #4338
// for work on enabling separate implementation of these traits.
pub trait DynamicProcessStoreAndLoad: DynamicBinaryStore + DynamicProcessLoad {}
```

and then have SequentialDynamicBinaryStorage implement DynamicProcessStoreAndLoad.
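One way the combined trait could be satisfied without extra code is a blanket impl (a sketch, not something in the PR):

```rust
// Any type that already implements both halves provides the combined trait
// for free; the doc comment above then records why they are currently coupled.
impl<T: DynamicBinaryStore + DynamicProcessLoad> DynamicProcessStoreAndLoad for T {}
```

Note that a blanket impl would also prevent any type from opting out of the combined trait, which may or may not be desirable here.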

Comment on lines 174 to 179
There is a coupling between the `setup()` and `write()` calls. The `setup()`
call allows the implementation to allocate the needed resources to store the
process binary. Because the kernel requires that it reserves the resources
required for the new binary before storing it, this method MUST be called before
calling `write()`. An implementation is responsible for storing individual
chunks of the process binary.
Contributor:

There's also coupling between setup() and load(), right? A new setup() cannot be called until either an abort() or a (successful?) load(), I think.

Contributor:

A new setup() can be called if any of the steps fail. We release resources and make them available for a new process. However, you are right that setup() cannot be called again after write(). It can only be called after one of the following (sketched as a state machine below):

  1. A successful load()
  2. abort()
  3. Some step along the way failed and we reset the dynamic process loader
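A sketch of that lifecycle as a small state machine; state names are hypothetical, transitions as listed above:

```rust
/// Hypothetical states for the dynamic process loader, capturing when
/// setup() may be called again.
#[derive(Clone, Copy, PartialEq)]
enum LoaderState {
    /// No session active: setup() is allowed.
    Idle,
    /// setup() succeeded: resources are reserved and writes may begin.
    Setup,
    /// At least one write() happened: only a successful load(), an
    /// abort(), or an internal failure reset returns us to Idle.
    Writing,
}

fn may_setup(state: LoaderState) -> bool {
    // All three paths in the list above transition back to Idle, the
    // only state permitting a new setup().
    state == LoaderState::Idle
}
```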

Contributor:

Documenting that list would be helpful.

Comment on lines +238 to +249
```
     trait DynamicBinaryStore                trait DynamicProcessLoad
┌────────────────────────────────┐      ┌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┐
│                                │      ╎                                ╎
│                              ──┼─────►╎                                ╎
│ SequentialDynamicBinaryStorage │      ╎ SequentialProcessLoaderMachine ╎
│                                │      ╎                                ╎
└────────────────────────────────┘      └╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┘
     hil::NonvolatileStorage
```
Contributor:

I don't understand what this diagram is trying to show me? What does it mean that trait DynamicBinaryStore is on top of trait DynamicProcessLoad?

Contributor Author:

They are both exposed by SequentialDynamicBinaryStorage, which itself has a reference to SequentialProcessLoaderMachine.
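In other words, roughly this composition; generics and fields are elided, so this is a sketch of the relationship, not the actual definition:

```rust
// Stub standing in for the kernel's process loader machine type.
pub struct SequentialProcessLoaderMachine<'a>(core::marker::PhantomData<&'a ()>);

// Sketch: one object exposes both traits and delegates loading to the
// process loader machine it holds a reference to.
pub struct SequentialDynamicBinaryStorage<'a> {
    /// Used by the DynamicProcessLoad side to service load() requests.
    pub loader: &'a SequentialProcessLoaderMachine<'a>,
    // ... storage state, backed by hil::NonvolatileStorage ...
}

// impl DynamicBinaryStore for SequentialDynamicBinaryStorage<'_> { ... }
// impl DynamicProcessLoad for SequentialDynamicBinaryStorage<'_> { ... }
```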

Comment on lines +280 to +285
Second, `SequentialDynamicBinaryStorage` does not allow the calling capsule to
write a portion of the first eight bytes of the process binary (where the
`total_size` field is located). It must write the entire region, and
`SequentialDynamicBinaryStorage` ensures those first eight bytes are correct
(i.e., they match the size of the process binary and use a valid TBF header
version).
Contributor:

Why can't the calling capsule write a portion of the first eight bytes? Does that somehow stop the implementation from ensuring that the first eight bytes are correct? Couldn't the same check occur either way?

Contributor:

What happens if the first 8 bytes are NOT correct? Is the write rejected?

Contributor Author:

Just a decision to keep the implementation simpler. It certainly could be more flexible, but that additional flexibility doesn't seem worth it in practice.

Contributor:

And yes, if the first 8 bytes are NOT correct, the whole process is aborted in the current implementation.

@brghena (Contributor) commented Mar 5, 2025:

Do you need to call setup() again? Or just a new write? Documenting that would be good.

Contributor:

I think the idea is that you do not want to trust something that wants to manipulate the header, so you revoke all promises. The promise is the space allocated based on what was requested.
Two things are expected from whatever is writing the binary:

  1. Don't lie about your size (total_size should match the size declared during setup)
  2. Don't try to write out of the region promised to you (if an app requests a size of 100 bytes and is assigned an address 0xN, it is illegal for the app to attempt to write data at 0xN+100 or beyond)

In both of these cases, we treat the writer as an offender and end the contract to load a new app.
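Those two rules might be enforced with a check along these lines; parameter names and the choice of error code are made up for illustration:

```rust
use kernel::ErrorCode;

/// Hypothetical per-write check: reject (and end the session) if the writer
/// lies about its size or writes outside the promised region.
fn validate_write(
    setup_len: usize,              // size declared during setup()
    offset: usize,                 // offset of this chunk within the region
    chunk_len: usize,              // length of this chunk
    total_size_field: Option<u32>, // total_size from the TBF header, if this chunk covers it
) -> Result<(), ErrorCode> {
    // Rule 2: the chunk must stay inside the promised region.
    if offset.checked_add(chunk_len).map_or(true, |end| end > setup_len) {
        return Err(ErrorCode::INVAL);
    }
    // Rule 1: the header's total_size must match the size declared to setup().
    if let Some(total) = total_size_field {
        if total as usize != setup_len {
            return Err(ErrorCode::INVAL);
        }
    }
    Ok(())
}
```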

Member:

I can see the logic behind that policy. I do think it becomes important for §4.1 to articulate more directly exactly when a "process loading session" is aborted. It can't be, e.g., "any error", since this is agnostic to the underlying storage, and it's perfectly valid for some naive storage driver that happens to be servicing some other request to return EBUSY, in which case the write just needs to be retried later.

Contributor:

This sounds like something the storage implementation would return the error for via the asynchronous callback. We pass the operation's result with the upcall so the userspace application can retry if it wants.

The write() actions in the untrusted capsule and kernel capsule exist to check if the write operation is valid (checking header integrity, address integrity, whether the buffer is available, etc.). Once this is cleared, we pass the write to the underlying storage mechanism driver (in this case, the NonvolatileToPages driver). Once the write is handed off, we find out about the result via an async callback. Our current implementation of the nonvolatile driver only returns Ok(()), but an implementation could return other forms of errors that are passed back via the upcall. The application can then retry if it wants, and we do not abort anything.

Member:

> The write() actions in the untrusted capsule and kernel capsule exist to check if the write operation is valid (checking header integrity, address integrity, if the buffer is available, etc.).

Are these all synchronous operations? i.e., is it the case that any write that would cause the whole transaction to be cancelled will receive an error back to the Command::write synchronous call? (And do all synchronous errors abort the transaction?)

What I'm trying to get to here is "how does the caller know whether an error from a write command has aborted the transaction"?

@viswajith-g (Contributor) commented Mar 11, 2025:

Yes, the operations I mentioned are synchronous. They are part of the Result returned when an app requests a write operation. Any synchronous error results in an abort because it is usually a violation. The exception is perhaps the buffer being unavailable; I can amend the call to not abort on just ERESERVE.
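A sketch of that amended classification (a hypothetical helper; ERESERVE corresponds to ErrorCode::RESERVE):

```rust
use kernel::ErrorCode;

/// Hypothetical policy: violations abort the session; a temporarily
/// unavailable buffer is retryable and keeps the session alive.
fn is_session_aborting(err: ErrorCode) -> bool {
    match err {
        // Buffer unavailable: the app may retry later; do not abort.
        ErrorCode::RESERVE => false,
        // Anything else (bad header, out-of-bounds write, ...) is treated
        // as a violation that ends the contract to load a new app.
        _ => true,
    }
}
```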

Comment on lines 150 to 157
```rust
/// Initiate storing a new process binary with the specified length.
fn setup(&self, length: usize) -> Result<usize, ErrorCode>;

/// Store a portion of the process binary.
fn write_process_binary_data(&self, buffer: SubSliceMut<'static, u8>, offset: usize) -> Result<(), ErrorCode>;

/// Call to abort the setup/writing process.
fn abort(&self) -> Result<(), ErrorCode>;
```
Member:

I'll expand on this comment later when I have more time; leaving this as a bookmark for the discussion on explicitly aborting vs. explicitly finalizing. I think that adding a finalize instead of an abort is a better interface, as it's easier to reason about (esp. in an intermittent context) and allows more flexible underlying implementations, such as ones that don't flush on every call to write_process_binary_data or ones that want to perform some additional checks before "enabling" an app in flash.

Contributor Author:

I think this was discussed elsewhere, but abort() and finalize() seem complementary. I think adding finalize() helps decouple the two traits and I will add it to the TRD.
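A sketch of the storing trait with both endings, following the excerpt above plus the finalize() agreed on here:

```rust
use kernel::ErrorCode;
use kernel::utilities::leasable_buffer::SubSliceMut;

trait DynamicBinaryStore {
    /// Initiate storing a new process binary with the specified length.
    fn setup(&self, length: usize) -> Result<usize, ErrorCode>;

    /// Store a portion of the process binary.
    fn write_process_binary_data(&self, buffer: SubSliceMut<'static, u8>, offset: usize) -> Result<(), ErrorCode>;

    /// Signal that storing is complete so the implementation can flush and
    /// run final checks: the "I am done storing" checkpoint.
    fn finalize(&self) -> Result<(), ErrorCode>;

    /// Call to abort the setup/writing process.
    fn abort(&self) -> Result<(), ErrorCode>;
}
```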

@ppannuto (Member) left a comment:

Partial review, will read the rest asap

Comment on lines 115 to 116
to be able to store and load a process of the specified size. Success or
failure is indicated via an upcall.
Member:

Does this return any kind of handle or indicator for which process is being setup (and equivalently, which one should be written to by write), or is the design such that an AppLoader process only has one app update in flight at a time?

(Or, is there an assumption that there is only one AppLoader process, and the platform can only have one app update in flight at a time?)

Member:

Reading more, it's more clear that there's one app update in flight from a given userspace process. We do have some capsules/interfaces which are designed to be "owned" by a single process—is that the expectation here? Or can there be multiple AppLoader processes?

Member:

Okay, this came up on the call — it sounds like the answer is in the middle: multiple apps can try to use this capsule, but only one app can have an "active" load going at once.

I think the documentation for setup() needs to make this more explicit, and state that calling the syscall "reserves" (or similar verbiage) the process loading mechanism for a given app load attempt, which is then freed on a call to close/finalize/load(). If a second process tries to start a process load while another is active, it will return EBUSY.

Contributor Author:

The design is that a userspace process can only be loading one app at a time. That seems like a reasonable tradeoff.

The current implementation of the apploader capsule only allows one process to use the apploader capsule. That also seemed like a reasonable current implementation; e.g., Apple gets away with only having one app store app.

However, the syscall design does not prohibit a future implementation from being more flexible (e.g., allowing for the Google Play app store and an Amazon app store). Each app store can still only load one app at a time.

Comment on lines 118 to 121
This stores a portion of the process binary. Write is expected to be called
multiple times to store the entire process binary and there is no assumption
about the order the process binary is written in. Success or failure is
indicated via an upcall.
Member:

Suggested change
This stores a portion of the process binary. Write is expected to be called
multiple times to store the entire process binary and there is no assumption
about the order the process binary is written in. Success or failure is
indicated via an upcall.
This stores a portion of the process binary. `setup()` must have completed
successfully before `write()` can be called (otherwise the interface will
return `ERESERVE`). Write is expected to be called multiple times to
store the entire process binary and there is no assumption about the
order the process binary is written in. Success or failure is indicated via
an upcall.

multiple times to store the entire process binary and there is no assumption
about the order the process binary is written in. Success or failure is
indicated via an upcall.
3. `load()`: This indicates the entire process binary has been written and the
Member:

Should we separate the concepts of close() (or finalize(), which I think is what Leon was getting at on the call) from load()? E.g., I might imagine a system that wants to do some kind of "switch at midnight" policy getting the new app on-board and ready to go before it actually wants to load the new app. [Ideally close()/finalize() would include verifying the signature, etc., probably]

Member:

I'm still convinced that close()-ing / finalize()-ing a process binary that is being written to flash is important (e.g., to maintain some guarantees on the flash contents, such as ensuring that applications are always a well-formed linked list, and the kernel does not attempt to load a partially written application).

However, I'm fine leaving these two concepts entangled for the capsule API: presumably, there are few use-cases where applications are installing other applications into flash, and then not immediately loading them. Furthermore, a simple chip reset or power outage can restart the chip at any time, which would cause that application to be loaded, independent of whether the loader-application has explicitly requested this.

3 System Call API
=================================

The `0x10001` system call interface provides four operations:
Member:

Meta: I think what might make this clearer is changing "Success or failure is indicated via an upcall." for each of these below to something more explicit, i.e.

Setup will always be followed by an upcall:
 - `Success` means that this process now has an active process loading session ready to write data.
 - `EBUSY` means another process is currently using the process loading mechanism.
 - `ESIZE` means there is not sufficient storage available to load the requested process.

^ ideally, this should be exhaustive of the errors setup() can return

Contributor Author:

I think that level of detail should be in the syscall doc. I intentionally avoided doing that.

Member:

Thinking about this more, why is the syscall interface in the TRD at all? I went to double-check myself before posting this, and none of the other TRDs which describe kernel behavior discuss the syscall interface (indeed, only 104 and 106, which are explicitly about syscalls, discuss them).

I don't think there is anything specific to app loading in the differing trust between the Userspace Application and Apploader Capsule above; that's just the normal app/capsule trust division. The document even notes later that it's not required for binary application acquisition to happen in userspace.

In practice, I think the Apploader Capsule is basically just a passthrough capsule, exposing the underlying kernel interfaces to userspace.

I think §3 should be removed from this TRD (and moved into syscall documentation).

Contributor Author:

I think the architecture is pretty important in this case. This isn't simple to do (although hopefully the description is clear now that it exists). Choosing to drive what is, on the surface, a very low-level operation from userspace applications is not, I think, an obvious design choice.

The interfaces are mirrored, true, but we needed to get something working. I don't think they will remain mirrored (e.g. adding decompression).

Most of our TRDs are about HILs, which are different. More complex TRDs, like AppID, have to cover more (e.g. TBF headers).

bradjc and others added 3 commits March 5, 2025 14:56
Co-authored-by: Pat Pannuto <pat.pannuto@gmail.com>
Add finalize and clarify different operation connections.
Comment on lines +148 to +150
correctly upholds the Tock Threat Model guarantees. Specifically, storing a new
process binary must not cause existing process binaries to be "lost" or not
loaded on future reboots of the board.
Member:

This only covers the integrity half of the threat model. I think this also needs to either (a) elucidate thoughts on confidentiality, or, more simply, (b) just refer to the Threat Model, i.e.:

Suggested change
correctly upholds the Tock Threat Model guarantees. Specifically, storing a new
process binary must not cause existing process binaries to be "lost" or not
loaded on future reboots of the board.
correctly upholds the Tock Threat Model guarantees.

Comment on lines +227 to +230
This interface does not mandate that the kernel capsule creates a new process.
Implementations may include a policy for choosing whether to load a new process
binary. The implementation must also use the board's chosen credential checking
policy.
Member:

I'm a little confused by the phrasing here; the subsection opened by saying that this is

> an interface for loading a stored process binary into an active Tock process.

What is the point of an interface to create an active process that is not mandated to create a process?

Is it trying to say something more like this?

Suggested change
This interface does not mandate that the kernel capsule creates a new process.
Implementations may include a policy for choosing whether to load a new process
binary. The implementation must also use the board's chosen credential checking
policy.
Even if an application image was written successfully, the request to load a process
may fail. The platform application loader may include a policy for choosing whether
to load a new process binary, which may include steps such as validating the
credentials of the newly requested process.

Contributor Author:

I think some combination. It may fail, and it must respect the credential checking policy.

Member:

I agree that a combination of those two makes sense. I needed to re-read the text in the proposed TRD a couple of times to understand it; in particular, the first sentence confused me. It helps that @ppannuto's version talks about "failing" before talking about the result ("not creating / loading a process").

Comment on lines +280 to +285
Second, `SequentialDynamicBinaryStorage` does not allow the calling capsule to
write a portion of the first eight bytes of the process binary (where the
`total_size` field is located). It must write the entire region, and
`SequentialDynamicBinaryStorage` ensures those first eight bytes are correct
(i.e., they match the size of the process binary and use a valid TBF header
version).
Member:

So this means that an invalid write can (effectively, literally?) implicitly call abort()?

I think the details of that need to be explained in §4.1... would you get the abort_done callback in response to a write? Or would you have to know that there are error codes from write_done that are semantically equivalent to receiving abort_done?


Comment on lines +280 to +285
Second, `SequentialDynamicBinaryStorage` does not allow the calling capsule to
write a portion of the first eight bytes of the process binary (where the
`total_size` field is located). It must write the entire region, and
`SequentialDynamicBinaryStorage` ensures those first eight bytes are correct
(i.e., they match the size of the process binary and use a valid TBF header
version).
Member:

This all seems complicated — wouldn't it be easier to just reject that one invalid write request, but otherwise leave things unchanged?

Co-authored-by: Pat Pannuto <pat.pannuto@gmail.com>
@lschuermann lschuermann self-requested a review March 12, 2025 16:26
@bradjc (Contributor Author) commented Mar 13, 2025:

For documenting process loading, does that mean describing these two traits:

```rust
pub trait ProcessLoadingAsyncClient {
    /// A process was successfully found in flash, checked, and loaded into a
    /// `ProcessStandard` object.
    fn process_loaded(&self, result: Result<(), ProcessLoadError>);

    /// There are no more processes in flash to be loaded.
    fn process_loading_finished(&self);
}

/// Asynchronous process loading.
///
/// Machines which implement this trait perform asynchronous process loading and
/// signal completion through `ProcessLoadingAsyncClient`.
///
/// Various process loaders may exist. This includes a loader from a MCU's
/// integrated flash, or a loader from an external flash chip.
pub trait ProcessLoadingAsync<'a> {
    /// Set the client to receive callbacks about process loading and when
    /// process loading has finished.
    fn set_client(&self, client: &'a dyn ProcessLoadingAsyncClient);

    /// Set the credential checking policy for the loader.
    fn set_policy(&self, policy: &'a dyn AppIdPolicy);

    /// Start the process loading operation.
    fn start(&self);
}
```

?

I'm unsure what other questions you are thinking of that it would help to have answered in a TRD.

@lschuermann (Member) left a comment:

Did a fresh, in-depth pass over the proposal, modulo Section 5. It looks good to me generally, apart from the existing comments and the new ones I added. In particular, I share the concerns about the implicit coupling of storing and loading (through two different traits), as raised by @brghena here:

https://github.com/tock/tock/pull/4338/files#r1978601588



The `abort()` call deallocates any stored resources and resets the
implementation to handle a new `setup()` call.
`abort()` can only be called after `setup()` and before `finalize()`.
Member:

Do we want to expand on what happens when a call to finalize fails? Are the resources deallocated, or can we not provide any guarantees in that case?

