8000 Ephemeron references · Issue #303 · smlnj/legacy · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Ephemeron references #303

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Skyb0rg007 opened this issue Mar 12, 2024 · 11 comments
Open

Ephemeron references #303

Skyb0rg007 opened this issue Mar 12, 2024 · 11 comments
Labels
enhancement New feature or request

Comments

@Skyb0rg007
Copy link
Contributor

Description

From SRFI 124:

An ephemeron is an object with two components called its key and its datum. It differs from an ordinary pair as follows: if the garbage collector (GC) can prove that there are no references to the key except from the ephemeron itself and possibly from the datum, then it is free to break the ephemeron, dropping its reference to both key and datum. In other words, an ephemeron can be broken when nobody else cares about its key. Ephemerons can be used to construct weak vectors or lists and (possibly in combination with finalizers) weak hash tables.

These kinds of references are invaluable for memoization libraries, or any kind of library which builds and maintains a table of all objects in the library.

Ephemerons are implemented in the following languages/implementations, for ideas on naming and features:

@Skyb0rg007 Skyb0rg007 added the enhancement New feature or request label Mar 12, 2024
@JohnReppy
Copy link
Contributor

Can you suggest a signature for this? Perhaps it should be a Basis Library enhancement?

@Skyb0rg007
Copy link
Contributor Author

Here is one potential modification to the current WEAK signature that would support this idea.

signature EPHEMERON =
sig
  type 'a weak

  (* [ephemeron (k, v)] constructs a weak pointer with key [k] and value [v].
     This maintains a reference to [v] so long as [k] remains alive.
     Note that [k] is not considered alive if the only reference to [k] comes through this ephemeron's reference to [v]. *)
  val ephemeron : 'k * 'v -> 'v weak

  (* [weak v] is implemented as [ephemeron (v, v)].
      This maintains a reference to [v] so long as [v] remains alive through other means.
      The semantics are the same as the existing [Weak.weak]. *)
  val weak : 'v -> 'v weak

  (* [strong w] returns [NONE] if the key that was used to construct [w] has been garbage-collected.
      It returns [SOME v] if the key used to construct [w] is alive, where [v] is the value used to construct [w]. *)
  val strong : 'v weak -> 'v option
end

Example:

structure E : EPHEMERON

type id = unit ref
datatype node = NODE of {id : id, parent : id, data : int }

(* Simple ephemeron table mapping ids to nodes
   Note: This doesn't work with [Weak.weak], since you need an external reference to the data under the pointer itself.
   But here, you have external references to an [id], which will keep the [node] alive, even through the only way to get
   from an [id] to a [node] is though this table.
   Additionally, this allows for a node to refer to its own id, since the `value` can't keep the `key` alive through the ephemeron. *)
val nodes : node E.weak list ref = ref []

(* mkNode : id * int -> id *)
fun mkNode (parent, data) =
  let val id = ref ()
      val n = NODE { id = id, parent = parent, data = data }
  in  nodes := E.ephemeron (id, n) :: !nodes;
       id
  end

(* get : id -> node option *)
fun get id =
   let
     fun loop [] = NONE
       | loop (w :: ws) = case E.strong w
            of NONE => loop ws
             | SOME (node as NODE {id = id', ...}) => if id = id' then SOME node else loop ws
   in
     loop (!nodes)
   end

fun test () = let
  val rootId = ref ()
  val root = NODE { id = rootId, parent = rootId, data = 0 }
  val () = node := rootId :: !nodes
  val child1Id = mkNode (rootId, 1)
  val child2Id = mkNode (rootId, 2)
  in
     child2Id
  end
val child2Id = test ()
(* Assume no GC occurred *)
(* map E.strong (!nodes) = ref [SOME root, SOME child1, SOME child2] *)
val _ = doGC ()
(* No reference to `child1Id`, so it is collected. `root` is alive through `child2`, which is alive through `child2Id`. *)
(* map E.strong (!nodes) = ref [SOME root, NONE, SOME child2] *)
val child2Id = () (* ie. child2Id is now dead *)
val _ = doGC ()
(* No reference to `rootId`, so `root` is collected. Same with `child1`
(* map E.strong (!nodes) = ref [NONE, NONE, NONE] *)

@JohnReppy
Copy link
Contributor

I think that this mechanism (if implemented) should be a separate structure and type from the existing Weak structure. For one thing, ephemeron are more space expensive and more complicated to implement in the GC. I would propose

signature EPHEMERON =
  sig

    (* `ephemeron (k, v)` constructs a weak pointer with key `k` and value `v`.
      * This maintains a reference to `v` so long as `k` remains alive. Note that `k`
      * is not considered alive if the only reference to `k` comes through this
      * ephemeron's reference to `v`.
      *)
    val ephemeron : 'k * 'v -> 'v ephemeron

    (* `broken eph` returns `true` if the key that was used to create `eph` has been garbage collected and `false`
     * otherwise.
     *)
    val broken : 'v ephemeron -> bool

    (* `get eph` returns `NONE` if the key that was used to construct `eph` has been garbage collected.
     * It returns `SOME v` if the key used to construct `eph` is alive, where `v` was the value used to
     * construct `eph`.
     *)
    val get : 'v ephemeron -> 'v option

  end

SRFI 124 includes an operation for getting the key value, which would mean that the type constructor for an ephemeron would have to have two type arguments. It also have a "barrier" operation for pinning down a key until after the barrier is executed. I assume that this is a mechanism that guarantees that the compiler is not allowed to optimize away the reference to the key. Such a mechanism might be generally useful independent of ephemerons.

@Skyb0rg007
Copy link
Contributor Author

I would avoid functions such as broken : 'v ephemeron -> bool, since they inherently have a race condition (broken p may return true, but then the GC runs before you call get). Requiring a user to call Option.isSome (get p) is not a big deal if that is actually required. This is only needed for Scheme where you would otherwise be unable to distinguish a live pointer to #f from a broken pointer.

Allowing key access is not explicitly necessary from an API perspective, since you can always implement ('k, 'v) ephemeron as ('k * 'v) ephemeron. But it could be good to allow if not more expensive implementation-wise.

The reference barrier could be useful as a primitive. I know it exists in MLton in the MLtonFinalizable module as touch : 'a t -> unit. I'm not entirely sure why it would be needed unless you allow access to the key in the API.

@JohnReppy
Copy link
Contributor
JohnReppy commented Nov 9, 2024

I think that broken is kind of like pointer equality for SML, if it returns true, then you know something, but it it returns false, not so much. There is no data race, since the get operation will return NONE if the GC ran between the calls to broken and get.

Also, I am interpreting the statement

Note that k is not considered alive if the only reference to k comes through this ephemeron's reference to v

as

Note that k is not considered alive if the only reference to k comes through this ephemeron's reference to v and v is not otherwise reachable.

Furthermore, the notion that k is alive is only meaningful if k has object identity (i.e., it is a ref cell or similar). Perhaps the interface should be something like

signature EPHEMERON =
  sig

    type 'a key
    type 'v ephemeron

    val broken : 'v ephemeron -> bool

    val ephemeron : 'k key * 'v -> 'v ephemeron

    val getValue : 'v ephemeron -> 'v option

    val key : 'k -> 'k key
    val getKey : 'k key -> 'k
    val sameKey : 'k key * 'k key -> bool

  end

@Skyb0rg007
Copy link
Contributor Author

I think that broken is kind of like pointer equality for SML, if it returns true, then you know something, but it it returns false, not so much. There is no data race, since the get operation will return NONE if the GC ran between the calls to broken and get.

This seems backwards: if broken returns false, you've learned no information: the ephemeron may be dead the next time you look at it because the ephemeron could already be unreferenced. You've only learned information if broken returns true. I guess there could be a use but I think you'd agree it's easy to misuse accidentally by forgetting which return value gives you information. There's a reason why Unsafe.ptrEq should be in the "Unsafe" structure.

That signature may have a better notion of "object identity", but I don't think it matters for usability, and it will have performance impact (an additional allocation). Weak references have the same issues regarding object identity so ignoring that issue has precedence. If this is an important feature, the actual implementation could avoid keys and be under Unsafe.Ephemeron, and the implementation under SMLofNJ.Ephemeron could be the same implementation but with 'a key implemented as 'a ref.

@JohnReppy
Copy link
Contributor
JohnReppy commented Nov 11, 2024

W.r.t broken, I think that you are agreeing with me. true means that you learned something and false doesn't.

Weak references themselves have object identity, so it is not the same situation. What would it mean to have a int key value? Since the notion of a "reference to the key" is critical to the semantics, keys must have object identity; otherwise, the semantics does not make sense.

@Skyb0rg007
Copy link
Contributor Author
Skyb0rg007 commented Nov 11, 2024

Apologies, I misread your comment regarding broken.

With respect to the keys, having an int key can be given ambiguous semantics, the same way that Weak.weak has ambiguous semantics:

The semantics of weak pointers to immutable data structures in ML is ambiguous.

Also, the Haskell description of their GHC.Weak.Weak datatype may have a clearer definition of ephemerons to start the definition from.
Here is my new proposed description:

An emphemeron pair expresses a relationship between two objects, the key and the value: if the key is considered to be alive by the garbage collector, then the value is also alive. A reference from the value to the key does not keep the key alive by itself, though any reference to the key from the root set through the value will keep the key alive.
Note: The garbage collector's semantics with respect to immutable data structures is ambiguous. The key should be a data structure with identity.

The GHC implementation of ephemerons is something like the following since they don't have weak pointers, only ephemerons (I removed finalizers for this pseudocode example):

struct ephemeron {
  header_t header;
  value_t *key;
  value_t *value;
  struct ephemeron *_next;
};

void gc() {
  // ...
  struct ephemeron *ephemerons = NULL;
  vector<value_t *> worklist;
  // Keep looping until we have no more values to visit and have processed every ephemeron
  while (worklist.len() > 0 || ephemerons != NULL) {
    value_t *v = worlist.pop();
    // ...
    // Phase 1: mark everything as reachable, putting reachable ephemerons in a list for later.
    // ...
    if (is_ephemeron(v)) {
      ((struct ephemeron *)v)->_next = ephemerons;
      ephemerons = v;
    }
    // ...
    // Phase 2: Handle ephemerons
    for (struct ephemeron *e = ephemerons; e->_next != NULL; e = e->_next) {
      if (is_alive(e->key)) {
        visit(e->value); // copy/mark value as alive/visit the value
     } // else the ephemeron doesn't get copied and it dies if it isn't resurrected by the end of the main loop
    }
    // keep going so long as there are values in the work list or ephemerons in the linked list.
  } // end main loop
}

@JohnReppy
Copy link
Contributor
JohnReppy commented Nov 12, 2024

Regarding the semantics of weak references in SML/NJ, I think "non-deterministic" is a better characterization than "ambiguous." Also, the issues are more to do with the semantics of the language and what that allows compilers to do than the semantics of the GC (although the language semantics does allow for replication by the GC).
If you create a weak reference from an unboxed type, it will never be nullified. The problem with boxed values that do not have identity is that the semantics of SML does not distinguish between different copies of the same value. Here is a small example that illustrates the issue:

- fun f (x : (int * int)) = SMLofNJ.Weak.weak x;
val f = fn : int * int -> (int * int) ?.Weak.weak
- val y = (17, 42);
val y = (17,42) : int * int
- val z = f y;
val z = - : (int * int) ?.Weak.weak
- SMLofNJ.Internals.GC.doGC 5;
val it = () : unit
- SMLofNJ.Weak.strong z;
val it = NONE : (int * int) option
- y;
val it = (17,42) : int * int

So even though y is still live, the weak reference has been nullified. The reason is that the compiler will optimize the calling convention for a function that takes a pair of integer arguments to pass them in registers, so the heap object that the weak reference refers to is different from the heap object that y refers to.

There was a paper in ISSM'06 by Donnelly et al that defines a formal semantics for weak references. The paper claims that it can handle GHC's weak references, so perhaps that work could form a basis for specifying ephemerons.

@Skyb0rg007
Copy link
Contributor Author

To make sure I fully understand, the issue you describe could also occur if x had type (int * int ref) because tuples in SML have eta-equivalency: it is okay for the compiler to take apart a tuple and rebuild it.

@JohnReppy
Copy link
Contributor

I believe that you are correct. If it has type (int * int) ref, however, then y would be guaranteed to force the weak reference to be retained.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants
0