8000 WebAssembly support · Issue #503 · ohmjs/ohm · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

WebAssembly support #503

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
3 of 14 tasks
pdubroy opened this issue Apr 16, 2025 · 7 comments
Open
3 of 14 tasks

WebAssembly support #503

pdubroy opened this issue Apr 16, 2025 · 7 comments

Comments

@pdubroy
Copy link
Contributor
pdubroy commented Apr 16, 2025

Over the years we've had a few requests for a way to use Ohm from other languages — e.g. #348 mentions Dart, Rust, and Swift. Go and Python have also come up in private discussions. This proposal discusses one way we could support that: by implementing the equivalent of the Grammar.match in WebAssembly.

Goals

The primary goal is to be able to produce a WebAssembly module (.wasm) from an existing Ohm grammar. The Wasm module would export a match function, which could be used just like the existing Grammar.match method.

At minimum, we also need an API (or ABI) for walking the parse tree.

If possible, we'd also like to go a bit further:

  • support for incremental parsing
  • a well-defined way of either (a) compiling semantic actions to WebAssembly, or (b) writing semantic actions in the host language, using the parse tree API/ABI.
  • (maybe) replacing the existing Ohm.js core with the WebAsembly version

Non-goals

We won't support the full Ohm API from WebAssembly (trace, etc.) We assume that it's acceptable to run JavaScript code to produce the WebAssembly module.

Initial Plan

MVP:

  • Support a subset of Ohm grammars with pure PEG features only (no left recursion, no parameterized rules, etc.)
    • Implement compilation to Wasm for a very minimal set of features, e.g. only terminals and rule applications. No memoization.
    • Add tests that use the grammar from JS and Go.
    • Add support for pure-PEG features: ranges, any, repetition, lookahead, sequence, choice
  • Add initial support for memoization (simple, not focused on performance yet)
    • At minimum, we need a reasonable hash map implementation - maybe in AssemblyScript or Rust no_std.
  • Implement some initial performance benchmarks. Probably using the pure-PEG ES5 grammar from our SLE17 paper.
  • First cut at semantic actions in Go.
    • Unclear what this should look like. Importing a table per operation?
  • Parameterized rules
  • Direct left recursion

Post MVP:

  • Grammar inheritance
  • Space skipping logic
  • Incremental parsing
  • Extend Ohm CLI to make it easy to create Wasm bundles
  • error handling
  • implementing trace?

Beyond:

Open questions

  • For runtime code (e.g. managing memo table, potentially indirect left recursion handling, etc.), what language should we use? AssemblyScript? Rust no_std?
  • Memory management: should we try Wasm GC? Some kind of arena allocation strategy?
    • Needed for: memo table, CST nodes, rule stack?
  • Are we ok with making some breaking changes to the Ohm API? Supporting only direct left recursion would allow us to significantly simplify the implementation.
@alexwarth
Copy link
Contributor
alexwarth commented Apr 16, 2025

Hi @pdubroy,

Overall this plan looks great to me, and reflects much of what we discussed yesterday.

(1) Re: the goals,

(maybe) replacing the existing Ohm.js core with the WebAsembly version

I'm going to push for this one to go from a maybe to a definitely. For compatibility with existing grammars, it would then be important to implement the whitespace-skipping for syntactic rules and grammar inheritance. (I don't expect either of these things to be tricky, it's just more work that would need to get done.)

(2)

At minimum, we need a reasonable hash map implementation - maybe in AssemblyScript or Rust no_std.

Not necessarily. It's common for Smalltalk VMs to implement polymorphic inline caches with a small list of (class, method implentation) tuples per call site. This list is traversed linearly: Is the receiver an instance of this class? If so, then that's the implementation of the method. If not, is the receiver an instance of the second class? (And so on.) I remember Eliot Miranda pointing out that this was way faster than using a hash map.

Assuming only a small number of rules is applied at any particular position, this scheme could work very well.

(3)

Are we ok with making some breaking changes to the Ohm API? Supporting only direct left recursion would allow us to significantly simplify the implementation.

Like I said yesterday, I'm all for getting nixing indirect left recursion -- it adds a lot of complexity to the implementation and on the user's side, it's difficult to think about and rarely necessary.

I also think that we could do a better job of designing an API for writing semantic actions now than we did back in the day. (We've had lots of experience using it in Typescript, and it's easy to see several opportunities for improvement.) But this doesn't have to be a breaking change -- for a while, we could support both the old style and a better alternative. The old stuff could be deprecated and eventually go away.

Excited about this!

-- Alex

@pdubroy
Copy link
Contributor Author
pdubroy commented Apr 16, 2025

It's common for Smalltalk VMs to implement polymorphic inline caches with a small list of (class, method implentation) tuples per call site. This list is traversed linearly: Is the receiver an instance of this class? If so, then that's the implementation of the method. If not, is the receiver an instance of the second class? (And so on.) I remember Eliot Miranda pointing out that this was way faster than using a hash map.

Right! You mentioned this yesterday. You're right, there may be simpler options too. Although a true inline cache unfortunately can't be done in Wasm, because the code isn't writeable. But still, you're right that we might not even need a hash table.

@pdubroy
Copy link
Contributor Author
pdubroy commented Apr 29, 2025

A progress update —

The following test is now passing:

test('basic wasm support', async t => {
  const g = ohm.grammar(`
    G {
      
    }
  `);
  const matcher = await WasmMatcher.forGrammar(g);
  matcher.setInput('1');
  t.is(matcher.match(), 1);
});

This is really the most minimal thing possible, but required getting a lot of the basic machinery in place:

  • The WasmMatcher class, which is similar our existing Matcher class, but which creates and instantiates the Wasm module, and handles passing the input into Wasm.
  • Simple Wasm codegen to handle matching one-character terminals. It handles saving and restoring the input position (required for backtracking), updating the input position, and reading from input[currentPos].

Next:

  • Get multi-character Terminals working (just need to add a loop, but this will test that I'm reading from the input properly, etc.)
  • Start working on other PExpr types.

@pdubroy
Copy link
Contributor Author
pdubroy commented Apr 30, 2025

More progress:

  • Realized that some of my previous progress was an illusion, and one of my tests wasn't testing what I thought it was. But, fixed that issue up and got it passing for real.
  • Got multi-character terminals working, and implemented proper checking for end of input (matching should fail if not everything is consumed — at least under the current Ohm semantics)
  • Started working on choice (aka alternation), and handling short-circuiting. This made me realize that I need to change how I'm handing success/failure of subexpressions. I am currently using return values, which is how it's done in the JS implementation of Ohm. But that doesn't here — in JS, we have a stack frame from every subexpression, but the way I'm doing it in Wasm, we want a single function for the entire rule. If I use return in a subexpression, it causes the entire rule application to succeed/fail, which isn't right.
    • It's easy enough to fix — just need to "register" (fixed local variable) for the success/failure of individual parsing expressions.

@pdubroy
Copy link
Contributor Author
pdubroy commented May 2, 2025

More progress — I have choice, rule application, and sequence all working correctly. The following test now passes:

test('wasm: rule application', async t => {
  const g = ohm.grammar(`
    G {
      start = one two -- x
            | three
      
      two = "II" | "2"
      three = "3"
    }
  `);
  const matcher = await WasmMatcher.forGrammar(g);
  t.is(matchWithInput(matcher, '12'), 1);
  t.is(matchWithInput(matcher, '1II'), 1);
  t.is(matchWithInput(matcher, '3'), 1);
  t.is(matchWithInput(matcher, '13'), 0);
});

I also implemented a few things to make the generated code easier to debug. I'm currently inlining the code for all pexprs into a single function for the entire rule body, and I now have the ability to generate labels (using dummy call instructions) to see where the code for individual pexprs begins and ends:

Image

My goal next week is to get parity with the simple PEG language that we implemented for our SLE '17 paper. That will enable a couple of things:

  • shaking out bugs, as we can test with the ES5 grammar
  • doing some initial performance measurement. Performance is not the main reason for doing this, but I'm still interested to know how it performs.

And @alexwarth suggested that we should also try it without inlining, i.e. a separate function for each kind of eval method, just like Ohm does it today. It would be interesting to see if that leads to better or worse performance.

pdubroy added a commit that referenced this issue May 9, 2025
Initial support for WebAssembly, as describe in #503.

Adds a new package which exposes a `WasmMatch` class. Internally, `WasmMatcher` compiles the grammar to a WebAssembly module that exports a `match` method.

Working so far:
- basic matching use mostly pure-PEG features (no left recursion, parameterized rules, etc.)
- rudimentary CST with a few tests

Not yet:
- memoization
- support for Ohm-specific features
- clean interface for writing semantic actions
@pdubroy
Copy link
Contributor Author
pdubroy commented May 9, 2025

Made some good progress this week! Everything is now on main, in a separate wasm package.

We can now compile real-world grammars to WebAssembly. Haven't tried ES5 yet, but there's ~50-line grammar in the test that is reasonably complicated. It does support inheritance and most of the built-in primitive rules, but not everything (e.g. UnicodeChar('Ltmo') is not supported yet).

There's also rudimentary support for building a very minimal CST, and a JavaScript helper that can walk the CST. We don't yet record rule names though, so it's not yet possible to write proper semantic actions.

There's also no memoization, automatic space skipping, or any Ohm-specific features.

Next week I plan to implement some kind of memoization, and implement basic support for true semantic actions.

Edit: I landed a modified version of the ES5 grammar from our SLE'17 paper. It currently fails to parse a trivial example, but I will look into that soon.

@pdubroy
Copy link
Contributor Author
pdubroy commented May 16, 2025

Progress this week:

  • I fixed the ES5 grammar, and got it working for small examples. However, that grammar was never intended to be used without memoization, and the way it's written makes the parse times blow up exponentially in the absence of memoization, so I couldn't really test larger examples.
  • So, I implemented memoization. But when I did that, I realized that I needed to rethink the CST representation. I originally used very simple, pointer-free implementation (a variation on Apter trees), but that was very awkward to deal with in the presence of memoization.
  • So I changed the way that I'm doing CSTs, and implemented a more conventional representation. It is a bit more direct than the way it works in JS though — it stores the child pointers inline, rather than in a separate object.

Image

  • This works well for almost all node types, except iteration nodes, because an iteration node can have any number of children. So I implemented a solution to support an arbitrary number of children. I wrote about that here: https://bsky.app/profile/dubroy.com/post/3lpcbstxi6g2a
  • Now that I have an appropriate CST representation, it should (hopefully 🤞) be fairly easy to get proper memoization working. That will be my task for Monday.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants
0