Add a pass to simplify single-character choices #425

markw65 · 2023-08-20T00:13:21Z

I have a modified version of examples/xml.peggy in my project, and was trying to optimize it for speed. I found a few places where it could be rewritten to avoid backtracking, getting about a 20% win, but with some readability cost.

Then I noticed that there are a number of long lists of alternate character classes, and they end up being implemented very inefficiently; effectively it tries one regex after another in sequence.

I manually combined all the character classes for BaseChar into a single character class, and got a nearly 2x speed up.
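
To illustrate why a chain of alternatives costs more than one class, the sketch below compares the two matching strategies on a single character. It assumes, as described above, that each alternative is tested with its own regular expression in turn. The ranges are a handful chosen for illustration rather than the full BaseChar definition, and the function names are hypothetical.

```javascript
// A few illustrative ranges, not the full BaseChar production.
const separate = [/^[A-Z]/, /^[a-z]/, /^[\u00C0-\u00D6]/, /^[\u00D8-\u00F6]/];
const combined = /^[A-Za-z\u00C0-\u00D6\u00D8-\u00F6]/;

// One test per alternative: a late match or a miss walks most of the list.
function matchSequential(ch) {
  let tests = 0;
  for (const re of separate) {
    tests++;
    if (re.test(ch)) {
      return { matched: true, tests };
    }
  }
  return { matched: false, tests };
}

// A single class answers the same question with one test.
function matchCombined(ch) {
  return { matched: combined.test(ch), tests: 1 };
}

console.log(matchSequential("\u00F6")); // matched, after 4 tests
console.log(matchCombined("\u00F6"));   // matched, after 1 test
```

With dozens of alternatives, as in the long lists mentioned above, the sequential form pays that per-alternative cost on almost every character.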

So I've added a pass that does that automatically. The result is a more than 3x speedup parsing the xml files that I work with (which seem fairly typical).

The pass takes a rule like chars: "a" / "b" / [d-r] and turns it into chars: [abd-r]
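
As a rough illustration of that transformation (not the actual code of the pass), the sketch below merges adjacent single-character literals and character classes of a choice into one class. The node shapes follow the `{ type, parts, inverted, ignoreCase }` form quoted later in this thread; the helper names `toClass`, `mergeChoice`, and `render` are hypothetical, inverted classes are simply skipped, and `render` does no escaping.

```javascript
// Hypothetical helpers; node shapes mirror the class node shown in this thread.

// Convert a node to a non-inverted class node, or return null if it cannot
// be treated as a single-character match.
function toClass(node) {
  if (node.type === "class" && !node.inverted) {
    return node;
  }
  if (node.type === "literal" && node.value.length === 1) {
    return { type: "class", parts: [node.value], inverted: false, ignoreCase: node.ignoreCase };
  }
  return null;
}

// Merge adjacent alternatives of a choice when both are single-character
// matches with the same ignoreCase flag; leave everything else alone.
function mergeChoice(choice) {
  const alternatives = [];
  let prev = null;
  for (const alt of choice.alternatives) {
    const cls = toClass(alt);
    if (cls && prev && prev.ignoreCase === cls.ignoreCase) {
      prev.parts.push(...cls.parts);
      continue;
    }
    // Copy parts so the original node is never mutated by a later merge.
    prev = cls ? { ...cls, parts: cls.parts.slice() } : null;
    alternatives.push(prev || alt);
  }
  return alternatives.length === 1 ? alternatives[0] : { type: "choice", alternatives };
}

// Render a class node back to grammar syntax, for display only.
function render(cls) {
  const parts = cls.parts.map(p => (Array.isArray(p) ? `${p[0]}-${p[1]}` : p));
  return `[${parts.join("")}]${cls.ignoreCase ? "i" : ""}`;
}

const chars = {
  type: "choice",
  alternatives: [
    { type: "literal", value: "a", ignoreCase: false },
    { type: "literal", value: "b", ignoreCase: false },
    { type: "class", parts: [["d", "r"]], inverted: false, ignoreCase: false },
  ],
};

console.log(render(mergeChoice(chars))); // [abd-r]
```

Only adjacent alternatives are merged here, so the relative order of anything that is not a single-character match is preserved.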

markw65 · 2023-08-20T00:18:25Z

I've put it under an option, mergeCharacterClasses, because enabling it by default breaks a number of tests that have specific expectations for error messages:

    Expected value to strictly be equal to:
      "Expected \"a\", \"b\", or \"c\" but \"d\" found."
    Received:
      "Expected [abc] but \"d\" found."

Since that test is specifically designed to ensure that an appropriate error message is output for choice nodes, I didn't want to just change the expected output. But I'd be happy to change the grammar for such tests so that the choice nodes remain choice nodes, and to enable the optimization by default, if that would be preferable.

Mingun

I generally agree that such an optimization is very useful, but we need more tests to ensure that it works as expected and does not process pieces that it shouldn't.

Can you please add tests for negative scenarios, at least?

I'm not sure whether we should add a setting for this optimization, or whether it should be enabled by default. It seems to me that a setting is superfluous and we can always apply this optimization. So yes, I'm fine with changing the expected error messages in the failing tests.

Mingun · 2023-08-20T13:43:33Z

test/unit/compiler/passes/merge-character-classes.spec.js

+        " / 'a' / [c-g]",
+        "two = three / 'P' / [Q-T]",
+        "three = 'x' / [u-w]",


You also need to test overlapping ranges. Also, I think you need to add negative tests that ensure that significant parts of the grammar aren't lost. For example, if the three rule contains a $ operator or an action block.

Mostly agreed - I'll certainly add more tests.

For example, if the three rule contains a $ operator or an action block

In that case, its expression's type will be text or action and asClass will ignore it. So I don't need to do anything to handle that.

But you made me realize that I should explicitly check for text and look through it, because we actually want
foo = "a" / $("b" / [xyz]) / "X" to get converted to foo = [Xabxyz].
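
A minimal sketch of the "look through text" idea, assuming simplified node shapes (`text` and `action` nodes wrapping an `expression`, `choice` nodes holding `alternatives`). The name `asClassSketch` is hypothetical and this is not the code from the PR; the final sort is there only so the printed form matches the [Xabxyz] example above, and it would not handle range parts.

```javascript
// Hypothetical sketch: return a class node for anything that matches exactly
// one character and yields that character as its result, else null.
function asClassSketch(node) {
  switch (node.type) {
    case "class":
      return node.inverted ? null : { ...node, parts: node.parts.slice() };
    case "literal":
      return node.value.length === 1
        ? { type: "class", parts: [node.value], inverted: false, ignoreCase: node.ignoreCase }
        : null;
    case "text":
      // $(expr) returns the matched text, which for a single-character
      // match is the same value the class itself would return.
      return asClassSketch(node.expression);
    case "choice": {
      const classes = node.alternatives.map(asClassSketch);
      if (
        classes.length === 0
        || classes.some(c => c === null || c.ignoreCase !== classes[0].ignoreCase)
      ) {
        return null;
      }
      return {
        type: "class",
        parts: classes.flatMap(c => c.parts),
        inverted: false,
        ignoreCase: classes[0].ignoreCase,
      };
    }
    default:
      // Actions and other node types can change the result value, so they
      // are left alone.
      return null;
  }
}

const lit = value => ({ type: "literal", value, ignoreCase: false });
const foo = {
  type: "choice",
  alternatives: [
    lit("a"),
    {
      type: "text",
      expression: {
        type: "choice",
        alternatives: [
          lit("b"),
          { type: "class", parts: ["x", "y", "z"], inverted: false, ignoreCase: false },
        ],
      },
    },
    lit("X"),
  ],
};

const merged = asClassSketch(foo);
console.log(`[${[...merged.parts].sort().join("")}]`); // [Xabxyz]
```

An `action` node falls into the default branch and yields null, which matches the point above that such rules need no special handling.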

Mingun · 2023-08-20T13:56:40Z

lib/compiler/passes/merge-character-classes.js

+              const a1 = Array.isArray(a) ? a[0] : a;
+              const b1 = Array.isArray(b) ? b[0] : b;


I think that, for clarity, it is better to define the start and end boundaries explicitly:

Suggested change

-              const a1 = Array.isArray(a) ? a[0] : a;
-              const b1 = Array.isArray(b) ? b[0] : b;
+              const [aStart, aEnd] = Array.isArray(a) ? a : [a, a];
+              const [bStart, bEnd] = Array.isArray(b) ? b : [b, b];

Also, maybe extract the whole sort-and-remove logic into a utility method and test it separately.
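
One way such a utility could look, as a sketch only: the name `cleanPartsSketch` is hypothetical, and merging directly adjacent ranges is a choice made for this example rather than something confirmed by the PR. It assumes each part is a single UTF-16 code unit or a [start, end] pair of them, and uses the explicit start and end boundaries suggested above.

```javascript
// Hypothetical utility: sort class parts and drop duplicates and overlaps.
// A part is either a single character or a [start, end] pair.
function cleanPartsSketch(parts) {
  const ranges = parts
    .map(p => (Array.isArray(p) ? [p[0], p[1]] : [p, p]))
    .sort(([aStart, aEnd], [bStart, bEnd]) => {
      if (aStart !== bStart) {
        return aStart < bStart ? -1 : 1;
      }
      if (aEnd !== bEnd) {
        return aEnd < bEnd ? -1 : 1;
      }
      return 0;
    });

  const merged = [];
  for (const [start, end] of ranges) {
    const last = merged[merged.length - 1];
    // Overlapping or directly adjacent ranges collapse into one.
    if (last && start.charCodeAt(0) <= last[1].charCodeAt(0) + 1) {
      if (end > last[1]) {
        last[1] = end;
      }
    } else {
      merged.push([start, end]);
    }
  }
  // Turn one-character ranges back into plain characters.
  return merged.map(([start, end]) => (start === end ? start : [start, end]));
}

const cleaned = cleanPartsSketch(["x", ["u", "w"], ["c", "g"], "a", ["e", "k"], "a"]);
console.log(JSON.stringify(cleaned)); // ["a",["c","k"],["u","x"]]
```

Keeping this as a pure function over `parts` makes the overlapping-range cases easy to test without building a grammar.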

markw65 · 2023-08-21T00:22:58Z

I just realized I need to update lib/peggy.js and docs/js/examples.js now that this is done unconditionally. Should I do that now, or wait until you're happy with the state of the commit?

hildjj · 2023-08-21T17:34:45Z

This looks good to me, I think. @Mingun any other comments?

hildjj · 2023-08-21T17:35:22Z

Except one needed addition: a CHANGELOG entry, please.

markw65 · 2023-08-21T17:49:28Z

Except one needed addition: a CHANGELOG entry, please

I put it in "minor" updates - let me know if it should be somewhere else...

hildjj · 2023-08-21T17:54:32Z

Sorry for moving the goalposts, but once I thought of coverage on #427, there are a few lines that need to be hit on this one in lib/compiler/passes/merge-character-classes.js, unless they are particularly hard to reach.

Mingun

Need to check how this optimization will affect source map generation. Other than that, all is good

Mingun · 2023-08-21T18:29:41Z

lib/compiler/passes/merge-character-classes.js

+        return asClass(node.expression);
+      }
+      if (node.type === "literal" && node.value.length === 1) {
+        return { type: "class", parts: [node.value], inverted:false, ignoreCase: node.ignoreCase };


Suggested change

-        return { type: "class", parts: [node.value], inverted:false, ignoreCase: node.ignoreCase };
+        return { type: "class", parts: [node.value], inverted: false, ignoreCase: node.ignoreCase };

Need to check how this optimization will affect source map generation

Good point. I think there's one small issue I need to address...

markw65 · 2023-08-21T19:09:44Z

I'm struggling with coverage. I've made a couple of changes to my new tests to hit the uncovered lines, and when I run

npx jest -t mergeCharacterClasses

it shows that everything (in my file) is covered.

But when I run npm test or npm run coverage and run all the tests, it shows two lines as not covered... how does that happen?

[edit] Also, when I set breakpoints on the two supposedly uncovered lines, they both hit...

hildjj · 2023-08-21T19:50:37Z

I'm going off what I see in the coverage/lcov-report directory after having run npm run build. That dir can sometimes be a little confusing because it doesn't get cleaned up every time, and the URLs change if you rerun with a subset of the tests. Try deleting the directory, then navigating to the correct file from index.html when you run a different set of tests.

Here are the three lines that I currently see uncovered in the full test suite: 22, 67, 73.

I don't see breakpoints hit on any of those, but checked to make sure that I have everything set up right by having breakpoints fire on other lines in your file.

hildjj · 2023-08-21T19:51:22Z

Also, I'm on the Discord channel if you want to try and figure this out in realtime.

markw65 · 2023-08-21T21:50:30Z

I rewrote things a little:

- some things that asClass was checking for were unnecessary because merge would have already handled them
- preserve locations when merging nodes
- only call cleanParts after merging all the alternates, for efficiency

And then I updated the tests so that everything is covered.

markw65 · 2023-08-24T14:50:14Z

I think I addressed everything in the last update. Let me know if there's anything else to do here.

hildjj · 2023-08-27T19:39:41Z

@Mingun, one last thumbs-up, please?

hildjj · 2023-08-27T19:41:51Z

@Mingun this one needs a final thumbs-up as well, please.

Mingun

Approved, except for minor nits.

It seems we could optimize slightly by reducing the number of unnecessary clones, but the benefit also seems to be very low, so we can leave this in its current state.

We can add tests for the ignoreCase cases.

Also, I suggest inserting session.emitInfo() calls when a node is changed.

Mingun · 2023-08-28T16:01:38Z

lib/compiler/passes/merge-character-classes.js

+          const cls = asClass(ref);
+          // Return a clone, not the original, because the value we return
+          // may get merged into.
+          return cls && {
+            type: cls.type,
+            parts: cls.parts.slice(),
+            inverted: cls.inverted,
+            ignoreCase: cls.ignoreCase,
+            location: node.location,
+          };


Maybe it would be better to request cloned data from asClass, because we only need to clone class nodes; the other nodes are already created by us as new objects.

So probably

return asClass(ref, true /* return a clone */);

here and asClass(..., false) everywhere else. Or even make the clone at line 118:

prev = cls;

It seems the clone is really needed only there

Makes sense.

While we're talking about clones though, it seems to be hard to produce generic clones in this project.

If I say Object.assign({}, objectToClone) eslint tells me to use { ...objectToClone }, but if I say { ...objectToClone }, it tells me Parsing error: Unexpected token ....

I don't see why we shouldn't use ... (it works with node14 and later, and will presumably get fixed by rollup)? Can I change the eslint settings?

If you double-check that ... gets replaced by the rollup step, submit a bug to https://github.com/peggyjs/peggyjs-eslint-config and I'll fix it. There are a couple of other things I want to make sure are up-to-date while I'm in there.

(it's actually the typescript step that is likely doing the replacement, which will be easier to check)
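
The aliasing concern behind the clone discussion above can be shown with plain objects. This is a sketch with made-up data, not code from the pass: it illustrates that a spread clone alone still shares the `parts` array, so a node that may later be merged into also needs its own copy of `parts`.

```javascript
// A class node standing in for the body of a referenced rule (made-up data).
const three = { type: "class", parts: ["x", ["u", "w"]], inverted: false, ignoreCase: false };

// Spread copies the top-level fields only; `parts` is still the same array.
const shallow = { ...three };
console.log(shallow.parts === three.parts); // true

// Copying `parts` as well gives a node that is safe to merge into.
const safe = { ...three, parts: three.parts.slice() };
safe.parts.push("P", ["Q", "T"]);

console.log(three.parts.length); // 2
console.log(safe.parts.length);  // 4
```

The inner [start, end] pairs are still shared between the two nodes, which is harmless as long as merging only appends parts and never edits a range in place.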

Mingun · 2023-08-28T16:02:53Z

test/behavior/generated-parser-behavior.spec.js

          expect(parser).to.failToParse("d", {
            expected: [
-              { type: "literal", text: "a", ignoreCase: false },
-              { type: "literal", text: "b", ignoreCase: false },
-              { type: "literal", text: "c", ignoreCase: false },
+              { type: "literal", text: "aa", ignoreCase: false },
+              { type: "literal", text: "bb", ignoreCase: false },
+              { type: "literal", text: "cc", ignoreCase: false },
            ],


Strictly speaking, we should give the parser a chance not to fail, by parsing dd here and below :)

test/unit/compiler/passes/merge-character-classes.spec.js

markw65 · 2023-08-30T18:28:21Z

Anything else to do here?

markw65 · 2023-09-06T22:12:55Z

@hildjj @Mingun ping again?

Mingun suggested changes Aug 20, 2023

markw65 mentioned this pull request Aug 21, 2023

Fix a typo in examples/xml.peggy #426

Merged

markw65 mentioned this pull request Aug 21, 2023

Avoid double extraction of substrings in various MATCH_ bytecodes #427

Merged

markw65 added 5 commits August 21, 2023 10:43

Remove option, and update tests to match

eac0374

Better cleanup of merged class nodes, and more tests

51031c1

Update CHANGELOG.md

3ec0b87

Update build artifacts

c6980bd

markw65 force-pushed the merge-character-classes branch from 5d56503 to c6980bd Compare August 21, 2023 17:47

Mingun approved these changes Aug 21, 2023

markw65 added 2 commits August 21, 2023 14:36

Preserve locations, and don't call cleanParts until we have all the parts

6c7c6b8

Update tests to ensure better coverage

6fcbecc

Mingun approved these changes Aug 28, 2023

markw65 added 3 commits August 28, 2023 10:37

Fixes for review comments

31afa1c

Build Artifacts

965d772

d -> dd

2a6393b

markw65 mentioned this pull request Aug 28, 2023

Lint doesn't allow cloning via Object.assign({}, obj) or via { ...obj } peggyjs/peggyjs-eslint-config#14

Closed

markw65 mentioned this pull request Aug 29, 2023

Es2018 peggyjs/peggyjs-eslint-config#15

Merged

markw65 added 3 commits August 29, 2023 08:09

Update to latest @peggyjs/eslint-config

cbc8f8d

Turn on type checking for merge-character-classes, and use ellipsis to clone

88df0a4

Build Artifacts

8527532

markw65 mentioned this pull request Sep 6, 2023

Bytecode optimizer #429

Closed

Mingun approved these changes Sep 7, 2023

hildjj merged commit e25036f into peggyjs:main Sep 7, 2023

markw65 deleted the merge-character-classes branch September 7, 2023 14:15
