8000 Add a pass to simplify single-character choices by markw65 · Pull Request #425 · peggyjs/peggy · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Add a pass to simplify single-character choices #425

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Sep 7, 2023

Conversation

markw65
Copy link
@markw65 markw65 commented Aug 20, 2023

I have a modified version of examples/xml.peggy in my project, and was trying to optimize it for speed. I found a few places where it could be rewritten to avoid backtracking, getting about a 20% win, but with some readability cost.

Then I noticed that there are a number of long lists of alternate character classes, and they end up being implemented very inefficiently; effectively it tries one regex after another in sequence.

I manually combined all the character classes for BaseChar into a single character class, and got a nearly 2x speed up.

So I've added a pass that does that automatically. The result is a more than 3x speedup parsing the xml files that I work with (which seem fairly typical).

The pass takes a rule like chars: "a" / "b" / [d-r] and turns it into chars: [abd-r]

@markw65
Copy link
Author
markw65 commented Aug 20, 2023

I've put it under an option, mergeCharacterClasses because enabling it by default breaks a number of tests, which have specific expectations for error messages:

    Expected value to strictly be equal to:
      "Expected \"a\", \"b\", or \"c\" but \"d\" found."
    Received:
      "Expected [abc] but \"d\" found."

Since that test is specifically designed to ensure that an appropriate error message is output for choice nodes I didn't want to just change the expected output. But I'd be happy to change the grammar for such tests so that the choice nodes remain as choice nodes, and enable the optimization by default if that would be preferable.

Copy link
Member
@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally agree, that such optimization is very useful, but we need more tests to ensure, that it works as expected and does not processes pieces which it shouldn't.

Can you please add tests for negative scenarios, at least.

I'm not sure, should we add a setting for this optimization and should it be enabled by default. It seems to me, that settings is superfluous and we can apply this optimization always. So yes, I'm fine with changing expected error messages in failed tests

Comment on lines 15 to 17
" / 'a' / [c-g]",
"two = three / 'P' / [Q-T]",
"three = 'x' / [u-w]",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also need to test overlapping ranges. Also, I think, you need to add negative tests, that will ensure, that significant parts of grammar doesn't lost. For example, if three rule will contain $ operator, or action block

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly agreed - I'll certainly add more tests.

For example, if three rule will contain $ operator, or action block

In that case, its expression's type will be text or action and asClass will ignore it. So I don't need to do anything to handle that.

But you made me realize that I should explicitly check for text and look through it, because we actually want
foo = "a" / $("b" / [xyz]) / "X" to get converted to foo = [Xabxyz].

Comment on lines 66 to 67
const a1 = Array.isArray(a) ? a[0] : a;
const b1 = Array.isArray(b) ? b[0] : b;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, that for clarity it is better to explicitly define start and end boundaries:

Suggested change
const a1 = Array.isArray(a) ? a[0] : a;
const b1 = Array.isArray(b) ? b[0] : b;
const [aStart, aEnd] = Array.isArray(a) ? a : [a, a];
const [bStart, bEnd] = Array.isArray(b) ? b : [b, b];

Also, maybe extract the whole sort and remove stuff into utils method and test it separately

@markw65
Copy link
Author
markw65 commented Aug 21, 2023

I just realized I need to update lib/peggy.js and docs/js/examples.js now that this is done unconditionally. Should I do that now, or wait until you're happy with the state of the commit?

@hildjj
Copy link
Contributor
hildjj commented Aug 21, 2023

This looks good to me, I think. @Mingun any other comments?

@hildjj
Copy link
Contributor
hildjj commented Aug 21, 2023

Except one needed addition: a CHANGELOG entry, please.

markw65 added 5 commits August 21, 2023 10:43
I have a modified version of examples/xml.peggy in my project, and
was trying to optimize it for speed. I found a few places where
it could be rewritten to avoid backtracking, getting about a 20%
win, but with some readability cost.

Then I noticed that there are a number of long lists of alternate
character classes, and they end up being implemented very
inefficiently; effectively it tries one regex after another in
sequence.

I manually combined all the character classes for BaseChar into
a single character class, and got a nearly 2x speed up.

So I've added a pass that does that automatically. The result is
a more than 3x speedup parsing the xml files that I work with
(which seem fairly typical).
@markw65 markw65 force-pushed the merge-character-classes branch from 5d56503 to c6980bd Compare August 21, 2023 17:47
@markw65
Copy link
Author
markw65 commented Aug 21, 2023

Except one needed addition: a CHANGELOG entry, please

I put it in "minor" updates - let me know if if should be somewhere else...

@hildjj
Copy link
Contributor
hildjj commented Aug 21, 2023

Sorry for moving the goalposts, but once I thought of coverage on #427, there are a few lines that need to be hit on this one in lib/compiler/passes/merge-character-classes.js, unless they are particularly hard to reach.

Copy link
Member
@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to check how this optimization will affect source map generation. Other than that, all is good

return asClass(node.expression);
}
if (node.type === "literal" && node.value.length === 1) {
return { type: "class", parts: [node.value], inverted:false, ignoreCase: node.ignoreCase };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return { type: "class", parts: [node.value], inverted:false, ignoreCase: node.ignoreCase };
return { type: "class", parts: [node.value], inverted: false, ignoreCase: node.ignoreCase };

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to check how this optimization will affect source map generation

Good point. I think there's one small issue I need to address...

@markw65
Copy link
Author
markw65 commented Aug 21, 2023

I'm struggling with coverage. I've made a couple of changes to my new tests to hit the uncovered lines, and when I run as

npx jest -t mergeCharacterClasses

It shows that everything (in my file) is covered.

But when I run npm test or npm run coverage and run all the tests, it shows two lines as not covered... how does that happen?

[edit] Also, when I set breakpoints on the two supposedly uncovered lines, they both hit...

@hildjj
Copy link
Contributor
hildjj commented Aug 21, 2023

I'm going off what I see in the coverage/lcov-report directory after having run npm run build. That dir can sometimes be a little confusing because it doesn't get cleaned up every time, and the URLs change if you rerun with a subset of the tests. Try deleting the directory, then navigating to the correct file from index.html when you run a different set of tests.

Here are the three lines that I currently see uncovered in the full test suite: 22, 67, 73.

I don't see breakpoints hit on any of those, but checked to make sure that I have everything set up right by having breakpoints fire on other lines in your file.

@hildjj
Copy link
Contributor
hildjj commented Aug 21, 2023

Also, I'm on the Discord channel if you want to try and figure this out in realtime.

@markw65
Copy link
Author
markw65 commented Aug 21, 2023

I rewrote things a little:

  • some things that asClass was checking for were unnecessary because merge would have already handled them
  • preserve locations when merging nodes
  • only call cleanParts after merging all the alternates for efficiency

And then I updated the tests so that everything is covered.

@markw65
Copy link
Author
markw65 commented Aug 24, 2023

I think I addressed everything in the last update. Let me know if there's anything else to do here.

@hildjj
Copy link
Contributor
hildjj commented Aug 27, 2023

@Mingun, one last thumbs-up, please?

@hildjj
Copy link
Contributor
hildjj commented Aug 27, 2023

@Mingun this one needs a final thumbs-up as well, please.

Copy link
Member
@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My approval, except minor nits.

It seems we can slightly optimize by reducing number of unnecessary clones, but benefit also seems to be very low. So can leave this in current state.

We can add tests for ignoreCase cases.

Also I suggest to insert session.emitInfo() calls when node changed.

Comment on lines 82 to 91
const cls = asClass(ref);
// Return a clone, not the original, because the value we return
// may get merged into.
return cls && {
type: cls.type,
parts: cls.parts.slice(),
inverted: cls.inverted,
ignoreCase: cls.ignoreCase,
location: node.location,
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it could be better to request cloned data from asClass, because we need to clone only class nodes, other nodes already created by us as new objects.

So probably

return asClass(ref, true/* return a clone */);

here and asClass(..., false) otherwise else. Or even make clone at line 118

prev = cls;

It seems the clone is really needed only there

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.

While we're talking about clones though, it seems to be hard to produce generic clones in this project.

If I say Object.assign({}, objectToClone) eslint tells me to use { ...objectToClone }, but if I say { ...objectToClone }, it tells me Parsing error: Unexpected token ....

I don't see why we shouldn't use ... (it works with node14 and later, and will presumably get fixed by rollup)? Can I change the eslint settings?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you double-check that ... gets replaced by the rollup step, submit a bug to https://github.com/peggyjs/peggyjs-eslint-config and I'll fix it. There are a couple of other things I want to make sure are up-to-date while I'm in there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(it's actually the typescript step that is likely doing the replacement, which will be easier to check)

Comment on lines 2682 to 2687
expect(parser).to.failToParse("d", {
expected: [
{ type: "literal", text: "a", ignoreCase: false },
{ type: "literal", text: "b", ignoreCase: false },
{ type: "literal", text: "c", ignoreCase: false },
{ type: "literal", text: "aa", ignoreCase: false },
{ type: "literal", text: "bb", ignoreCase: false },
{ type: "literal", text: "cc", ignoreCase: false },
],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking, we should give parser a chance to not fail by parsing dd here and below :)

@markw65
Copy link
Author
markw65 commented Aug 30, 2023

Anything else to do here?

@markw65 markw65 mentioned this pull request Sep 6, 2023
@markw65
Copy link
Author
markw65 commented Sep 6, 2023

@hildjj @Mingun ping again?

@hildjj hildjj merged commit e25036f into peggyjs:main Sep 7, 2023
@markw65 markw65 deleted the merge-character-classes branch September 7, 2023 14:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0