Add a pass to simplify single-character choices #425
Conversation
I've put it under an option, …
Since that test is specifically designed to ensure that an appropriate error message is output for …
I generally agree that such an optimization is very useful, but we need more tests to ensure that it works as expected and does not process pieces it shouldn't.
Can you please add tests for negative scenarios, at least?
I'm not sure whether we should add a setting for this optimization, or whether it should be enabled by default. It seems to me that a setting is superfluous and we can always apply this optimization. So yes, I'm fine with changing the expected error messages in the failing tests.
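For instance, negative cases could be grammars like the following (hypothetical examples, written in the test-string style quoted just below), which the pass should leave completely untouched:

```js
// Hypothetical negative cases: none of these should be merged, because the
// alternatives are not plain single-character literals or classes.
const negativeGrammars = [
  "start = 'ab' / [c-d]",              // multi-character literal
  "start = x:'e' { return x; } / [f]", // label + action block must be preserved
  "start = 'g'i / [h-k]",              // mixed ignoreCase (presumably not merged)
];
```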
" / 'a' / [c-g]", | ||
"two = three / 'P' / [Q-T]", | ||
"three = 'x' / [u-w]", |
You also need to test overlapping ranges. Also, I think you need to add negative tests that ensure significant parts of the grammar don't get lost: for example, if the `three` rule contains a `$` operator or an action block.
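For instance, an overlapping-range case might look like this (a hypothetical input; the exact normalization the pass applies to the merged parts isn't specified here):

```js
// Hypothetical overlapping-range case: [a-f] and [d-k] overlap, so the merged
// class should cover a through k one way or another (e.g. [a-k] or [a-fd-k]).
const overlappingGrammar = "start = [a-f] / [d-k] / 'x'";
```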
Mostly agreed - I'll certainly add more tests.

> for example, if the `three` rule contains a `$` operator or an action block

In that case, its expression's type will be `text` or `action`, and `asClass` will ignore it, so I don't need to do anything to handle that.
But you made me realize that I should explicitly check for `text` and look through it, because we actually want `foo = "a" / $("b" / [xyz]) / "X"` to get converted to `foo = [Xabxyz]`.
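To make that concrete, the `text` handling might look roughly like this (a sketch of the idea only, not the code in this PR; the node shapes follow the snippets quoted elsewhere in this review):

```js
// Sketch: if a node is (or wraps) something expressible as a character class,
// return a class node for it; otherwise return null so the caller leaves the
// alternative alone. Inverted classes are skipped here for simplicity.
function asClass(node) {
  switch (node.type) {
    case "class":
      return node.inverted ? null : node;
    case "literal":
      // Only single-character literals can be folded into a class.
      return node.value.length === 1
        ? {
            type: "class",
            parts: [node.value],
            inverted: false,
            ignoreCase: node.ignoreCase,
            location: node.location,
          }
        : null;
    case "text":
      // $(...) changes only the result value, not what is matched, so it is
      // safe to look through it (the case discussed above).
      return asClass(node.expression);
    default:
      // Actions, sequences, rule references, etc. are never mergeable.
      return null;
  }
}
```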
const a1 = Array.isArray(a) ? a[0] : a;
const b1 = Array.isArray(b) ? b[0] : b;
I think that for clarity it is better to explicitly define the start and end boundaries:

Suggested change:
-const a1 = Array.isArray(a) ? a[0] : a;
-const b1 = Array.isArray(b) ? b[0] : b;
+const [aStart, aEnd] = Array.isArray(a) ? a : [a, a];
+const [bStart, bEnd] = Array.isArray(b) ? b : [b, b];

Also, maybe extract the whole sort-and-remove logic into a utils method and test it separately.
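Something along these lines, for example (a sketch with invented names, not code from this PR); a helper like this could live in a shared utils module and be unit-tested on its own:

```js
// Sketch: sort class parts and merge overlapping or adjacent ranges.
// Parts are either single characters ("a") or [start, end] pairs (["c", "g"]).
function compareClassParts(a, b) {
  const [aStart, aEnd] = Array.isArray(a) ? a : [a, a];
  const [bStart, bEnd] = Array.isArray(b) ? b : [b, b];
  if (aStart !== bStart) { return aStart < bStart ? -1 : 1; }
  return aEnd === bEnd ? 0 : (aEnd < bEnd ? -1 : 1);
}

function mergeClassParts(parts) {
  // Copy every part into [start, end] form so the caller's arrays are never
  // mutated, then sort by start (with end as a tie-breaker).
  const sorted = parts
    .map(p => (Array.isArray(p) ? [p[0], p[1]] : [p, p]))
    .sort(compareClassParts);
  const merged = [];
  for (const [start, end] of sorted) {
    const prev = merged[merged.length - 1];
    if (prev && start.charCodeAt(0) <= prev[1].charCodeAt(0) + 1) {
      // Overlapping or directly adjacent: widen the previous range if needed.
      if (end > prev[1]) { prev[1] = end; }
    } else {
      merged.push([start, end]);
    }
  }
  // Collapse one-character ranges back into plain characters.
  return merged.map(([s, e]) => (s === e ? s : [s, e]));
}
```

For example, `mergeClassParts(["b", ["a", "f"], ["e", "k"]])` would come back as `[["a", "k"]]`.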
I just realized I need to update …
This looks good to me, I think. @Mingun any other comments?
Except one needed addition: a CHANGELOG entry, please.
I put it in "minor" updates - let me know if if should be somewhere else... |
Sorry for moving the goalposts, but once I thought of coverage on #427, there are a few lines that need to be hit on this one in lib/compiler/passes/merge-character-classes.js, unless they are particularly hard to reach.
Need to check how this optimization will affect source map generation. Other than that, all is good
    return asClass(node.expression);
  }
  if (node.type === "literal" && node.value.length === 1) {
    return { type: "class", parts: [node.value], inverted:false, ignoreCase: node.ignoreCase };
return { type: "class", parts: [node.value], inverted:false, ignoreCase: node.ignoreCase }; | |
return { type: "class", parts: [node.value], inverted: false, ignoreCase: node.ignoreCase }; |
> Need to check how this optimization will affect source map generation

Good point. I think there's one small issue I need to address...
I'm struggling with coverage. I've made a couple of changes to my new tests to hit the uncovered lines, and when I run … it shows that everything (in my file) is covered. But when I run …
[edit] Also, when I set breakpoints on the two supposedly uncovered lines, they both hit...
I'm going off what I see in the … Here are the three lines that I currently see uncovered in the full test suite: 22, 67, 73. I don't see breakpoints hit on any of those, but I checked to make sure that I have everything set up right by having breakpoints fire on other lines in your file.
Also, I'm on the Discord channel if you want to try and figure this out in real time.
I rewrote things a little: …
And then I updated the tests so that everything is covered.
I think I addressed everything in the last update. Let me know if there's anything else to do here.
@Mingun, one last thumbs-up, please?
@Mingun this one needs a final thumbs-up as well, please.
My approval, except for minor nits.
It seems we can slightly optimize by reducing the number of unnecessary clones, but the benefit also seems to be very low, so we can leave this in its current state.
We could add tests for `ignoreCase` cases.
Also, I suggest inserting `session.emitInfo()` calls when a node is changed.
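For the `ignoreCase` suggestion, the tests might look something like this (hypothetical grammars and expectations, assuming the pass only merges alternatives whose `ignoreCase` flags agree):

```js
// Hypothetical ignoreCase cases: a shared flag should merge into one
// case-insensitive class; mixed flags should be left alone. The exact
// normalization of the merged parts depends on the pass.
const ignoreCaseCases = [
  { grammar: "start = 'a'i / 'b'i / [c-d]i", expected: "start = [abc-d]i" },
  { grammar: "start = 'a'i / [b-c]",         expected: "start = 'a'i / [b-c]" },
];
```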
const cls = asClass(ref);
// Return a clone, not the original, because the value we return
// may get merged into.
return cls && {
  type: cls.type,
  parts: cls.parts.slice(),
  inverted: cls.inverted,
  ignoreCase: cls.ignoreCase,
  location: node.location,
};
Maybe it would be better to request cloned data from `asClass`, because we only need to clone `class` nodes; the other node kinds are already created by us as new objects.
So probably
return asClass(ref, true /* return a clone */);
here, and `asClass(..., false)` everywhere else. Or even make the clone at line 118:
prev = cls;
It seems the clone is really needed only there.
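That could look roughly like this, extending the earlier sketch with the suggested clone flag (my guess at the suggestion, not the code that was merged):

```js
// Sketch: asClass(node, clone) returns the class node itself when clone is
// false, and a copy that is safe to merge into when clone is true.
function asClass(node, clone) {
  if (node.type === "class" && !node.inverted) {
    if (!clone) {
      return node;
    }
    return {
      type: node.type,
      parts: node.parts.slice(),
      inverted: node.inverted,
      ignoreCase: node.ignoreCase,
      location: node.location,
    };
  }
  if (node.type === "literal" && node.value.length === 1) {
    // Freshly built object, so the clone flag is irrelevant here.
    return {
      type: "class",
      parts: [node.value],
      inverted: false,
      ignoreCase: node.ignoreCase,
      location: node.location,
    };
  }
  if (node.type === "text") {
    return asClass(node.expression, clone);
  }
  return null;
}
```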
Makes sense.
While we're talking about clones, though, it seems to be hard to produce generic clones in this project.
If I write `Object.assign({}, objectToClone)`, eslint tells me to use `{ ...objectToClone }`, but if I write `{ ...objectToClone }`, it tells me `Parsing error: Unexpected token ...`.
I don't see why we shouldn't use `...` (it works with Node 14 and later, and will presumably get fixed by rollup). Can I change the eslint settings?
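(For reference: that parsing error usually just means the eslint config pins `parserOptions.ecmaVersion` below 2018, which is when object rest/spread landed. A local override such as the following would be enough to confirm that; the exact shape of the shared config is an assumption here.)

```js
// Hypothetical .eslintrc.js override: object spread needs the parser set to
// ES2018 or later, which the shared config may currently pin lower.
module.exports = {
  parserOptions: {
    ecmaVersion: 2018,
  },
};
```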
If you double-check that `...` gets replaced by the rollup step, submit a bug to https://github.com/peggyjs/peggyjs-eslint-config and I'll fix it. There are a couple of other things I want to make sure are up-to-date while I'm in there.
(It's actually the TypeScript step that is likely doing the replacement, which will be easier to check.)
expect(parser).to.failToParse("d", {
  expected: [
    { type: "literal", text: "a", ignoreCase: false },
    { type: "literal", text: "b", ignoreCase: false },
    { type: "literal", text: "c", ignoreCase: false },
    { type: "literal", text: "aa", ignoreCase: false },
    { type: "literal", text: "bb", ignoreCase: false },
    { type: "literal", text: "cc", ignoreCase: false },
  ],
Strictly speaking, we should give the parser a chance to not fail, by parsing `dd` here and below :)
Anything else to do here?
I have a modified version of examples/xml.peggy in my project, and was trying to optimize it for speed. I found a few places where it could be rewritten to avoid backtracking, getting about a 20% win, but with some readability cost.
Then I noticed that there are a number of long lists of alternate character classes, and they end up being implemented very inefficiently; effectively it tries one regex after another in sequence.
I manually combined all the character classes for BaseChar into a single character class, and got a nearly 2x speed up.
So I've added a pass that does that automatically. The result is a more than 3x speedup parsing the xml files that I work with (which seem fairly typical).
The pass takes a rule like `chars: "a" / "b" / [d-r]` and turns it into `chars: [abd-r]`.
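To sketch the overall shape of such a pass (an illustrative reconstruction only: the visitor path, pass signature, and helper are assumptions based on the snippets quoted in the review, and the merged parts would still need the sort/normalize step discussed earlier):

```js
"use strict";

const visitor = require("../visitor"); // path within lib/compiler assumed

// Minimal asClass for this sketch: return a merge-safe class clone for a
// single-character alternative, or null if the alternative must be kept.
function asClass(node) {
  if (node.type === "class" && !node.inverted) {
    return { ...node, parts: node.parts.slice() };
  }
  if (node.type === "literal" && node.value.length === 1) {
    return {
      type: "class",
      parts: [node.value],
      inverted: false,
      ignoreCase: node.ignoreCase,
      location: node.location,
    };
  }
  return null;
}

// Illustrative pass: fold runs of single-character alternatives inside each
// choice into one class node.
function mergeCharacterClasses(ast) {
  const merge = visitor.build({
    choice(node) {
      // Visit nested expressions first so inner choices are already merged.
      node.alternatives.forEach(alt => merge(alt));

      const alternatives = [];
      let prev = null;
      for (const alt of node.alternatives) {
        const cls = asClass(alt);
        if (prev && cls && prev.ignoreCase === cls.ignoreCase) {
          // Fold this alternative into the class built so far.
          prev.parts.push(...cls.parts);
        } else {
          alternatives.push(cls || alt);
          prev = cls;
        }
      }
      // A full implementation would also sort and deduplicate the merged
      // parts (and could report the change) before rewriting the node.
      node.alternatives = alternatives;
    },
  });

  merge(ast);
}

module.exports = mergeCharacterClasses;
```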