-
Notifications
You must be signed in to change notification settings - Fork 560
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
//g flag on regex with UTF-8 source causes regex optimiser to wrongly reject a match #16806
Comments
From @nwc10Created by @nwc10For the case where the eval'd source code contains Ā in a code comment this nicholas@dromedary-001 perl6$ cat /tmp/rule.pl my $mr = "L\xFCften Kalt"; # Culprit here is the //g flag: if ($got) { __END__ If one removes the //g flag, it does: nicholas@dromedary-001 perl6$ cat /tmp/rule.pl-no-g my $mr = "L\xFCften Kalt"; # Culprit here is the //g flag: if ($got) { __END__ I would expect that it should match always, independent of whether //g is Running with -Dr suggests that the culprit is the regex optimiser, [snip] EXECUTING... Compiling REx "L%x{fc}ften Kalt" This does not seem to be a regression - everything I sampled back to 5.6.0 Nicholas Clark Perl Info
|
@nwc10 - Status changed from 'new' to 'open' |
From @khwilliamsonOn 1/9/19 7:11 AM, Nicholas Clark (via RT) wrote:
My gvim syntax highlighter immediately showed that \x100 is \x10 |
From @nwc10On Wed, Jan 09, 2019 at 09:49:59AM -0700, Karl Williamson wrote:
Thanks for the rapid response. I think that you might be right, but will This would mean that I've failed to correctly reduce the original problem Nicholas Clark |
From @khwilliamsonOn 1/9/19 10:14 AM, Nicholas Clark wrote:
I looked at it a little more, and it seems weird that a comment would
|
From @tonycozYou're going to facepalm yourself. On Wed, 09 Jan 2019 06:11:06 -0800, nicholas wrote:
Add: pos($text) = 0; here and they all pass.
Tony |
From @nwc10On Wed, Jan 09, 2019 at 04:27:35PM -0800, Tony Cook via RT wrote:
Yes. :-)
Yes. I totally failed to figure out the proper test case. I now have a fresh On Wed, Jan 09, 2019 at 10:25:42AM -0700, Karl Williamson wrote:
OK, *here* is the test case. It's very sensitive to string length: nc@I34 perl2$ cat /tmp/133756.pl my $mr = "Programme: Automatic plus, Baumwolle, Pflegeleicht, Synthetic, Finish Wolle, Express, Jeans, Schongl\344tten, L\374ften Warm, L\374ften Kalt"; if (shift) { # use Devel::Peek; Dump($mr); my $got = eval "\$text =~ /$mr/i;"; if ($got) { __END__ It's "fixed" in blead: 792b461 is the first bad commit regcomp.c: Pack EXACTish nodes more fully Prior to this commit, nodes that are to match a string exactly, or :100644 100644 6dbfed52ab7d78f0909fce42f000127a13295155 74c8eb2e21b622e7c40dd726933f457353bfa9ee M regcomp.c and that commit is just a refactoring - no bug fixes, no new tests: commit 792b461 regcomp.c: Pack EXACTish nodes more fully regcomp.c | 33 ++++++++++++++------------------- Inline Patchdiff --git a/regcomp.c b/regcomp.c
index 6dbfed52ab..74c8eb2e21 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -13287,8 +13287,12 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
UV ender = 0;
char *p;
char *s;
-#define MAX_NODE_STRING_SIZE 127
+
+/* This allows us to fill a node with just enough spare so that if the final
+ * character folds, its expansion is guaranteed to fit */
+#define MAX_NODE_STRING_SIZE (255-UTF8_MAXBYTES_CASE)
char foldbuf[MAX_NODE_STRING_SIZE+UTF8_MAXBYTES_CASE+1];
+
char *s0;
U8 upper_parse = MAX_NODE_STRING_SIZE;
U8 node_type = compute_EXACTish(pRExC_state);
@@ -13332,24 +13336,15 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
* use a pseudo regnode like 'EXACT_ORIG_FOLD' */
maybe_exact = FOLD && PASS2;
- /* XXX The node can hold up to 255 bytes, yet this only goes to
- * 127. I (khw) do not know why. Keeping it somewhat less than
- * 255 allows us to not have to worry about overflow due to
- * converting to utf8 and fold expansion, but that value is
- * 255-UTF8_MAXBYTES_CASE. join_exact() may join adjacent nodes
- * split up by this limit into a single one using the real max of
- * 255. Even at 127, this breaks under rare circumstances. If
- * folding, we do not want to split a node at a character that is a
- * non-final in a multi-char fold, as an input string could just
- * happen to want to match across the node boundary. The join
- * would solve that problem if the join actually happens. But a
- * series of more than two nodes in a row each of 127 would cause
- * the first join to succeed to get to 254, but then there wouldn't
- * be room for the next one, which could at be one of those split
- * multi-char folds. I don't know of any fool-proof solution. One
- * could back off to end with only a code point that isn't such a
- * non-final, but it is possible for there not to be any in the
- * entire node. */
+ /* This breaks under rare circumstances. If folding, we do not
+ * want to split a node at a character that is a non-final in a
+ * multi-char fold, as an input string could just happen to want to
+ * match across the node boundary. The code at the end of the loop
+ * looks for this, and backs off until it finds not such a
+ * character, but it is possible (though extremely, extremely
+ * unlikely) for all characters in the node to be non-final fold
+ * ones, in which case we just leave the node fully filled, and
+ * hope that it doesn't match the string in just the wrong place */
assert( ! UTF /* Is at the beginning of a character */
|| UTF8_IS_INVARIANT(UCHARAT(RExC_parse))
Given that that commit just changes the value of MAX_NODE_STRING_SIZE The original problem was revealed as exactly 2 lines of difference in a pair So whatever it is, it's obscure. Nicholas Clark |
From @khwilliamsonOn 1/10/19 2:05 AM, Nicholas Clark wrote:
Thank you for reporting this. And your surmisal was correct. There is a real bug, as hopefully From 88c2ae302de11db0b66ff2269e2c4e9bf46efd69 Mon Sep 17 00:00:00 2001 This was caused by a counting error. An EXACTFish regnode has a finite length it can hold for the string A problem occurs if a regnode ends with one of the 22 characters in What happens if a node ends with one of the 22, is that the node is A /i node is filled with the fold of the input, for several reasons. This bug was that the code thought a two character backup was really a This bug is actually very unlikely to occur in most real world One of the 22 characters is 't', so long strings of DNA "ACTG" could |
From @jkeenanOn Fri, 11 Jan 2019 05:01:12 GMT, public@khwilliamson.com wrote: [snip]
I don't think the line above is correct. When I update blead the commit appears to be: ##### PATCH: [perl #133756] Failure to match properly ... Thank you very much. -- |
From @khwilliamsonOn Fri, 11 Jan 2019 05:48:17 -0800, jkeenan wrote:
This is the correct commit number for the fix. Now closing this ticket.
-- |
@khwilliamson - Status changed from 'open' to 'pending release' |
From @khwilliamsonThank you for filing this report. You have helped make Perl better. With the release today of Perl 5.30.0, this and 160 other issues have been Perl 5.30.0 may be downloaded via: If you find that the problem persists, feel free to reopen this ticket. |
@khwilliamson - Status changed from 'pending release' to 'resolved' |
Migrated from rt.perl.org#133756 (status was 'resolved')
Searchable as RT133756$
The text was updated successfully, but these errors were encountered: