More strict rules for group numbers and names in RE #91760
…group names in RE Only sequence of ASCII digits not starting with 0 (except group 0) is now accepted as a numerical reference. The group name in bytes patterns and replacement strings can now only contain ASCII letters and digits and underscore.
…id in future Only sequence of ASCII digits not starting with 0 (except group 0) will be accepted as a numerical reference. The group name in bytes patterns and replacement strings could only contain ASCII letters and digits and underscore.
A deprecation period of at least one release would be good.
Could you please review the PR? I am not sure about forbidding the initial zero. On one hand,
…future (GH-91794) Only sequence of ASCII digits will be accepted as a numerical reference. The group name in bytes patterns and replacement strings could only contain ASCII letters and digits and underscore.
…names in RE (GH-91792) Only sequence of ASCII digits is now accepted as a numerical reference. The group name in bytes patterns and replacement strings can now only contain ASCII letters and digits and underscore.
There were unintentional changes in parsing regular expressions between Python 2 and Python 3.
Group references.
In patterns and replacement strings you can refer to a group by its number using the syntax `\N`, where N is a 1-2 digit decimal number. The number should not start with 0, because it would then be treated as an octal escape sequence. The group number can also be used in the conditional expression `(?(N)...)` in patterns and in references `\g<N>` in replacement strings. Interestingly, in Python 3 it can be more than just a sequence of decimal digits. The following are all allowed as the group number: `\g<01>`, `\g< 1 >`, `\g<1_2>`, `\g<¹>`, `\g<१>`.
All this is purely an implementation artifact. After `\g<` we search for the nearest `>` and pass the substring between `<` and `>` to `int()`. In another implementation we could search for the longest sequence of decimal digits, and all the above examples (except maybe the first one) would be filtered out automatically.
Group names.
In `(?P<name>...)`, `(?P=name)`, `(?(name)...)` and `\g<name>` we can refer to groups by name. To avoid ambiguity there is a limitation: the name must follow the rules for identifiers. In Python 2 this means that it may contain only letters, digits and underscores and must start with a non-digit. Letters and digits are ASCII-only: [A-Za-z] and [0-9].
In Python 3 identifiers can contain non-ASCII letters and digits, which is good. But in bytes patterns and replacement strings the codes `\xaa`, `\xb2`, `\xb3`, `\xb5`, `\xb9`, `\xba`, `\xc0`-`\xd6`, `\xd8`-`\xf6`, `\xf8`-`\xff` are also allowed in the group name. They correspond to the characters ª²³µ¹ºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ after decoding.
This is an implementation artifact too. Bytes patterns and replacement strings are decoded with the Latin1 encoding for parsing, which simplifies and speeds up the code. There is no other reason why letters and digits in the range U+0080-U+00FF are allowed.
Note that in Python 3 a bytes literal can only contain printable characters in the ASCII range. Codes outside this range must be written as octal or hexadecimal escape sequences, so supporting non-ASCII letters and digits does not add to readability.
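To illustrate the Latin1 artifact for group names, here is a minimal sketch. The helper names `decode_for_parsing` and `is_ascii_group_name` are hypothetical and are not part of the `re` module; the sketch only shows why a non-ASCII byte can end up looking like a valid identifier character after decoding, and what an ASCII-only check might look like.

```python
def decode_for_parsing(name: bytes) -> str:
    """Decode a bytes group name the way the issue describes (Latin1).

    Latin1 maps every byte 0x00-0xFF to the code point with the same value,
    so decoding never fails.
    """
    return name.decode("latin-1")


def is_ascii_group_name(name: bytes) -> bool:
    """Hypothetical stricter rule: ASCII letters, digits and underscore only,
    not starting with a digit."""
    text = decode_for_parsing(name)
    return text.isascii() and text.isidentifier()


raw = b"caf\xe9"                 # the byte 0xE9 must be written as an escape
text = decode_for_parsing(raw)   # becomes the str 'caf\xe9', i.e. "café"

# After decoding, the name passes the Python 3 identifier check ...
looks_like_identifier = text.isidentifier()
# ... but it is rejected by the ASCII-only rule.
accepted_by_strict_rule = is_ascii_group_name(raw)
```

With this check, a name such as `b"cafe_1"` is still accepted, while `b"caf\xe9"` and `b"1abc"` are not.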
Since the above "features" are not intentional, are not supported by most other RE engines (except `regex`, which is also written in Python), are not tested, and could change as a result of refactoring the parser, I suggest introducing stricter rules for group numbers and names.
The question: do we need a deprecation period for this? I have written code for both options (with deprecation and with error) and will create PRs tomorrow.
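For group numbers, the difference between handing the raw substring to `int()` and accepting only ASCII digits can be sketched as follows. Both function names are hypothetical and this is not the actual parser code; it only compares what `int()` accepts with a stricter check, using some of the strings mentioned under group references.

```python
def lenient_group_number(text: str) -> int:
    """Mimic the artifact: pass the raw substring straight to int()."""
    return int(text)


def strict_group_number(text: str) -> int:
    """Hypothetical stricter rule: a non-empty sequence of ASCII digits."""
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"bad group number {text!r}")
    return int(text)


# "\u0967" is DEVANAGARI DIGIT ONE, written as an escape for portability.
candidates = ["1", "01", " 1 ", "1_2", "\u0967"]

for candidate in candidates:
    try:
        strict = strict_group_number(candidate)
    except ValueError:
        strict = None  # rejected by the stricter rule
    # ascii() keeps the output printable on any terminal encoding.
    print(ascii(candidate), lenient_group_number(candidate), strict)
```

In this sketch leading zeros are still accepted by the strict variant; whether to forbid them is a separate decision.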