The link prefix feature is enabled in 13 languages (ar, cu, cv, hy, is, ka, kaa, lbe, ln, mzn, pnb, uk, uz) via $linkPrefixExtension = true; in Messages<Code>.php.
This feature pulls characters immediately before a link into the label, e.g. be[[holden]] -> [[holden|beholden]].
The link suffix feature, which is enabled by default in all languages, has a default pattern of just letters a-z. Other languages extend this to include their alphabets.
The link prefix feature has a much broader default set of characters: a-z, A-Z, or any Unicode character from 0x80 upwards. While this includes every letter in every non-Latin alphabet, it also includes punctuation and various other non-letters. It is also inconsistent with the link suffix behaviour.
This means the effective list of characters that aren't matched is: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~, plus digits 0-9 and basic whitespace.
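As a rough check of that list, the default prefix set can be tested against printable ASCII. This is a sketch in Python rather than MediaWiki's PHP. It assumes the default set quoted under Status, with the PCRE `\x{..}` escapes rewritten as Python `\u`/`\U` escapes, and `unmatched_ascii` is a name made up for illustration.

```python
import re
import string

# Default prefix set (see Status), rewritten for Python's re module,
# which has no \x{...} syntax.
DEFAULT_PREFIX_CLASS = re.compile(r"[a-zA-Z\u0080-\U0010ffff]")

def unmatched_ascii():
    """Return the printable ASCII characters (space to ~) outside the default prefix set."""
    return "".join(
        chr(cp) for cp in range(0x20, 0x7F)
        if not DEFAULT_PREFIX_CLASS.fullmatch(chr(cp))
    )

print(unmatched_ascii())
```

Under these assumptions the result is the space character, the digits and the ASCII punctuation characters; every non-ASCII character is accepted.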
Some examples of inconsistencies this creates:
- be[[holden]] correctly matches
- @[[User:foo]] correctly doesn't match
- ▶[[User:foo]] incorrectly matches
- $[[1000]] correctly doesn't match
- £[[1000]] incorrectly matches (The A in ASCII is American, so they only included '$' :) )
- ¿[[Que]]? the ¿ incorrectly matches, but the ? doesn't.
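The examples above can be reproduced with a simplified model of the feature. This is a sketch, not MediaWiki's parser: it assumes the default prefix set quoted under Status (rewritten for Python's `re`), handles only unpiped links, and `apply_link_prefix` is a hypothetical helper.

```python
import re

# Default prefix set from the Status section, with \x{..} rewritten for Python's re.
DEFAULT_CHARSET = r"a-zA-Z\u0080-\U0010ffff"

# Simplified model: an unpiped [[target]] preceded by a run of prefix characters.
LINK = re.compile(rf"([{DEFAULT_CHARSET}]*)\[\[([^\]|]+)\]\]")

def apply_link_prefix(wikitext):
    """Fold any prefix run into the link label, e.g. be[[holden]] -> [[holden|beholden]]."""
    def fold(match):
        prefix, target = match.group(1), match.group(2)
        if not prefix:
            return match.group(0)  # nothing to pull in; leave the link untouched
        return f"[[{target}|{prefix}{target}]]"
    return LINK.sub(fold, wikitext)

for sample in ["be[[holden]]", "@[[User:foo]]", "▶[[User:foo]]",
               "$[[1000]]", "£[[1000]]", "¿[[Que]]?"]:
    print(sample, "->", apply_link_prefix(sample))
```

In this model the `@` and `$` cases are left unchanged, while `▶`, `£` and `¿` are pulled into the label in the same way as `be`.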
Some languages have set $linkPrefixCharset to something sensible, e.g. Icelandic matches it to its $linkTrail (suffix) character set, so that only letters and hyphens are matched.
However, many languages use the overly broad default character set (presumably because it includes their alphabet, and implementors didn't test whether it also included undesirable characters).
We should audit these 13 languages and restrict the prefix character set (by default to be the same as the suffix character set).
We will probably need to warn affected communities as some existing link labels may change (although visually any changes will be small and easily fixable).
Once this is done, we can reset the default to null so that languages turning on this feature in future have to explicitly define a character set.
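One way to seed the audit is to reuse each language's suffix set as a starting point for its prefix set. The sketch below assumes a linkTrail written in the same shape as the default quoted under Status (`/^([...]+)(.*)$/` plus modifiers). `charset_from_link_trail` is a hypothetical helper, not a MediaWiki function.

```python
import re

def charset_from_link_trail(link_trail):
    """Extract the character-class body from a linkTrail of the form /^([...]+)(.*)$/."""
    m = re.match(r"/\^\(\[(.+?)\]\+\)\(\.\*\)\$/", link_trail)
    if m is None:
        # Patterns with a different shape need to be handled by hand.
        raise ValueError(f"unrecognised linkTrail pattern: {link_trail!r}")
    return m.group(1)

print(charset_from_link_trail("/^([a-z]+)(.*)$/sD"))
```

A linkTrail with a different shape (kaa's, for instance) raises `ValueError` here and would need manual review. Closing quotation marks in a suffix set would probably also need swapping for opening ones before the set is used as a prefix set, as the cv note in the table suggests.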
Status
Default prefix = 'a-zA-Z\\x{80}-\\x{10ffff}'
Default linkTrail = '/^([a-z]+)(.*)$/sD'
Language | linkTrail | Prefix | Notes |
---|---|---|---|
ar Arabic | a-zء-ي | Default | Should be fixed to match linkTrail |
cu Church Slavic | a-zабвгде<!--truncated-->ќуўџэ҄я“»]+ | „« | Appears to be deliberately set to just quotation marks. Suffix setting is broader. Can be left alone. |
cv Chuvash | a-zа-яĕçăӳ"» | a-zA-Z"\\x{80}-\\x{10ffff} | Default + ", should be fixed to match linkTrail but with inverted quotes. |
hy Armenian | a-zաբգդեզէըթժիլխծկհձղճմյնշոչպջռսվտրցւփքօֆև«» | Default | Should be fixed to match linkTrail |
is Icelandic | áðéíóúýþæöa-z-– | áÁðÐéÉíÍóÓúÚýÝþÞæÆöÖA-Za-z–- | Nothing to fix! |
ka Georgian | a-zაბგდევზთიკლმნოპჟრსტუფქღყშჩცძწჭხჯჰ“» | Default | Should be fixed to match linkTrail |
kaa Karakalpak | [a-zıʼ’“»]'(?!') | a-zıA-Zİ\\x80-\\xff | Probably ok to be left as-is, but could include opening quotes as close quotes are in suffix. Not sure the 80-ff range is required. |
lbe Lak | a-zабвгдеёжзийклмнопрстуфхцчшщъыьэюяӀ1“» | Default | Should be fixed to match linkTrail |
ln Lingala | Default | Default | Should be fixed to match linkTrail (which is also empty). Not sure this was turned on deliberately, so it could also be turned off. |
mzn Mazanderani | Default | Default | Should probably be turned off, as linkTrail is only set to a-z and this language doesn't use the Latin alphabet. |
pnb Western Punjabi | Default | Default | Same as mzn |
uk Ukrainian | a-zабвгґдеєжзиіїйклмнопрстуфхцчшщьєюяёъы“» | „« | Appears to be deliberately set to just quotation marks. Suffix setting is broader. Can be left alone. |
uz Uzbek | a-zʻʼ“» | a-zA-Z\\x80-\\xffʻʼ«„ | Okay to be left as is. Not sure the 80-ff range is required. |
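To make the audit concrete, each prefix set in the table can be probed with the non-letters from the earlier examples. This is a sketch under stated assumptions: the sets are copied from the Prefix column, the `\\x{..}` and `\\x80-\\xff` escapes are read as Unicode code points and rewritten as Python `\u` escapes, and `leaked_probes` is a hypothetical helper.

```python
import re

# Prefix sets from the Status table, rewritten for Python's re module.
# cu and uk are omitted: the table lists only the quotation marks „« for them.
PREFIX_SETS = {
    "default": r"a-zA-Z\u0080-\U0010ffff",
    "cv": r'a-zA-Z"\u0080-\U0010ffff',
    "is": "áÁðÐéÉíÍóÓúÚýÝþÞæÆöÖA-Za-z–-",
    "kaa": r"a-zıA-Zİ\u0080-\u00ff",
    "uz": r"a-zA-Z\u0080-\u00ffʻʼ«„",
}

PROBES = ["£", "¿", "▶"]  # non-letters taken from the examples in this task

def leaked_probes(charset):
    """Return the probe characters that the given prefix set would accept."""
    char_class = re.compile(f"[{charset}]")
    return [p for p in PROBES if char_class.fullmatch(p)]

for code, charset in PREFIX_SETS.items():
    print(code, leaked_probes(charset))
```

If the code-point reading of the 80-ff range is right, the kaa and uz sets would still accept `£` and `¿` (both lie between U+0080 and U+00FF) but not `▶`, which may be a reason to revisit that range.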