Default $linkPrefixCharset is too broad
Open, Needs Triage, Public

Assigned To

None

Authored By

	Esanders
	Sep 1 2020, 1:16 PM

Description

The link prefix feature is enabled in 13 languages (ar, cu, cv, hy, is, ka, kaa, lbe, ln, mzn, pnb, uk, uz) via $linkPrefixExtension = true; in Messages<Code>.php.

This feature pulls characters immediately before a link into the label, e.g. be[[holden]] -> [[holden|beholden]].
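The folding described above can be sketched in a few lines. This is only an illustration, not MediaWiki's parser code: the function name `apply_link_prefix` is hypothetical, the charset is passed in as the body of a regex character class, and only unpiped links are handled.

```python
import re


def apply_link_prefix(wikitext, charset):
    """Fold a run of `charset` characters directly before [[target]] into the label.

    `charset` is the inside of a regex character class, e.g. "a-z".
    Hypothetical helper for illustration; piped links are left alone.
    """
    pattern = re.compile(r"([" + charset + r"]+)\[\[([^\[\]|]+)\]\]")

    def fold(match):
        prefix, target = match.group(1), match.group(2)
        return "[[" + target + "|" + prefix + target + "]]"

    return pattern.sub(fold, wikitext)


print(apply_link_prefix("be[[holden]]", "a-z"))   # [[holden|beholden]]
print(apply_link_prefix("@[[User:foo]]", "a-z"))  # @[[User:foo]] (unchanged)
```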

The link suffix feature, which is enabled by default in all languages, has a default pattern of just letters a-z. Other languages extend this to include their alphabets.

The link prefix feature has a much broader default set of characters: a-z, A-Z, or any character at or above U+0080 in Unicode. While this includes every letter in every non-Latin alphabet, it also includes punctuation and various non-letters. It is also inconsistent with the link suffix behaviour.

This means the effective list of characters that aren't matched is the ASCII punctuation !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~, plus the digits 0-9 and basic whitespace.
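The list of unmatched characters can be checked mechanically. The sketch below assumes the default prefix charset quoted in this task, rewritten for Python's `re` module; the helper name `unmatched_printable_ascii` is made up for this example.

```python
import re
import string

# Default prefix charset from this task, rewritten for Python's `re`
# (PCRE's x{80}-x{10ffff} range becomes the code points U+0080 to U+10FFFF).
DEFAULT_PREFIX_CLASS = re.compile("[a-zA-Z\u0080-\U0010ffff]")


def unmatched_printable_ascii():
    """Return the printable ASCII characters (0x20-0x7E) the default class rejects."""
    return "".join(
        chr(cp) for cp in range(0x20, 0x7F)
        if not DEFAULT_PREFIX_CLASS.match(chr(cp))
    )


rejected = unmatched_printable_ascii()
print(repr(rejected))
# Compare against space + ASCII punctuation + digits.
print(set(rejected) == set(" " + string.punctuation + string.digits))
```

Everything outside that small ASCII set, including every symbol and punctuation mark at or above U+0080, is accepted as a prefix character.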

Some examples of inconsistencies this creates:

be[[holden]] correctly matches
@[[User:foo]] correctly doesn't match
▶[[User:foo]] incorrectly matches
$[[1000]] correctly doesn't match
£[[1000]] incorrectly matches (The A in ASCII is American, so they only included '$' :) )
¿[[Que]]? the ¿ incorrectly matches, but the ? doesn't.
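The examples above can be reproduced by testing each leading character against the default charset and comparing the result with its Unicode general category. This is a sketch under two assumptions: the charset is the default quoted in this task (rewritten for Python's `re`), and "should match" is taken to mean "is a letter", which is a simplification (Icelandic, for instance, also allows hyphens).

```python
import re
import unicodedata

# Assumed Python equivalent of the default prefix charset.
DEFAULT_PREFIX_CLASS = re.compile("[a-zA-Z\u0080-\U0010ffff]")


def classify(ch):
    """Return (matched_by_default_charset, is_unicode_letter) for one character."""
    matched = DEFAULT_PREFIX_CLASS.fullmatch(ch) is not None
    is_letter = unicodedata.category(ch).startswith("L")
    return matched, is_letter


for ch in ["e", "@", "▶", "$", "£", "¿", "?"]:
    matched, is_letter = classify(ch)
    verdict = "consistent" if matched == is_letter else "inconsistent"
    print("U+%04X %s matched=%s letter=%s -> %s"
          % (ord(ch), unicodedata.category(ch), matched, is_letter, verdict))
```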

Some languages have set $linkPrefixCharset to something sensible, e.g. Icelandic matches it to their $linkTrail (suffix) character set, so that only letters and hyphens will be matched.

However, many languages use the overly broad default character set (presumably because it includes their alphabet, and implementors didn't bother to test if it included undesirable characters).

We should audit these 13 languages and restrict the prefix character set (by default to be the same as the suffix character set).
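One way to start such an audit is to derive a candidate prefix charset from the existing suffix pattern. The sketch below assumes the linkTrail has the same shape as the default quoted in this task (`/^([...]+)(.*)$/flags`); the helpers `charset_from_link_trail` and `prefix_matches` are hypothetical names, and patterns with a different shape are rejected rather than guessed at.

```python
import re


def charset_from_link_trail(link_trail):
    """Extract the character class body from a linkTrail shaped like /^([...]+)(.*)$/flags."""
    found = re.search(r"\(\[(.*?)\]\+\)", link_trail)
    if found is None:
        raise ValueError("unrecognised linkTrail shape: " + link_trail)
    return found.group(1)


def prefix_matches(ch, charset):
    """Check whether a single character falls inside the given character class body."""
    return re.fullmatch("[" + charset + "]", ch) is not None


default_trail = "/^([a-z]+)(.*)$/sD"
charset = charset_from_link_trail(default_trail)
print(charset)                       # a-z
print(prefix_matches("e", charset))  # True
print(prefix_matches("£", charset))  # False
```

A straight copy of the suffix set may still need extending: the Icelandic row in the status table adds the uppercase forms of each letter to its prefix set, since a prefix can begin a word.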

We will probably need to warn affected communities as some existing link labels may change (although visually any changes will be small and easily fixable).

Once this is done, we can reset the default to null so that languages turning on this feature in future have to explicitly define a character set.

References

Status

Default prefix = 'a-zA-Z\\x{80}-\\x{10ffff}'
Default linkTrail = '/^([a-z]+)(.*)$/sD'

Language	linkTrail	Prefix	Notes
ar Arabic	a-zء-ي	Default	Should be fixed to match linkTrail
cu Church Slavic	a-zабвгде<!--truncated-->ќуўџэ҄я“»]+	„«	Appears to be deliberately set to just quotation marks. Suffix setting is broader. Can be left alone
cv Chuvash	a-zа-яĕçăӳ"»	a-zA-Z"\\x{80}-\\x{10ffff}	Default + ", should be fixed to match linkTrail but with inverted quotes.
hy Armenian	a-zաբգդեզէըթժիլխծկհձղճմյնշոչպջռսվտրցւփքօֆև«»	Default	Should be fixed to match linkTrail
is Icelandic	áðéíóúýþæöa-z-–	áÁðÐéÉíÍóÓúÚýÝþÞæÆöÖA-Za-z–-	Nothing to fix!
ka Georgian	a-zაბგდევზთიკლმნოპჟრსტუფქღყშჩცძწჭხჯჰ“»	Default	Should be fixed to match linkTrail
kaa Karakalpak	[a-zıʼ’“»]'(?!')	a-zıA-Zİ\\x80-\\xff	Probably ok to be left as-is, but could include opening quotes as close quotes are in suffix. Not sure the 80-ff range is required.
lbe Lak	a-zабвгдеёжзийклмнопрстуфхцчшщъыьэюяӀ1“»	Default	Should be fixed to match linkTrail
ln Lingala	Default	Default	Should be fixed to match linkTrail (which is also empty). Not sure this was turned on deliberately, so it could also be turned off.
mzn Mazanderani	Default	Default	Should probably be turned off as linkTrail is only set to a-z and this language doesn't use the Latin alphabet.
pnb Western Punjabi	Default	Default	Same as mzn
uk Ukrainian	a-zабвгґдеєжзиіїйклмнопрстуфхцчшщьєюяёъы“»	„«	Appears to be deliberately set to just quotation marks. Suffix setting is broader. Can be left alone
uz Uzbek	a-zʻʼ“»	a-zA-Z\\x80-\\xffʻʼ«„	Okay to be left as is. Not sure the 80-ff range is required.

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T261750 Default $linkPrefixCharset is too broad
		Resolved		Esanders	T263266 $linkPrefixCharset is too broad in Arabic

Event Timeline

Esanders created this task. Sep 1 2020, 1:16 PM

Restricted Application added a subscriber: Aklapper. Sep 1 2020, 1:16 PM

Esanders mentioned this in T258743: Change what symbol is prepended to pings on ar.wiki. Sep 1 2020, 1:18 PM

Esanders updated the task description. Sep 1 2020, 3:02 PM

Esanders added a project: Editing-team (Kanban Board).

Esanders merged a task: T261707: Fix link prefixes.

Tacsipacsi mentioned this in T261709: Change ar.wiki ping prefix. Sep 1 2020, 6:30 PM

JTannerWMF moved this task from Incoming to Ready to Be Worked On on the Editing-team (Kanban Board) board. Sep 2 2020, 4:45 PM

JTannerWMF moved this task from Ready to Be Worked On to Upcoming on the Editing-team (Kanban Board) board.

Esanders updated the task description. Sep 18 2020, 4:19 PM

Restricted Application added a subscriber: Base. Sep 18 2020, 4:19 PM

Once all the languages listed as "Default" above are fixed, we can remove the default value in MessagesEn.php and instead throw an error if a $linkPrefixCharset is not set while $linkPrefixExtension is enabled.
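The proposed check could look roughly like the following. This is a sketch of the idea only: the function `resolve_prefix_charset` and the exception `LinkPrefixConfigError` are invented for this example and do not correspond to existing MediaWiki code.

```python
from typing import Optional


class LinkPrefixConfigError(ValueError):
    """Hypothetical error: prefix feature enabled without an explicit charset."""


def resolve_prefix_charset(link_prefix_extension: bool,
                           link_prefix_charset: Optional[str]) -> Optional[str]:
    """Return the charset to use, or None when the prefix feature is off.

    With no default to fall back on, an enabled feature without a charset
    is treated as a configuration error instead of silently matching
    almost everything.
    """
    if not link_prefix_extension:
        return None
    if not link_prefix_charset:
        raise LinkPrefixConfigError(
            "linkPrefixExtension is enabled but linkPrefixCharset is not set"
        )
    return link_prefix_charset


print(resolve_prefix_charset(False, None))   # None
print(resolve_prefix_charset(True, "a-z"))   # a-z
try:
    resolve_prefix_charset(True, None)
except LinkPrefixConfigError as err:
    print("error:", err)
```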

Esanders mentioned this in T263266: $linkPrefixCharset is too broad in Arabic. Sep 18 2020, 5:03 PM

Esanders updated the task description. Sep 18 2020, 5:24 PM

ppelberg closed subtask T263266: $linkPrefixCharset is too broad in Arabic as Resolved. Oct 23 2020, 1:21 AM

MBinder_WMF moved this task from Kanban Board to Next Quarter on the Editing-team board. Mar 17 2021, 10:17 PM

MBinder_WMF edited projects, added Editing-team; removed Editing-team (Kanban Board).

PirjanovNurlan updated the task description. Aug 6 2022, 6:41 PM