linguist: syntax highlighting does not work when \b is used with the Russian alphabet

Good afternoon. We are creating syntax highlighting for 1C:Enterprise, which supports Russian keywords. When we use \b(Если|If)\b on GitHub (tested at https://github-lightshow.herokuapp.com), the keywords are not highlighted. Writing it like this works: (?<=[^\w-а-яё\.]|^)(Если|If)(?=[^\w-а-яё\.]|$). The files are in UTF-8.
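The behaviour described here can be reproduced outside GitHub. The following is a minimal sketch in Python, not the Linguist or PCRE code path. It assumes that Python's re.ASCII flag is a fair stand-in for PCRE running without PCRE_UCP. Because Python's re module requires fixed-width lookbehind, the workaround is rewritten with negative lookarounds, which should accept the same positions. The uppercase Cyrillic range and the names KEYWORDS, ascii_boundary, unicode_boundary and explicit are assumptions added for illustration.

```python
import re

KEYWORDS = "Если|If"

# ASCII-only word semantics: Cyrillic letters are not "word" characters,
# so \b never fires around a purely Cyrillic keyword.
ascii_boundary = re.compile(rf"\b({KEYWORDS})\b", re.ASCII)

# Unicode-aware word semantics (Python's default for str patterns).
unicode_boundary = re.compile(rf"\b({KEYWORDS})\b")

# Workaround in the spirit of the issue: name the Cyrillic letters explicitly
# instead of relying on \b. "Not preceded / not followed by a word-like
# character" covers the start-of-text and end-of-text cases as well.
WORDLIKE = r"[\w\-а-яёА-ЯЁ.]"  # uppercase range is an added assumption
explicit = re.compile(rf"(?<!{WORDLIKE})({KEYWORDS})(?!{WORDLIKE})", re.ASCII)

if __name__ == "__main__":
    for line in ("Если x > 0 Тогда", "If x > 0 Then", "НеЕсли"):
        print(
            repr(line),
            bool(ascii_boundary.search(line)),
            bool(unicode_boundary.search(line)),
            bool(explicit.search(line)),
        )
```

The explicit form keeps working in ASCII-only mode, which is why it is usable as a grammar-side workaround while the engine flag is unavailable.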

About this issue

State: closed
Created 8 years ago
Comments: 48 (21 by maintainers)

Most upvoted comments

Ah right, that makes sense then. I wasn’t aware the flag imposed performance penalties (at least as far as the PCRE library is concerned).

I’d say your proposed solution is a good compromise. =) We could also only enable it for languages which use Unicode-sensitive grammars for highlighting, too (like 1C Enterprise’s one does). Personally, I feel that’s a more conservative approach; we could add a new option to languages.yml like unicode_pcre: true or something.

Alhadis on Oct 24, 2016

@worldbeater I doubt the PCRE version has anything to do with this. It’s running in ASCII-only mode, that’s all I know (I’m not staff). But if you check the manpage for pcresyntax(3), you’ll find documentation of the differences between regular expression engines (starting at the section “Backreferences”).

These are the likely discrepancies that are affecting the C# grammar (indeed, most TextMate grammars which use Oniguruma extensions):

| Syntax | Description | Engines |
| --- | --- | --- |
| `(?R)` | recurse whole pattern | All |
| `(?n)` | call subpattern by absolute number | All |
| `(?+n)` | call subpattern by relative number | All |
| `(?-n)` | call subpattern by relative number | All |
| `(?&name)` | call subpattern by name | Perl, PCRE |
| `(?P>name)` | call subpattern by name | Python |
| `\g<name>` | call subpattern by name | Oniguruma |
| `\g'name'` | call subpattern by name | Oniguruma |
| `\g<n>` | call subpattern by absolute number | Oniguruma |
| `\g'n'` | call subpattern by absolute number | Oniguruma |
| `\g<+n>` | call subpattern by relative number | PCRE |
| `\g'+n'` | call subpattern by relative number | PCRE |
| `\g<-n>` | call subpattern by relative number | PCRE |
| `\g'-n'` | call subpattern by relative number | PCRE |
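
As a rough illustration of how the table above could be put to use, here is a small Python sketch that scans a grammar pattern for subpattern-call syntax and reports the matching "Engines" entry. It is a hypothetical helper, not part of Linguist or any grammar tooling. The names SUBPATTERN_CALLS and find_subpattern_calls are invented for this example, and the scan is deliberately naive: it does not account for escaped backslashes or character classes.

```python
import re

# (detector regex, description, "Engines" column from the table above)
SUBPATTERN_CALLS = [
    (r"\(\?R\)", "recurse whole pattern", "All"),
    (r"\(\?[+-]\d+\)", "call subpattern by relative number", "All"),
    (r"\(\?\d+\)", "call subpattern by absolute number", "All"),
    (r"\(\?&\w+\)", "call subpattern by name", "Perl, PCRE"),
    (r"\(\?P>\w+\)", "call subpattern by name", "Python"),
    (r"\\g(?:<[+-]\d+>|'[+-]\d+')", "call subpattern by relative number", "PCRE"),
    (r"\\g(?:<\d+>|'\d+')", "call subpattern by absolute number", "Oniguruma"),
    (r"\\g(?:<[A-Za-z_]\w*>|'[A-Za-z_]\w*')", "call subpattern by name", "Oniguruma"),
]


def find_subpattern_calls(pattern: str) -> list[tuple[int, str, str, str]]:
    """Return (offset, matched text, description, engines) for each call found."""
    found = []
    for detector, description, engines in SUBPATTERN_CALLS:
        for m in re.finditer(detector, pattern):
            found.append((m.start(), m.group(0), description, engines))
    return sorted(found)


if __name__ == "__main__":
    # A balanced-parentheses pattern written with an Oniguruma-style named call.
    grammar_pattern = r"(?<expr>\((?:[^()]|\g<expr>)*\))"
    for offset, text, description, engines in find_subpattern_calls(grammar_pattern):
        print(f"{offset}: {text} -> {description} ({engines})")
```

A report like this only points at the spots a grammar author might want to review by hand; it makes no claim about what a given engine version actually accepts.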

Alhadis on Apr 13, 2018

If @vmg pings @arfon for an answer, I’m outta here.

Hahaha. I can answer this: we don’t currently use the PCRE_UCP flag because it’s a really significant performance degradation. I acknowledge this is not an ideal answer – I’m looking into the option of only enabling this flag on documents that we know contain extended Unicode characters, but we can’t enable it by default because it really slows down syntax highlighting. 😢

I don’t have an ETA but I promise we plan to look into improving highlighting for non-English, non-ASCII documents.
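
The idea of enabling Unicode-aware matching only for documents that need it can be sketched as follows. This is an illustration in Python under stated assumptions, not GitHub's highlighter: re.ASCII stands in for running PCRE without PCRE_UCP, the check is a plain "does the text contain any non-ASCII character" test, and the function names compile_for_document and highlight_keywords are hypothetical. It says nothing about the actual performance trade-off in PCRE.

```python
import re


def compile_for_document(pattern: str, text: str) -> re.Pattern:
    """Use ASCII-only matching unless the document contains non-ASCII text."""
    flags = re.ASCII if text.isascii() else 0
    return re.compile(pattern, flags)


def highlight_keywords(pattern: str, text: str) -> list[str]:
    """Return the keyword matches found with the per-document flag choice."""
    return [m.group(0) for m in compile_for_document(pattern, text).finditer(text)]


if __name__ == "__main__":
    keyword = r"\b(?:Если|If)\b"
    print(highlight_keywords(keyword, "If x > 0 Then"))     # pure ASCII document
    print(highlight_keywords(keyword, "Если x > 0 Тогда"))  # document with Cyrillic
```

The per-language switch suggested earlier in the thread would move the same decision from the document to the grammar; the sketch only covers the per-document variant.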

vmg on Oct 24, 2016

(Apologies in advance, I don’t intend to hijack this conversation.)

In C# too, we are struggling to get C# 7.2 syntax support because upstream has moved to an Oniguruma grammar. @damieng has chalked out some ideas at https://github.com/atom/language-csharp/issues/112#issuecomment-379094384 by which we can try to make progress.

Since Oniguruma is mentioned in this thread, how feasible is it for linguist to support multiple engines? Is it too huge an effort, or totally out of the scope of this project?

ghost on Apr 6, 2018

Don’t screw it up. The Russians are watching.

Alhadis on Oct 24, 2016

providing support for Oniguruma and other complicated regex syntaxes?

@worldbeater, PCRE supports every feature that Oniguruma does, it just uses different syntax. The perceived difference in regex support you see on GitHub is a consequence of grammar authors using Oniguruma-specific syntax instead of PCRE. This happens because Oniguruma is used by every editor which supports TextMate grammars – GitHub is a lone exception due to its use of PCRE.

If the engines were reversed, you’d be asking us to provide support for PCRE’s “complicated syntax” too, because the Oniguruma engine isn’t good enough.

Alhadis on Apr 9, 2018

One more ping, guys. (sorry)

nixel2007 on Feb 4, 2019

I think this issue should be reopened. The problem still exists.

nixel2007 on Nov 6, 2018

“Totally out of the scope of this project” 😀 This isn’t a limitation of Linguist. It’s a limitation (possibly even a design decision - I don’t know as it pre-dates my time at GitHub) of the parser used on GitHub.com that parses the grammars.

Just to clarify, is there any chance that someone on GitHub will come and replace the good old parser with a newer one, providing support for Oniguruma and other complicated regex syntaxes?

C# language highlighting on GitHub is absolutely disgusting, and it would be nice to figure out how this can be fixed in the near future. Other engines like BitBucket or GitLab provide much better highlighting. Sad.

Thanks in advance!

worldbeater on Apr 9, 2018

This is a good option. @vmg @arfon, is it possible to do that?

That may be the right approach but I’d rather wait until we have a solid plan for supporting something like a language flag before going ahead and adding this to the languages.yml file.

arfon on Oct 24, 2016

Personally, I feel that’s a more conservative approach; we could add a new option to languages.yml like unicode_pcre: true or something.

This is a good option. @vmg @arfon, is it possible to do that?

nixel2007 on Oct 24, 2016

@arfon / @bkeepers Do you guys know what it’ll take to get PCRE’s PCRE_UCP flag used on the site?

@vmg may be able to answer this

aroben on Oct 24, 2016

Okay… Is it easy to change one engine to another? Can we just ask the GitHub guys to make this change?

nixel2007 on Oct 23, 2016