linguist: syntax highlighting does not work when \b is used with the Russian alphabet

Good afternoon. I am creating syntax highlighting for 1C:Enterprise, which supports Russian keywords. When we use \b(Если|If)\b, the keywords are not highlighted on GitHub (https://github-lightshow.herokuapp.com). Writing it like this works: (?<=[^\w-а-яё\.]|^)(Если|If)(?=[^\w-а-яё\.]|$). The files are in UTF-8.
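The behaviour described above can be reproduced outside GitHub. The following is a minimal sketch in Python, using the re.ASCII flag as a stand-in for an engine running in ASCII-only mode. The sample line and the KEYWORD name are made up for illustration, and the workaround is adapted to negative lookarounds because Python's re module only accepts fixed-width lookbehind, so it is not the exact pattern from the issue.

```python
import re

line = "Если x > 0 Тогда"

# ASCII-only mode: "Е" is not a word character here, so there is no word
# boundary in front of it and the keyword is never found.
ascii_hit = re.search(r"\b(Если|If)\b", line, re.ASCII)

# Unicode-aware mode (Python's default for str patterns): \b sees Cyrillic
# letters as word characters and the keyword matches.
unicode_hit = re.search(r"\b(Если|If)\b", line)

# Workaround in the spirit of the issue: list the "word" characters
# explicitly instead of relying on \b. Negative lookarounds are used
# because Python's re requires fixed-width lookbehind.
KEYWORD = re.compile(
    r"(?<![\wа-яёА-ЯЁ.-])(Если|If)(?![\wа-яёА-ЯЁ.-])",
    re.ASCII,
)
explicit_hit = KEYWORD.search(line)

print(ascii_hit)
print(unicode_hit.group(1))
print(explicit_hit.group(1))
```

The explicit character class keeps working in ASCII-only mode because it no longer depends on what the engine considers a word character.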

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 48 (21 by maintainers)

Most upvoted comments

Ah right, that makes sense then. I wasn’t aware the flag imposed performance penalties (at least as far as the PCRE library is concerned).

I’d say your proposed solution is a good compromise. =) We could also enable it only for languages whose highlighting grammars are Unicode-sensitive (like the one for 1C Enterprise). Personally, I feel that’s a more conservative approach; we could add a new option to languages.yml like unicode_pcre: true or something.
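A per-language opt-in of this kind could look roughly like the sketch below. Everything in it is hypothetical: the LANGUAGES dictionary only imitates what entries with a unicode_pcre key might contain, regex_flags is an invented helper, and Python's re.ASCII flag stands in for running the engine without Unicode-aware character classes.

```python
import re

# Hypothetical per-language settings mirroring the proposed `unicode_pcre`
# option. This is not the real content of languages.yml.
LANGUAGES = {
    "1C Enterprise": {"unicode_pcre": True},
    "C": {},
}


def regex_flags(language):
    """Return regex flags for a language: Unicode-aware only when opted in."""
    options = LANGUAGES.get(language, {})
    if options.get("unicode_pcre", False):
        return 0          # Python's default for str patterns is Unicode-aware
    return re.ASCII       # cheaper ASCII-only classes for everything else


# The Cyrillic keyword is only found for the language that opted in.
opted_in = re.search(r"\bЕсли\b", "Если", regex_flags("1C Enterprise"))
default = re.search(r"\bЕсли\b", "Если", regex_flags("C"))
print(opted_in is not None, default is not None)
```

The point of the design is that languages which never opt in keep the faster ASCII-only behaviour.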

@worldbeater I doubt the PCRE version has anything to do with this. It’s running in ASCII-only mode, that’s all I know (I’m not staff). But if you check the manpage for pcresyntax(3), you’ll find documentation of the differences between regular expression engines (starting at the section “Backreferences”).

These are the likely discrepancies that are affecting the C# grammar (and, indeed, most TextMate grammars which use Oniguruma extensions):

| Syntax | Description | Engines |
|---|---|---|
| (?R) | recurse whole pattern | All |
| (?n) | call subpattern by absolute number | All |
| (?+n) | call subpattern by relative number | All |
| (?-n) | call subpattern by relative number | All |
| (?&name) | call subpattern by name | Perl, PCRE |
| (?P>name) | call subpattern by name | Python |
| \g<name> | call subpattern by name | Oniguruma |
| \g'name' | call subpattern by name | Oniguruma |
| \g<n> | call subpattern by absolute number | Oniguruma |
| \g'n' | call subpattern by absolute number | Oniguruma |
| \g<+n> | call subpattern by relative number | PCRE |
| \g'+n' | call subpattern by relative number | PCRE |
| \g<-n> | call subpattern by relative number | PCRE |
| \g'-n' | call subpattern by relative number | PCRE |
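To check which of these call syntaxes a grammar pattern actually uses, a small scanner along the following lines may help. It is only a sketch: CALL_SYNTAXES and find_subroutine_calls are names invented here, the engine labels are copied from the table rather than verified against each engine, and the scanner does not treat character classes specially.

```python
import re

# (regex matching one call syntax, "Engines" column from the table)
# Relative \g forms come before the generic \g forms so that \g<+1> is
# not swallowed by the name/number alternative.
CALL_SYNTAXES = [
    (r"\(\?R\)", "All"),
    (r"\(\?[+-]?\d+\)", "All"),
    (r"\(\?&\w+\)", "Perl, PCRE"),
    (r"\(\?P>\w+\)", "Python"),
    (r"\\g<[+-]\d+>|\\g'[+-]\d+'", "PCRE"),
    (r"\\g<\w+>|\\g'\w+'", "Oniguruma"),
]

# One named group per syntax, plus a fallback that consumes any other
# escape sequence so that "\\g<x>" (an escaped backslash) is not reported.
_TOKEN = re.compile(
    "|".join(f"(?P<s{i}>{rx})" for i, (rx, _) in enumerate(CALL_SYNTAXES))
    + r"|\\.",
    re.DOTALL,
)


def find_subroutine_calls(pattern):
    """Return (matched text, engines) for each subroutine call in pattern."""
    calls = []
    for match in _TOKEN.finditer(pattern):
        if match.lastgroup is not None:  # None means a plain escape matched
            engines = CALL_SYNTAXES[int(match.lastgroup[1:])][1]
            calls.append((match.group(), engines))
    return calls


# A balanced-parentheses pattern written with an Oniguruma-style call.
print(find_subroutine_calls(r"(?<expr>\((?:[^()]|\g<expr>)*\))"))
```

Running it over the patterns of a grammar gives a quick list of the constructs that may need rewriting for a different engine.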

If @vmg pings @arfon for an answer, I’m outta here.

Hahaha. I can answer this: we don’t currently use the PCRE_UCP flag because it causes a really significant performance degradation. I acknowledge this is not an ideal answer – I’m looking into the option of only enabling this flag on documents that we know contain extended Unicode characters, but we can’t enable it by default because it really slows down syntax highlighting. 😢

I don’t have an ETA but I promise we plan to look into improving highlighting for non-English, non-ASCII documents.
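The idea of enabling Unicode-aware matching only for documents that contain non-ASCII characters can be sketched as follows. This is an analogy rather than a description of GitHub's highlighter: compile_for is a hypothetical helper, and Python's re.ASCII flag plays the role of leaving PCRE_UCP switched off.

```python
import re


def compile_for(pattern, text):
    """Compile pattern with ASCII-only classes unless text needs Unicode."""
    # str.isascii() is a cheap scan; only documents with non-ASCII
    # characters pay for Unicode-aware \w and \b.
    flags = re.ASCII if text.isascii() else 0
    return re.compile(pattern, flags)


KEYWORD = r"\b(Если|If)\b"

english = "If x > 0 Then"
russian = "Если x > 0 Тогда"

print(compile_for(KEYWORD, english).search(english).group(1))
print(compile_for(KEYWORD, russian).search(russian).group(1))
```

Pure-ASCII documents keep the faster path, while documents with Cyrillic keywords still match.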

(Apologies in advance, I don’t intend to hijack this conversation.)

In C# too, we are struggling to get C# 7.2 syntax support because upstream has moved to an Oniguruma grammar. @damieng has sketched out some ideas at https://github.com/atom/language-csharp/issues/112#issuecomment-379094384 by which we can try to make progress.

Since Oniguruma is mentioned in this thread, how feasible is it for linguist to support multiple engines? Is it too large an effort, or totally out of the scope of this project?

Don’t screw it up. The Russians are watching.

“providing support for Oniguruma and other complicated regex syntaxes?”

@worldbeater, PCRE supports every feature that Oniguruma does, it just uses different syntax. The perceived difference in regex support you see on GitHub is a consequence of grammar authors using Oniguruma-specific syntax instead of PCRE. This happens because Oniguruma is used by every editor which supports TextMate grammars – GitHub is a lone exception due to its use of PCRE.

If the engines were reversed, you’d be asking us to provide support for PCRE’s “complicated syntax” too, because the Oniguruma engine isn’t good enough.

One more ping, guys. (sorry)

I think this issue should be reopened. The problem still exists.

“Totally out of the scope of this project” 😀 This isn’t a limitation of Linguist. It’s a limitation (possibly even a design decision - I don’t know as it pre-dates my time at GitHub) of the parser used on GitHub.com that parses the grammars.

Just to clarify, is there any chance that someone on GitHub will come and replace the good old parser with a newer one, providing support for Oniguruma and other complicated regex syntaxes?

C# language highlighting on GitHub is absolutely disgusting and it would be nice to figure out how this can be fixed in the near future. Other platforms like Bitbucket or GitLab provide much better highlighting, sadly.

Thanks in advance!

“this is a good variant. @vmg @arfon is it possible to make that thing?”

That may be the right approach but I’d rather wait until we have a solid plan for supporting something like a language flag before going ahead and adding this to the languages.yml file.

“Personally, I feel that’s a more conservative approach; we could add a new option to languages.yml like unicode_pcre: true or something.”

this is a good variant. @vmg @arfon is it possible to make that thing?

@arfon / @bkeepers Do you guys know what it’ll take to get PCRE’s PCRE_UCP flag used on the site?

@vmg may be able to answer this

Okay… Is it easy to change one engine to another? Can we just ask the GitHub guys to make this change?