silverstripe-framework: BUG: String tokenizing methods return garbled content with some UTF-8 characters

Problem description

We stumbled across this, when trying to json_encode some content from SilverStripe and the method suddenly returned nothing (although json_encode should be capable of encoding any proper UTF-8 string).

The issue only arises when using methods that work on portions of strings, such as HTMLText::FirstSentence, HTMLText::Summary etc. When running content through some of these methods, you might end up with invalid UTF-8.

Affected Versions

SilverStripe Framework 2.x - 4.x String Fieldtype classes (Text, HTMLText,…), Convert class, maybe the shortcode-parser.

Severity

Edge-case, but has a high WTF-factor when it’s happening.

Source of the issue

The methods that work on substrings use methods that are not multibyte safe. In my tests, I found out that preg_replace and preg_split on whitespace (/\s+/) can split a multibyte character in half, which will result in invalid UTF-8.

The problem only seems to occur if your system uses an UTF-8 version of a locale.

Illustration of the problem

// Must set an UTF-8 variant of a locale
setlocale(LC_ALL, "en_US.UTF-8");
// Generate a non-breaking space, U+00A0
$str = html_entity_decode('&nbsp;',ENT_COMPAT, 'UTF-8' );
// replace whitespace
echo preg_replace('/\s+/', ' ', $str);

This will output non-valid UTF-8, you’ll see the dreaded: �

Explanation

When using an UTF-8 locale, preg_replace seems to detect the byte A0 as white-space and replace it with the byte 20. So the byte sequence C2A0 (which is the non-breaking space), becomes C220, which is not a valid UTF-8 character. It’s very likely, that other UTF-8 characters could be affected by this problem as well (not only the non-breaking space).

Possible fix

All instances of preg_replace('/\s+/', …) and preg_split('/\s+/',…) should be replaced with the multibyte-safe variants, mb_ereg_replace and mb_split. Other string operations should be looked into, preferably all string replacements and regular expressions should operate in a multibyte-safe manner.

About this issue

Original URL
State: closed
Created 7 years ago
Comments: 18 (18 by maintainers)

Most upvoted comments

@silverstripe/core-team I’d like some feedback on which approach is best…

We currently require mbstring extension and so using the mbstring functions shouldn’t be a no-go.

Do we want to:

❤️ Add u flag to all our preg_* functions; OR
🎉 Move current preg_* functions to mbstring equivalents?

Further reading, using the u flag on preg_* methods can still be problematic when using PREG_OFFSET_CAPTURE (https://stackoverflow.com/questions/1725227/preg-match-and-utf-8-in-php/1725329#1725329)

dhensby on Jul 5, 2017

if we can patch and fix the issue with the /u flag and it solves your problem, then I see no reason against doing it, even if it’s an edge case.

dhensby on Jul 6, 2017

Output on CentOS 7.3:

Locale en_US.UTF-8 generates VALID UTF-8. Output: à
PHP: 7.1.6 / PCRE: 8.38 2015-11-23 / Linux

dhensby on Jul 6, 2017

@kinglozzer I’ve brought the impact down because this is quite an edge case and I feel we’d have heard about this a lot more if it were a high impact problem

dhensby on Jul 5, 2017