silverstripe-framework: BUG: String tokenizing methods return garbled content with some UTF-8 characters
Problem description
We stumbled across this, when trying to json_encode
some content from SilverStripe and the method suddenly returned nothing (although json_encode
should be capable of encoding any proper UTF-8 string).
The issue only arises when using methods that work on portions of strings, such as HTMLText::FirstSentence
, HTMLText::Summary
etc. When running content through some of these methods, you might end up with invalid UTF-8.
Affected Versions
SilverStripe Framework 2.x - 4.x
String Fieldtype classes (Text
, HTMLText
,…), Convert
class, maybe the shortcode-parser.
Severity
Edge-case, but has a high WTF-factor when it’s happening.
Source of the issue
The methods that work on substrings use methods that are not multibyte safe. In my tests, I found out that preg_replace
and preg_split
on whitespace (/\s+/
) can split a multibyte character in half, which will result in invalid UTF-8.
The problem only seems to occur if your system uses an UTF-8 version of a locale.
Illustration of the problem
// Must set an UTF-8 variant of a locale
setlocale(LC_ALL, "en_US.UTF-8");
// Generate a non-breaking space, U+00A0
$str = html_entity_decode(' ',ENT_COMPAT, 'UTF-8' );
// replace whitespace
echo preg_replace('/\s+/', ' ', $str);
This will output non-valid UTF-8, you’ll see the dreaded: �
Explanation
When using an UTF-8 locale, preg_replace
seems to detect the byte A0
as white-space and replace it with the byte 20
. So the byte sequence C2A0
(which is the non-breaking space), becomes C220
, which is not a valid UTF-8 character. It’s very likely, that other UTF-8 characters could be affected by this problem as well (not only the non-breaking space).
Possible fix
All instances of preg_replace('/\s+/', …)
and preg_split('/\s+/',…)
should be replaced with the multibyte-safe variants, mb_ereg_replace
and mb_split
. Other string operations should be looked into, preferably all string replacements and regular expressions should operate in a multibyte-safe manner.
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 18 (18 by maintainers)
@silverstripe/core-team I’d like some feedback on which approach is best…
We currently require
mbstring
extension and so using the mbstring functions shouldn’t be a no-go.Do we want to:
u
flag to all ourpreg_*
functions; ORpreg_*
functions to mbstring equivalents?Further reading, using the
u
flag onpreg_*
methods can still be problematic when usingPREG_OFFSET_CAPTURE
(https://stackoverflow.com/questions/1725227/preg-match-and-utf-8-in-php/1725329#1725329)if we can patch and fix the issue with the
/u
flag and it solves your problem, then I see no reason against doing it, even if it’s an edge case.Output on CentOS 7.3:
@kinglozzer I’ve brought the impact down because this is quite an edge case and I feel we’d have heard about this a lot more if it were a high impact problem