symphonycms: Input field: control characters must be stripped
As mentioned in #2024, control characters in Input fields will currently be saved to the database and also make their way to the handle of the field. Thus certain characters can even break pages.
This is Pandora’s box.
In XML 1.0, characters under U+0020 (except TAB, CR and LF) must not occur. But as I posted in #2024, there is the DELETE character as well (U+007F, 7-bit code 1111111, a historical control character). Many people say it’s a control character and should be stripped, but at the same time it’s a legal character in ISO/IEC 10646 which is part of the XML 1.0 specs. There are even more control codes that are valid in XML, see http://en.wikipedia.org/wiki/C0_and_C1_control_codes…
As you see, theoretically we must distinguish:
- “unwanted” characters (for a special field)
- illegal characters (in XML)
The second is simple, because the XML specs tell us:
… Consequently, XML processors must accept any character in the range specified for Char.
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
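To make the Char production concrete, here is a minimal PHP sketch that tests whether a string consists only of characters from that range. The function name isValidXmlString is hypothetical (it is not part of Symphony), and the sketch assumes that PCRE was built with UTF-8 support so that the /u modifier works.

```php
<?php
// Hypothetical helper, not part of Symphony: returns true if every character
// of $data is allowed by the XML 1.0 "Char" production quoted above.
function isValidXmlString($data)
{
    // /u makes PCRE treat pattern and subject as UTF-8.
    // \z anchors at the very end of the subject.
    $pattern = '/^[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*\z/u';

    // preg_match() returns false for malformed UTF-8; that is treated as invalid here.
    return preg_match($pattern, $data) === 1;
}

var_dump(isValidXmlString("Lorem\tipsum\n")); // TAB and LF are allowed
var_dump(isValidXmlString("Lorem\x00ipsum")); // NUL is not allowed
```

Note that DEL (U+007F) passes this check, because it lies inside the range #x20-#xD7FF.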
Given this, one might discuss whether illegal characters should be removed only immediately before passing data to the XSLT layer, which means “save whatever possible, but strip characters for XML purposes when reading data”. However, let’s keep things simple. I strongly assume that the above definition of valid characters would be useful for any templating layer, not only XML/XSLT. So our first decision might be:
- Shall we allow all characters from the XML specs? (Although, theoretically, there might still be problems with our current transliterations when building handles to be used in URLs.)
- Or shall we try and strip even more/different characters that we assume to be useless or dangerous?
The MySQL issue
But: even with the character set set to UTF-8 (as Symphony does it), MySQL cannot save any 4-byte characters, because MySQL’s utf8 character set stores at most three bytes per character. For more information on that matter please read:
- https://mathiasbynens.be/notes/mysql-utf8mb4
- http://stackoverflow.com/questions/8491431/remove-4-byte-characters-from-a-utf-8-string/16902461
So if we pass through only characters according to the XML specs, things can still break. (This happened to me once when I accidentally imported a 4-byte emoticon character from Facebook. Since then I strip 4-byte characters in the importer, before Symphony attempts to save these characters to the database. Of course, certain emoticons are lost.)
So if we talk about “valid characters” in Symphony, we come to the second decision:
- Shall we change Symphony’s database character set (again) to really work with all “valid” characters (in XML)?
- Or shall we strip 4-byte characters?
- Or shall we simply care for control characters and disregard the 4-byte issue?
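Whichever of these options is chosen, it may help to be able to tell whether a value contains 4-byte characters at all. The following is only a sketch: both function names are hypothetical, and it assumes valid UTF-8 input and PCRE with UTF-8 support. In UTF-8, exactly the code points U+10000 to U+10FFFF take four bytes.

```php
<?php
// Hypothetical helper: true if $data contains at least one character
// that needs four bytes in UTF-8 (U+10000 to U+10FFFF).
function containsFourByteCharacters($data)
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $data) === 1;
}

// Hypothetical helper: removes those characters and keeps everything else.
function stripFourByteCharacters($data)
{
    return preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $data);
}

// "\xF0\x9F\x98\x8D" is the UTF-8 encoding of U+1F60D, a 4-byte emoticon.
$value = "Lorem \xF0\x9F\x98\x8Dipsum";
var_dump(containsFourByteCharacters($value));
echo stripFourByteCharacters($value), PHP_EOL;
```

A check like this could be used to warn about affected entries before deciding between stripping and a change of the database character set.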
The field issue
Not all fields must behave in the same way. For example, a TAB (a control character) is an unwanted character in an Input field, but perfectly fine for a Textarea field. So our third decision may be:
- Shall we build two stripping layers (one on the field level, one in the core routines for saving data)?
- Or shall we do it all in every field?
As you see, the distinction at the top of my post actually raises the right questions. Which characters are invalid in XML? Which characters are “invalid for a field”? Is the latter the same as “unwanted characters”?
How to strip characters
Just for reference, I am listing some regex examples.
Strip control characters only, excluding DEL:
$data = preg_replace('/[\x00-\x1F]/', '', $data);
Strip control characters only, including DEL:
$data = preg_replace('/[\x00-\x1F\x7F]/', '', $data);
Strip 4-byte characters (see http://stackoverflow.com/a/16902461):
$data = preg_replace('/[\xF0-\xF4].../s', '', $data);
About this issue
- State: open
- Created 9 years ago
- Comments: 59 (59 by maintainers)
You mean PHP unicode support? Yes, I suggest to require this. It makes things much easier.
I thought about “not breaking XML (processing)”, and I assumed that it could/should be done “late”, i.e. when a text node gets added to XML. Simply because the “allowed character range” is XML specific. (I assumed that the restriction might be different if Symphony was using other templating solutions.)
I have not included any (opinionated) whitespace handling (like I myself do with the Input field). I still think that this would be the task of the field itself, or a developer might hack it into the event(s). I also do not intend to “protect the database/system from unwanted characters” here. This would be a different story (if ever needed). All I try is to “not break XML”.
My test string
😍 is a 4-byte character in UTF-8. Plus, there were two DEL characters between “ipsu” and “m”! (Obviously GitHub replaces them, just like I want to do in Symphony!)
Search with PHP regex
Look at the spec again:
The simplest search regex seems to be:
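A search pattern of this kind can be sketched as a negated character class over the Char ranges. This is one possible form, not necessarily the exact regex meant here; the variable name and the countIllegalXmlCharacters helper are hypothetical, and PCRE with UTF-8 support (the /u modifier) is assumed.

```php
<?php
// Matches every single character that the XML 1.0 Char production does NOT allow.
$illegalInXml = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

// Hypothetical helper: counts how many characters of $data match $pattern.
function countIllegalXmlCharacters($pattern, $data)
{
    return preg_match_all($pattern, $data, $matches);
}

// NUL (\x00) and UNIT SEPARATOR (\x1F) are both illegal in XML 1.0.
echo countIllegalXmlCharacters($illegalInXml, "a\x00b\x1Fc"), PHP_EOL; // prints 2
```

With this pattern, DEL is not matched, because U+007F lies inside the allowed range #x20-#xD7FF.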
And if we add DEL a.k.a. \x7F (because we also consider this an unwanted control character, even if it is allowed by the spec), we have:
The replacement character in PHP regex
In order to output the REPLACEMENT CHARACTER in PHP 7+, we might use the “Unicode codepoint escape” syntax "\u{FFFD}". However, this is not possible in PHP 5; here the best way seems to be json_decode('"\uFFFD"').
Putting it together
Here’s my test file:
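As a rough, self-contained sketch of such a test file (a reconstruction under stated assumptions, not necessarily the original code): the pattern is the negated Char class with DEL excluded from the allowed range, the replacement is built with json_decode as suggested above so that it also works in PHP 5, and the variable names are hypothetical. In this sketch the two DEL characters are written as the escape \x7F so that they survive copy and paste.

```php
<?php
// Test string: U+1F60D as raw UTF-8 bytes, then two DEL characters
// between "ipsu" and "m", written as escapes.
$input = "Lorem \xF0\x9F\x98\x8Dipsu\x7F\x7Fm";

// Everything outside the XML 1.0 Char ranges, plus DEL (U+007F):
// the allowed range #x20-#xD7FF is split into #x20-#x7E and #x80-#xD7FF.
$pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{80}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

// U+FFFD REPLACEMENT CHARACTER; json_decode() works in PHP 5 and PHP 7+
// (assumes the json extension is available).
$replacement = json_decode('"\uFFFD"');

// Each unwanted character is replaced one-for-one.
$output = preg_replace($pattern, $replacement, $input);

echo $output, PHP_EOL;
```

The 4-byte emoticon is kept, and each DEL is replaced by one REPLACEMENT CHARACTER.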
Again, GitHub replaces the DEL characters here. So if you want to use the code, you must re-add these control characters.
The output of this file will be Lorem 😍ipsu��m. So far, so good.
Symphony already has it, partly
I found out that since 2.3, Symphony already has XMLElement::stripInvalidXMLCharacters. But this function:
Fortunately, the function is only used in FrontendPage::sanitizeParameter, so there’s probably no security issue caused by stripping the characters.
I suggest to either improve this function or add another one. If we keep it like it is, it should be renamed (because it does not do what the name tells us).
Adding the regex replacement to Symphony
Now here comes the problem.
I have not found a simple way to add the replacement to frontend text nodes only, without also changing the string that is displayed in the backend’s textarea after saving (which means that when saving a second time, the replacement character will be saved). It would be even more problematic with frontend forms, which require data source output. Honestly, I think it is not possible at all to keep a control character in the database without ever outputting it to XML, because Symphony uses XML internally.
So I guess that the idea to keep the input “as it is” and apply the character range filtering for output only has failed.
Still, I hope that my findings will help us to come up with a (simple) solution. Here are the questions:
preg_replace? (Suggested answer: “Yes”.)