symfony: [Serializer][XmlEncoder] Don't wrap content in CDATA

Q                 A
Bug report?       no
Feature request?  yes
BC Break report?  unsure
RFC?              yes
Symfony version   all

It would be good to improve XmlEncoder so that it does not wrap content in a CDATA section automatically, but instead provides some means for the user to direct the encoder to wrap specified content in a CDATA section.

The following function currently determines whether or not to wrap:-

/**
 * Checks if a value contains any characters which would require CDATA wrapping.
 *
 * @param string $val
 *
 * @return bool
 */
private function needsCdataWrapping($val)
{
    return 0 < preg_match('/[<>&]/', $val);
}

The problem with this function is that the “<”, “>” and “&” characters are a poor signal that the content should be wrapped in a CDATA section. The XML spec is somewhat clear about this:-

The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively.

We should interpret this as meaning that the aforementioned characters, when they appear in the textual content of an element, must be escaped as entity or character references. We should not interpret it as meaning that such content should be wrapped in a CDATA section.
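As a rough illustration of that reading, here is a minimal sketch that uses PHP's bundled DOM extension directly rather than XmlEncoder. The element name and the sample string are made up for the example. A plain text node is enough to carry these characters, because the serialiser escapes them on output:

```php
<?php
// Sketch only: shows that a DOM text node handles "<", ">" and "&"
// without any CDATA section. Element name and text are illustrative.
$doc = new DOMDocument('1.0', 'UTF-8');
$root = $doc->appendChild($doc->createElement('note'));

// createTextNode() stores the raw string; escaping happens in saveXML().
$original = 'Fish & chips <b>today</b>';
$root->appendChild($doc->createTextNode($original));

$xml = $doc->saveXML($root);
echo $xml, PHP_EOL;

// Parsing the output again yields the original string unchanged.
$check = new DOMDocument();
$check->loadXML($xml);
$roundTripped = $check->documentElement->textContent;
```

The round trip at the end is the point: a consumer of the document sees the same text whether or not a CDATA section was used.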

That same doc has a good example of when to use a CDATA section:-

<![CDATA[<greeting>Hello, world!</greeting>]]>

There should be a way to serialise to a CDATA section for such a use case as that example, but this decision should not be taken by the encoder in the manner done currently.

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 23 (13 by maintainers)

Most upvoted comments

How about:-

  • Provide a serialisation context key that XmlEncoder can use to decide whether or not to wrap content in a CDATA section. BC could be maintained in v3 and v4 by treating the context key as true by default:-

      private function needsCdataWrapping(/*string $val*/): bool
      {
          return !isset($this->context['xml_cdata']) || true === $this->context['xml_cdata'];
      }
    
  • In v3 to v5, allow a custom Normalizer to control which content should be wrapped in CDATA, perhaps by using an object that wraps content:-

    // MyNormalizer.php
    public function normalize($object, $format = null, array $context = array())
    {
        return array(
            '#' => new CdataWrapper($object->getMyTextualContent()),
        );
    }
    

    so that XmlEncoder::selectNodeType() can do:-

    } elseif ($val instanceof CdataWrapper) {
        return $this->appendCData($node, $val->getContent());
    } elseif (is_string($val) && $this->needsCdataWrapping(/*$val*/)) {
        return $this->appendCData($node, $val);
    } elseif (is_string($val)) {
        return $this->appendText($node, $val);
    } elseif (...
    
  • Finally, in v5, treat xml_cdata as false by default so that XmlEncoder creates normal, escaped content unless instructed otherwise:-

     private function needsCdataWrapping(/*string $val*/): bool
     {
         return isset($this->context['xml_cdata']) && true === $this->context['xml_cdata'];
     }
    

All of this means that:-

  • BC can be maintained
  • Users of the v3 and v4 API are able to say no to CDATA
  • Users of v5 API can still say yes to CDATA
  • Implementers of custom Normalizers can have control over exactly which content is wrapped in CDATA
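
To make the proposal concrete, here is a self-contained sketch built on PHP's DOM extension. None of it is existing Symfony API: `CdataWrapper`, `appendValue()` and `render()` are hypothetical names, and `xml_cdata` is only the context key proposed above. One further assumption: the sketch keeps the current `/[<>&]/` check alongside the context key, so that the default output matches what the encoder produces today.

```php
<?php
// Hypothetical value object a Normalizer could return to request CDATA.
final class CdataWrapper
{
    private $content;

    public function __construct(string $content)
    {
        $this->content = $content;
    }

    public function getContent(): string
    {
        return $this->content;
    }
}

// Simplified stand-in for the string branches of selectNodeType().
// $context mimics the serialisation context (assumed shape).
function appendValue(DOMNode $node, $val, array $context = []): void
{
    $doc = $node->ownerDocument;
    // v3/v4 behaviour from the proposal: the key counts as true when absent.
    $cdataEnabled = !isset($context['xml_cdata']) || true === $context['xml_cdata'];

    if ($val instanceof CdataWrapper) {
        // An explicit request always wins, whatever the context says.
        $node->appendChild($doc->createCDATASection($val->getContent()));
    } elseif (is_string($val) && $cdataEnabled && 0 < preg_match('/[<>&]/', $val)) {
        $node->appendChild($doc->createCDATASection($val));
    } elseif (is_string($val)) {
        // Plain text node; the serialiser escapes special characters.
        $node->appendChild($doc->createTextNode($val));
    } else {
        throw new InvalidArgumentException('Only strings and CdataWrapper are handled in this sketch.');
    }
}

// Helper for the example: serialise one value inside an <item> element.
function render($val, array $context = []): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $node = $doc->appendChild($doc->createElement('item'));
    appendValue($node, $val, $context);

    return $doc->saveXML($node);
}

echo render('x < y'), PHP_EOL;                         // default: wrapped in CDATA
echo render('x < y', ['xml_cdata' => false]), PHP_EOL; // opted out: escaped text
echo render(new CdataWrapper('<greeting>Hello, world!</greeting>'), ['xml_cdata' => false]), PHP_EOL;
```

With the key set to false, only the value wrapped in `CdataWrapper` ends up in a CDATA section, which is the level of control the last bullet asks for.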

Thank you @Simperfit and @nicolas-grekas for the reviews. Here are some references for the notion that XmlEncoder should only ever create a CDATA section when explicitly instructed to do so:-

You should almost never need to use CDATA Sections.

xml.silmaril.ie

CDATA sections are commonly used for scripting language content and sample XML and HTML content.

msdn.microsoft.com

The two options are almost exactly the same. Here are your two choices:

    <html>This is &lt;b&gt;bold&lt;/b&gt;</html>
    <html><![CDATA[This is <b>bold</b>]]></html>

stackoverflow.com
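
That equivalence is easy to check. The sketch below parses both forms with PHP's DOM extension and compares the resulting text; `textOf()` is a helper name invented for the example:

```php
<?php
// The two serialisations quoted above.
$escaped = '<html>This is &lt;b&gt;bold&lt;/b&gt;</html>';
$cdata   = '<html><![CDATA[This is <b>bold</b>]]></html>';

// Hypothetical helper: parse a document and return the root element's text.
function textOf(string $xml): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    return $doc->documentElement->textContent;
}

// Both forms decode to the same string for any conforming parser.
var_dump(textOf($escaped) === textOf($cdata)); // bool(true)
```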

CDATA sections are useful for writing XML code as text data within an XML document. For example, if one wishes to typeset a book with XSL explaining the use of an XML application, the XML markup to appear in the book itself will be written in the source file in a CDATA section. … That is why CDATA sections should be used only for XML documents that are keyed in manually, where they contain code or XML as data. Enclosing these in a CDATA section greatly improves readability. But when XML is generated programmatically, CDATA sections should be avoided.

wikipedia.org