slate: html-serializer doesn't work with nested blocks

Do you want to request a feature or report a bug?

A bug

What’s the current behavior?

import HtmlSerializer from 'slate-html-serializer'

// Map HTML tag names to Slate block types.
const BLOCK_TAGS = {
  blockquote: 'quote',
  p: 'paragraph',
  // div: 'div'
}

const rules = [
  {
    deserialize(el, next) {
      const type = BLOCK_TAGS[el.tagName.toLowerCase()]
      if (!type) return
      return {
        kind: 'block',
        type: type,
        nodes: next(el.childNodes)
      }
    }
  }
]

const pureHtml = '<blockquote><div>a text<blockquote>inner quote</blockquote></div></blockquote>'
const initialValue = new HtmlSerializer({ rules: rules }).deserialize(pureHtml);

Only the text “a text” is rendered; the inner quote is missing. See https://jsfiddle.net/oj53q1n2/26/

What’s the expected behavior?

We should see both the text and the inner quote.

About this issue

State: closed
Created 6 years ago
Comments: 15 (11 by maintainers)

Most upvoted comments

I’d be curious to hear more about why that validation rule exists at all. While I agree that there’s a conceptual correctness to it, there’s no such restriction in HTML. In addition to the example provided by @nghuuphuoc, the following is valid HTML that is “unrepresentable” in Slate.

<ul>
  <li>
    Text Content
    <ul>
      <li>Nested Text Content</li>
    </ul>
  </li>
</ul>

The implicit behavior of silently destroying content feels like it needs a strong justification and prominent documentation.

+12

pvande on Apr 11, 2018

After a bit more digging, the problem appears to be this constraint in the core Slate schema (which is enforced by Value.fromJSON inside the HTML deserializer):

/**
 * Only allow block nodes or inline and text nodes in blocks.
 *
 * @type {Object}
 */

{
  validateNode: function validateNode(node) {
    if (node.object != 'block') return;
    var first = node.nodes.first();
    if (!first) return;
    var objects = first.object == 'block' ? ['block'] : ['inline', 'text'];
    var invalids = node.nodes.filter(function (n) {
      return !objects.includes(n.object);
    });
    if (!invalids.size) return;

    return function (change) {
      invalids.forEach(function (child) {
        change.removeNodeByKey(child.key, { normalize: false });
      });
    };
  }
},

If the input HTML contains <div> tags and the serializer rules convert those divs to blocks rather than ignoring them, it’s easy to create a structure that schema validation rips apart after parsing. Slate does not allow a block to have both block children and text / inline children, and that mix is very common inside a <div>.
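
To illustrate the effect of that constraint on the example from the report, here is a minimal sketch that mimics the rule on plain JavaScript objects. It does not use Slate itself: the node shape and the function name applyMixedContentRule are assumptions made for illustration, and the real schema operates on immutable Slate nodes through a change object.

```javascript
// Hypothetical stand-in for the core rule, working on plain objects.
// A block keeps only children of the same category as its first child:
// either all blocks, or only inlines and text.
function applyMixedContentRule(node) {
  if (node.object !== 'block' || node.nodes.length === 0) return node
  const first = node.nodes[0]
  const allowed = first.object === 'block' ? ['block'] : ['inline', 'text']
  const kept = node.nodes
    .filter(child => allowed.includes(child.object))
    .map(applyMixedContentRule)
  return { ...node, nodes: kept }
}

// Roughly what the reported HTML could parse to when <div> has no rule:
// the outer quote holds a text node followed by a nested quote block.
const parsedQuote = {
  object: 'block',
  type: 'quote',
  nodes: [
    { object: 'text', text: 'a text' },
    {
      object: 'block',
      type: 'quote',
      nodes: [{ object: 'text', text: 'inner quote' }],
    },
  ],
}

const normalizedQuote = applyMixedContentRule(parsedQuote)
console.log(JSON.stringify(normalizedQuote.nodes))
```

Because the first child is a text node, the nested quote block is the one that gets dropped, which matches the behavior described in the report.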

My solution is here: https://gist.github.com/bengotow/f5408e9cb543f22409d033df58e34579. Before running the HTML deserializer, I traverse the DOM tree and ensure that divs, blockquotes, and other nodes converted to Slate blocks contain either text + inline children OR block children, wrapping children into blocks as necessary. Curious whether this would be welcomed as default behavior in some way (cc @ianstormtaylor).

+10

bengotow on Jan 8, 2018

Hey @kornil! After a bit more polish, I actually ended up switching to an approach that adds wrapping blocks, etc. to the resulting Slate graph before passing it through the normalizer, rather than changing the HTML before converting it. I think that’s preferable because it works with any HTML <> Slate mapping rather than relying on an assumed set of conversions.
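
The idea of adding wrapping blocks to the deserialized tree can be sketched as follows. This is not the code from the linked gist or from Mailspring; it is a simplified illustration on plain objects, and the function name wrapMixedChildren and the default wrapper type 'paragraph' are assumptions.

```javascript
// Hypothetical helper: when a block mixes block children with text or
// inline children, wrap each run of text/inline children in a new block
// so that every block ends up with one category of children.
function wrapMixedChildren(node, wrapperType = 'paragraph') {
  if (node.object !== 'block') return node
  const children = node.nodes.map(child => wrapMixedChildren(child, wrapperType))
  const hasBlock = children.some(child => child.object === 'block')
  const hasLeaf = children.some(child => child.object !== 'block')
  if (!(hasBlock && hasLeaf)) return { ...node, nodes: children }

  const nodes = []
  let run = []
  const flush = () => {
    if (run.length) nodes.push({ object: 'block', type: wrapperType, nodes: run })
    run = []
  }
  for (const child of children) {
    if (child.object === 'block') {
      flush()
      nodes.push(child)
    } else {
      run.push(child)
    }
  }
  flush()
  return { ...node, nodes }
}

const mixedQuote = {
  object: 'block',
  type: 'quote',
  nodes: [
    { object: 'text', text: 'a text' },
    {
      object: 'block',
      type: 'quote',
      nodes: [{ object: 'text', text: 'inner quote' }],
    },
  ],
}

const wrappedQuote = wrapMixedChildren(mixedQuote)
console.log(JSON.stringify(wrappedQuote, null, 2))
```

After this pass the outer quote contains two blocks (a paragraph holding “a text” and the inner quote), so a rule that forbids mixed children has nothing to remove.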

You can find the latest code I’m using here: https://github.com/Foundry376/Mailspring/blob/master/app/src/components/composer-editor/conversion.jsx#L172. I also wrote code to join adjacent text nodes rather than letting Slate do it during normalization, which sped things up a LOT because it’s a simple transform and Slate “assumes the worst” when it runs a normalization step (and spends time re-finding the nodes, etc.).
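
Joining adjacent text nodes ahead of normalization might look something like the sketch below. It is not taken from the linked file; the plain-object node shape, the optional marks array, and the name joinAdjacentText are assumptions for illustration. Text nodes are merged only when their marks match, so formatting is not lost.

```javascript
// Hypothetical helper: merge neighboring text nodes that carry the same
// marks, recursing into any node that has children.
function joinAdjacentText(nodes) {
  const out = []
  for (const node of nodes) {
    const prev = out[out.length - 1]
    const sameMarks =
      prev !== undefined &&
      JSON.stringify(prev.marks || []) === JSON.stringify(node.marks || [])
    if (prev && prev.object === 'text' && node.object === 'text' && sameMarks) {
      // Replace the previous text node with the concatenation.
      out[out.length - 1] = { ...prev, text: prev.text + node.text }
    } else if (node.nodes) {
      out.push({ ...node, nodes: joinAdjacentText(node.nodes) })
    } else {
      out.push(node)
    }
  }
  return out
}

const joined = joinAdjacentText([
  { object: 'text', text: 'a ' },
  { object: 'text', text: 'text' },
  {
    object: 'block',
    type: 'quote',
    nodes: [
      { object: 'text', text: 'inner ' },
      { object: 'text', text: 'quote' },
    ],
  },
  { object: 'text', text: 'bold', marks: ['bold'] },
  { object: 'text', text: ' plain' },
])
console.log(JSON.stringify(joined, null, 2))
```

The pass is a single linear walk over each child list, which is the kind of simple transform the comment describes.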

bengotow on Mar 11, 2018

Hey folks, this is by design. Slate does not allow you to have mixed inline and block level content in the same node. A block can either contain all block nodes, or it can contain inline and text nodes. This is enforced in the core editor-level schema.
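
Since content that violates this rule is removed without warning, it can help to detect mixed blocks before handing a tree to Slate. The following is a minimal sketch on plain objects using the nested list from the earlier comment; findMixedBlocks and the node shape are assumptions for illustration, not part of Slate.

```javascript
// Hypothetical check: return the index paths of every block that has
// both block children and text/inline children.
function findMixedBlocks(node, path = []) {
  if (node.object !== 'block') return []
  const hasBlock = node.nodes.some(child => child.object === 'block')
  const hasLeaf = node.nodes.some(child => child.object !== 'block')
  const here = hasBlock && hasLeaf ? [path] : []
  return node.nodes.reduce(
    (found, child, i) => found.concat(findMixedBlocks(child, path.concat(i))),
    here
  )
}

// <ul><li>Text Content<ul><li>Nested Text Content</li></ul></li></ul>
const nestedList = {
  object: 'block',
  type: 'list',
  nodes: [
    {
      object: 'block',
      type: 'list-item',
      nodes: [
        { object: 'text', text: 'Text Content' },
        {
          object: 'block',
          type: 'list',
          nodes: [
            {
              object: 'block',
              type: 'list-item',
              nodes: [{ object: 'text', text: 'Nested Text Content' }],
            },
          ],
        },
      ],
    },
  ],
}

const mixedPaths = findMixedBlocks(nestedList)
console.log(JSON.stringify(mixedPaths))
```

Here the outer list item at path [0] is reported, because it holds both a text node and a nested list block.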

The reason for this is that it makes implementing editing behaviors much simpler. It allows you to avoid a whole class of issues and questions that crop up related to intermingling. I realize there are no restrictions on HTML, but that’s also what makes the native contenteditable behaviors so hard to standardize and predict.

If someone wants to open a pull request with a specific improvement to the docs for this, I’d be happy to merge it. Otherwise I’m going to close this, since it isn’t a bug that we can address.

ianstormtaylor on Oct 3, 2018

Rough code here: https://gist.github.com/crisward/b61bd926d44c1e58d05f0c0c472262a4. There is a bit of sanitisation code mixed in with that method; I was using it when pulling in content from our older CMS.

crisward on Dec 11, 2018