markdoc: Support for inline and block HTML content

The CommonMark spec describes how HTML blocks and raw HTML should be treated:

4.6 HTML blocks

An HTML block is a group of lines that is treated as raw HTML (and will not be escaped in HTML output).

6.6 Raw HTML

Text between < and > that looks like an HTML tag is parsed as a raw HTML tag and will be rendered in HTML without escaping. Tag and attribute names are not limited to current HTML tags, so custom tags (and even, say, DocBook tags) may be used.

Example:

# Test <em>Emphasis</em>

<a href="http://github.com/markdoc/markdoc">GitHub</a>

Are there plans to support this aspect of the spec?

If so, it’s worth noting that other markdown implementations I’ve seen tend to follow the spec literally and do not parse HTML content in any way. In other words, raw HTML content is just treated as a string in the AST. The downside to this approach is that it prevents you from introspecting raw HTML content.

For example, if you wish to write a validator that ensures the integrity of links on a page, you don’t really care whether the links are authored natively in markdown or as raw <a> tags. Likewise, when generating a table of contents, you want to generate IDs for and include all header tags.

Cheers, and congrats on the first release!

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 14
  • Comments: 20 (8 by maintainers)

Most upvoted comments

A few more use cases where it’s super useful to render unescaped HTML:

Custom syntax highlighting

Best-in-class highlighters like Shiki Twoslash render code (i.e. Markdoc fence content) to a string of HTML with highlighter classes and/or styles already applied.

Returning these raw HTML strings is a big DX improvement, instead of having to write additional code to parse, walk, and transform the HTML back into Markdoc nodes.

Live demo showing how to use Vite + Markdoc + Shiki Twoslash, returning a simple string of syntax highlighted HTML: ~https://stackblitz.com/edit/vitejs-vite-su1ym2?file=vite.config.ts~

Edit: updated that demo with a hacky workaround to support unescaped HTML for this case:

  1. Render the code fence to syntax-highlighted HTML with Shiki Twoslash
  2. Generate a unique ID, and cache it along with the HTML string
  3. Return the ID from transform
  4. After rendering Markdoc to HTML, replace the unique ID with the cached HTML string

https://stackblitz.com/edit/vite-vue-3-markdoc-shiki-twoslash?file=vite.config.ts
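
For readers who don't want to open the demo, here is a minimal sketch of the placeholder idea from steps 2 to 4. It is not the code from the StackBlitz project: `createHtmlCache`, `put`, and `restore` are hypothetical names, and the ID scheme (a fixed prefix plus a counter) is an assumption chosen so that the ID contains only characters that an HTML-escaping renderer leaves untouched.

```javascript
// Hypothetical helper, not part of Markdoc or Shiki Twoslash.
function createHtmlCache(prefix = 'raw-html-placeholder') {
  const cache = new Map();
  let counter = 0;

  return {
    // Steps 2 and 3: store the HTML under a unique ID and hand back the ID,
    // which is what a node's transform function would return instead of HTML.
    put(html) {
      counter += 1;
      // Letters, digits, and hyphens only, so escaping cannot alter the ID.
      const id = `${prefix}-${counter}-end`;
      cache.set(id, html);
      return id;
    },

    // Step 4: after rendering to an HTML string, swap each ID for its HTML.
    restore(rendered) {
      let output = rendered;
      for (const [id, html] of cache) {
        // split/join replaces every occurrence and treats `$` in the HTML literally.
        output = output.split(id).join(html);
      }
      return output;
    },
  };
}
```

In use, the fence transform would call `put(highlightedHtml)` and return the ID, and the build step would pass the final rendered string through `restore`. The trailing `-end` keeps one ID from being a prefix of another (for example, ID 1 inside ID 10).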

Vue template generation

Since Vue templates are so similar to plain HTML, tools like Vitepress pass rendered Markdown (i.e. HTML strings) directly to the Vue template compiler. The rendered Markdown can include Vue components that are either written in the source .md file, or written with custom Markdown syntax and inserted automatically by the Markdown renderer.

See this project for an example of using a Markdown-It plugin to parse custom syntax and replace with Vue components.

Live demo showing how we could write Vue components in Markdown files if Markdoc supported unescaped HTML: https://stackblitz.com/edit/vite-vue-3-markdoc-shiki-twoslash?file=src%2Fhello-world.md

After a lot of debate and discussion about this particular feature, and after considering several different approaches to implementation, we have decided not to proceed with merging the PR.

We believe that the best way for users to natively support HTML content in Markdoc is to perform a transform on the token array between tokenization and parsing. This approach works significantly better and can be used today without requiring any changes to Markdoc. See the example here.

There are several major factors that have contributed to our decision to not move forward with the PR:

  • Supporting HTML in Markdoc is somewhat antithetical to our goals—Markdoc was designed to enforce separation of content, logic, and presentation. Rather than using arbitrary HTML markup, we think that users are generally better off explicitly defining custom tags that semantically describe the desired behavior. Given that the token transform approach provides a good escape hatch for specialized use cases where HTML support is absolutely needed, and that it can be implemented by users in a way that is tailored to their use case without requiring modification to Markdoc, we don’t really see a need to take further steps in Markdoc itself.
  • Although it is trivial to support arbitrary pass-through HTML string content in Markdoc’s HTML string renderer, the limitations of React make it difficult to support in the React renderer. React’s dangerouslySetInnerHTML has to be used on a parent element, which means there isn’t a good way to interleave arbitrary HTML string content around other React elements. The token transform approach bypasses this issue entirely and makes it possible to support the HTML content in React without having to rely on dangerouslySetInnerHTML.
  • In most cases, it seems like users who want to use raw HTML content in Markdoc also want to parse the HTML elements and make them participate directly in the Markdoc document hierarchy rather than treating them as unprocessed pass-through content. For that case, a token transform is a vastly better and more flexible approach than the pass-through approach in the PR.

The good news is that the token transform approach is relatively straightforward for users to implement. This approach consists of several steps:

  • Enable html support in the Markdoc tokenizer
  • After tokenization, iterate over the array of tokens and identify the ones that represent HTML content
  • Parse the HTML strings from those tokens and output a Markdoc opening or closing tag for each HTML opening and closing tag
  • Pass the tokens into the Markdoc parser and proceed as you would normally with Markdoc processing
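
As a rough illustration of the middle two steps (this is a sketch, not the published example), the transform can be written as a pure function over the token array. Several things here are assumptions: that HTML arrives as `html_inline` / `html_block` tokens with the markup in `content`, that Markdoc tags are represented as `tag_open` / `tag_close` tokens with a `meta` object, and the `html-tag` name with `name` / `attrs` attributes. The regex-based tag splitting is a deliberate simplification (double-quoted attributes only, no comments or stray `<`); a real HTML parser would be the safer choice.

```javascript
// Matches an opening, closing, or self-closing tag, or a run of plain text.
const HTML_PIECE = /<(\/?)([a-zA-Z][\w-]*)([^>]*?)(\/?)>|[^<]+/g;
// Matches name="value" pairs and bare boolean attribute names.
const ATTRIBUTE = /([\w:-]+)(?:="([^"]*)")?/g;

function parseAttributes(source) {
  const attrs = {};
  for (const [, name, value = ''] of source.matchAll(ATTRIBUTE)) {
    attrs[name] = value;
  }
  return attrs;
}

// Turns one HTML string into tag_open / text / tag_close tokens.
function htmlToTokens(html) {
  const output = [];
  for (const [piece, closing, name, rawAttrs, selfClosing] of html.matchAll(HTML_PIECE)) {
    if (!name) {
      // Plain text between tags; whitespace-only runs are dropped.
      if (piece.trim()) output.push({ type: 'text', content: piece });
      continue;
    }
    if (!closing) {
      output.push({
        type: 'tag_open',
        nesting: 1,
        meta: {
          tag: 'html-tag', // assumed Markdoc tag name
          attributes: [
            { type: 'attribute', name: 'name', value: name },
            { type: 'attribute', name: 'attrs', value: parseAttributes(rawAttrs) },
          ],
        },
      });
    }
    if (closing || selfClosing) {
      output.push({ type: 'tag_close', nesting: -1, meta: { tag: 'html-tag' } });
    }
  }
  return output;
}

// Walks the token array, replacing HTML tokens and recursing into inline children.
// Note: inline tokens are updated in place rather than copied.
function transformHtmlTokens(tokens) {
  return tokens.flatMap((token) => {
    if (token.type === 'html_inline' || token.type === 'html_block') {
      return htmlToTokens(token.content);
    }
    if (token.type === 'inline' && Array.isArray(token.children)) {
      token.children = transformHtmlTokens(token.children);
    }
    return [token];
  });
}
```

In a real pipeline the input would presumably come from a tokenizer created with `html: true` (for example `new Markdoc.Tokenizer({ html: true }).tokenize(source)`), and the returned array would be handed to `Markdoc.parse`. An `html-tag` entry in the schema would then decide how each element is rendered.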

I have published a full working example of this here. In that example, each HTML tag in the content is translated into an html-tag tag in the Markdoc AST—the HTML tag name and attributes become attributes on the Markdoc tag. You can simply write a custom transform function to control how the HTML tags are rendered, including emitting them literally in the output, which is what the example does.

Given the significant advantages of this approach over the implementation we were considering in the PR, we hope that those of you who want HTML support will be satisfied with this and not be too frustrated by our decision not to move forward with the PR. We plan to publish documentation that describes all of this in more detail at some point in the future so that users who want HTML support will know how to proceed.

I appreciate all of the feedback and thoughts that everyone shared on this feature. I particularly want to thank @alex-sherwin, whose workaround heavily influenced the example that I shared.

@marshall007: I found a workaround with React here until the PR of @rpaul-stripe is merged.

> We could have it use dangerouslySetInnerHTML, but that requires the HTML content to be enclosed in another node of some kind. This works for standalone block or inline HTML content that is embedded in the document, but it’s going to break when the HTML content is interleaved with other Markdown content. We really need something like https://github.com/facebook/react/issues/12014 in order to support it natively in React. There are third-party libraries like interweave that are probably viable for this now, but I think the end user will want control over it.

I worked around that issue by letting you choose the wrapper element yourself through an htmlWrapperTag attribute.

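// markdoc/tags.js
import React from 'react'

// Flattens whatever arrives as children (a string, a React element,
// or an array of either) into a single HTML string.
const toHtmlString = (node) => {
  if (node == null) return ''
  if (typeof node === 'string') return node
  if (Array.isArray(node)) return node.map(toHtmlString).join('')
  return toHtmlString(node.props?.children)
}

// Renders its children as unescaped HTML inside the chosen wrapper element.
// The HTML is injected as-is, so only use this with trusted content.
const UnescapedHtml = ({ htmlWrapperTag = 'div', children }) => {
  const CustomTag = htmlWrapperTag
  return <CustomTag dangerouslySetInnerHTML={{ __html: toHtmlString(children) }} />
}

const tags = {
  html: {
    render: UnescapedHtml,
    attributes: {
      htmlWrapperTag: { type: String },
      children: { type: String },
    },
  },
}

export default tags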

Usage in .md file:

# Test {% html htmlWrapperTag="em" %} Emphasis {% /html %}

{% html htmlWrapperTag="em" %} Emphasis {% /html %}

{% html htmlWrapperTag="div" %}

<div>
    <div>
        <cite>Test Cite</cite>
    </div>
    <div>
        <a href="http://github.com/markdoc/markdoc">GitHub</a>
    </div>
</div>

{% /html %}

@rpaul-stripe I used the code from your demo, and it works perfectly, thank you! In my case, I had YouTube embeds in the markdown that I wanted to render and, as you can see on this page, your solution allows me to render those perfectly.

In my case, I was not able to use a custom tag for those, as the embed code is generated by an external system.

Sorry for the long delay on this. I updated the branch so that it can be merged and I did some additional testing to make sure that this will work as expected. There’s now a PR pending here that I hope to merge soon: https://github.com/markdoc/markdoc/pull/344

I also documented the proposed feature and drafted a formal RFP, which is here: https://github.com/markdoc/markdoc/discussions/343

The markdown-it library has an option (html: true) that can be enabled to get it to identify HTML content in Markdown according to the rules in the CommonMark specification. When this option is passed into the Markdoc tokenizer, a document with HTML content will fail to parse because Markdoc doesn’t define a corresponding HTML node type.

It’s fairly trivial to add an HTML node type to schema.ts so that we can capture these HTML strings and expose them in the Markdoc AST. We can add a new renderable tree type that is specifically for pass-through HTML content, and we can have the HTML node type output that during Markdoc’s transform phase.

It’s straightforward to support this in the HTML renderer, because we can just append the content to the output string when we encounter it in the renderable tree. The problem, however, is figuring out a good way to support this in the React renderer.

We could have it use dangerouslySetInnerHTML, but that requires the HTML content to be enclosed in another node of some kind. This works for standalone block or inline HTML content that is embedded in the document, but it’s going to break when the HTML content is interleaved with other Markdown content. We really need something like this feature in order to support it natively in React. There are third-party libraries like interweave that are probably viable for this now, but I think the end user will want control over it.

What I’m leaning towards doing is making this just work in the HTML renderer and making it so that the React renderers accept an extra parameter with a callback that allows the user to control how raw HTML renderable tree nodes are handled. Then the user would have the option of using Interweave or doing whatever sanitization they want on the HTML during rendering.

> If so, it’s worth noting that other markdown implementations I’ve seen tend to follow the spec literally and do not parse HTML content in any way. In other words, raw HTML content is just treated as a string in the AST. The downside to this approach is that it prevents you from introspecting raw HTML content.

I think we’d follow the same approach by default, but it’s totally possible for a user to write an AST transform that walks over each HTML node in the AST, parses the string, and converts the actual markup to other Markdoc AST nodes.

@jerriep really sorry for the delay on this. It’s definitely still part of the roadmap. I will take another look at it this week and see if I can provide a clearer timeline for when this will be delivered.

Is this on the roadmap? I personally need it because I want to add an image with a custom width and alignment, and I don’t want to create a tag just for one image.

@mauriciabad we actually want to solve the issue you point out with this solution: https://github.com/markdoc/markdoc/issues/156

@jerriep yes, I am working on wrapping it up this week.

I am also very interested in seeing this working. @rpaul-stripe Do you have any idea whether (and when) your experimental branch will be merged?