ruff: Apply `ruff` to `markdown` code blocks

It would be cool to have the functionality to run ruff over python codeblocks like this does for black https://github.com/adamchainz/blacken-docs

About this issue

Original URL
State: open
Created a year ago
Reactions: 22
Comments: 23 (17 by maintainers)

Most upvoted comments

Formatting markdown code blocks is supported in v0.1.8: #9030

I believe that it is in the docstrings of functions as opposed to true code blocks in markdown that this issue was about

paddyroddy on Dec 26, 2023

for sanity it might make absolute sense to consider each markdown fragment a “own file” with a offset starting point

this capability might actually make it easier to make IDE/language server helpers, where one might want to reformat a function/class alone while editing a file (and the reformatting would basically work off a initial line/column/Position offset, plus the plain text of the lines intended for reformatting)

(im coming from https://github.com/entangled/entangled.py/issues/5)

RonnyPfannschmidt on Jul 29, 2023

Formatting markdown code blocks is supported in v0.1.8: https://github.com/astral-sh/ruff/pull/9030

konstin on Dec 26, 2023

I think the approach you propose sort of mimics LSPs in that each language can have different capabilities, implemented differently.

That’s a neat comparison!

Is it even worth continuing work on this before we can have those talks?

I’m not sure. What I outlined above is also only what I considered doing. There isn’t any alignment on the team. Let’s maybe first finish the fix refactor. That would allow me to focus on one side project only.

MichaReiser on Jul 3, 2023

@evanrittenhouse I was definitely considering them as discrete units when I raised this. For example here https://github.com/astro-informatics/sleplet/blob/c510795e7ebd91000b8afe53276e522b75cdfbb6/.github/workflows/examples.yml#L33 I use pytest-codeblocks to run some example code blocks (essentially acting as another type of test).

paddyroddy on Jun 30, 2023

Would it be sufficient to mutate the diagnostic.range or message.range directly by adding the offset?

Yep! I was working on it yesterday and that’s also the route I’ve chosen to go down. Sorry for the confusion, I should’ve waiting until I had a fully fleshed out design before weighing in.

My expectation is that pulling in pulldown-cmark increases the binary size because we include a new crate in our binary (that is 116 kB in size), or is pulldown-cmark already a (runtime) dependency ?

It’s not a runtime dependency, so it’ll increase the crate because of its size, yes. I meant that it’d be less than the size delta from installing treesitter and associated grammar(s) which aren’t totally necessary at this stage.

I think we can start prototyping with either and defer the decision to later.

Cool!

evanrittenhouse on May 1, 2023

Yup, that’s basically what I’m doing - adding a document_offset field to Messages and Diagnostics.

Which use cases require a diagnostic or message to know its document offset? Would it be sufficient to mutate the diagnostic.range or message.range directly by adding the offset? I’m asking because increasing the size of Message and Diagnostic decrease performance because:

More data that must be written to and read from the cache
Overall more data to write when creating/copying diagnostics and read when generating the output (even for diagnostics not using embedded languages)

Advantage of pulldown-cmark as mentioned above is it’s quite fast and won’t affect the binary size.

My expectation is that pulling in pulldown-cmark increases the binary size because we include a new crate in our binary (that is 116 kB in size), or is pulldown-cmark already a (runtime) dependency ?

Going with a treesitter integration requires installing the treesitter grammar for the language in question.

Yeah, using treesitter is more complicated because it requires a custom build step. I’m not too opinionated on if we should use treesitter or not. I think we can start prototyping with either and defer the decision to later.

MichaReiser on May 1, 2023

It seems easier to first handle only markdown and in a second time expand with the tree-sitter grammar.

JonathanPlasse on Apr 29, 2023

Yup, agree with all of your points @MichaReiser. As part of the design phase I’ve been thinking about how to generalize to different filetypes. It’s an interesting problem - I will take a look at Treesitter parsers as well. Going down that route, we could support different languages relatively easily. I am going to submit PRs for this incrementally so that we can iterate over the design. The first one will be a parser that pulls code blocks out of Markdown - I’m leaning towards using Treesitter since that architecture could make it easier to provide “plugins” for other languages in the future.

As far as the line mappings, I was planning on making a generic LineMap struct (or something along those lines) which would contain line numbers mapped to their offset in the containing document. That struct could then be reused for embedded languages, like you mentioned, mapping relative line numbers in the chunk of interest to the absolute line number of the document. It should provide a flexible way to treat code blocks differently, regardless of if it’s an embedded language, a markdown code block, a Jupyter cell, etc.

I’ll have to look into Messages and how they work before finalizing the code though. I’ve been reading through the Jupyter implementation as part of this.

evanrittenhouse on Apr 29, 2023

Using a markdown parser makes sense from my view. It may be worth to also consider tree-sitter because it supports many languages.

The part that’s unclear to me how we want to solve it in the short term is the mapping of column and line numbers in the Messageemitters. Ruff supports Jupyter-notebooks today, but the line-mapping is somewhat “hacked-in” to make it work, and I’m a bit reluctant to add more exceptions to the line-number mapping.

https://github.com/charliermarsh/ruff/blob/483f4799995d1aea7987e6f7ee403f0411060023/crates/ruff/src/message/text.rs#L70-L87

I assume a similar mapping will be necessary for Markdown files because we only pass the code block’s source to Ruff, but we should show users the absolute line number from the start of the Markdown document.

Long-term

Long-term, we’ll need to support the following two features.

Multi-language support

Ruff should support linting different file types. For example, Ruff should be able to lint SQL, python, and Markdown files. The filetype decides which specific linter Ruff uses to lint the file. This includes that the LSP is in sync with the file-types supported by the CLI and what extensions map to which languages. Rome supports this today.

Embedded Languages

The idea is that a document written in one language can contain code written in another language. Examples are:

Code blocks in markdown documents
SQL in Python code
Python code in Jupyter Notebooks

The markdown file handler would recognize code blocks and delegate to Ruff to decide how to parse, lint, and format the code block’s content. Ruff’s infrastructure would correctly map the line-numbers between the “virtual” documents and the “physical” documents on disk.

MichaReiser on Apr 29, 2023

I would vote to use a markdown parser for robustness as you suggested.

JonathanPlasse on Apr 27, 2023