Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MDX Support #81

Open
panzacoder opened this issue Feb 3, 2023 · 13 comments
Open

MDX Support #81

panzacoder opened this issue Feb 3, 2023 · 13 comments
Labels
enhancement New feature or request

Comments

@panzacoder
Copy link

Surprised this hasn't been mentioned here before, but has anyone considered adding MDX support for this parser? I understand it may take some consideration to support MDX fully with JSX syntax recognition, but it would be really nice if this just also supported files with the .mdx extension.

Thoughts?

@panzacoder panzacoder added the enhancement New feature or request label Feb 3, 2023
@pappasam
Copy link

pappasam commented Feb 3, 2023

This would be amazing

@MDeiml
Copy link
Collaborator

MDeiml commented Feb 4, 2023

I guess half of this is already kind of implemented (highlighting-wise) since HTML tags should be recognized as such in most situations. The rest (import/export/statements in curly braces) also seem to be quite well defined and easy to implement. Also I already made this parser configurable using compile time environment variables, as such this shouldn't require forking the parser.

All that saying that everything is in place to implement this without too much effort. I don't think this will be a priority for me anytime soon as I don't have too much time to spend on this parser, but I will gladly help anybody trying to add this functionality.

@olrtg
Copy link

olrtg commented Feb 28, 2023

Hey! I was looking into how to extend this parser in a new project but it seems we can do this here.. I'm new to this whole tree-sitter parser authoring.. but I'm willing to learn and work on this if @MDeiml help me with this (mostly answering questions and pointing to the right resources). How can we move forward with this? I'm honestly surprised that MDX is not supported outside vscode since is quite popular.

@MDeiml
Copy link
Collaborator

MDeiml commented Feb 28, 2023

First of all I suggest we settle on a spec for .mdx file which seems to be this. For a coarse overview on how this parser is structured see the Code Overview section in CONTRIBUTING.md.

There are two syntactical elements, that need to be implemented on top of this parser:

  1. import / export statements (these are part of the "block" parser)
  2. expressions in curly braces (these are part of the "inline" parser)

For 1. most of the code needs to go in tree-sitter-markdown/src/scanner.cc. import / export statements should work similar as "atx headings" in the sense that they are leaf blocks (they can't contain any other blocks) and they are recognized by a keyword at the beginning of the line. I recommend that you mainly copy the code for atx headings in tree-sitter-markdown/src/scanner.cc and tree-sitter-markdown/grammar.js and take it from there.

For 2. most of the code needs to go in tree-sitter-markdown-inline/src/scanner.cc. The curly braces work similar to "latex blocks" because they are also leaf elements (can't contain other inline elements) and are recognized by their opening and closing delimiters. Again I would recommend copying the code for latex blocks in tree-sitter-markdown-inline/src/scanner.cc and tree-sitter-markdown-inline/grammar.js and taking it from there.

You can test your code by adding files to tree-sitter-markdown/corpus and tree-sitter-markdown-inline/corpus respectively and running npm test.

Finally, mdx should probably be treated as an optional extension, so you can have a look at the beginning of common/grammar.js on how extensions are treated, but you can leave that until later.

(I'd like to emphasize again that you should mainly copy code and modify it to your needs. There are some intricacies to this parser that stem from the restrictions tree-sitters parsing model and that lead to some pretty weird and ugly code, but I'd like to avoid getting into that.)

Sry for the wall of text. If you have any more questions feel free to ask :)

@olrtg
Copy link

olrtg commented Feb 28, 2023

Thanks for the quick reply! Yeah they seem to have a specification in a github repo.. I think we can use the link you've provided plus the mentioned link to work on this.

I'll checkout the structure of the project in detail after work. Will reach again soon, probably with some questions.

@olrtg
Copy link

olrtg commented Mar 2, 2023

Hello! I think I'm starting to understand the workflow with this project in particular (and treesitter in general).. so I have a couple o questions just to confirm:

  1. You want me to follow the same code pattern that you have been using in this project which it seems to be (at least for the majority of rules) to handle the tokens with a external scanner.. so i just need to clone a leaf block in the grammar.js file for creating these new tokens (for example import/export) but do the actual parsing in the scanner.cc file without using the treesitter grammar rules..is that correct?
  2. The import block AST has for example a import_statement which contains a list of named_imports and then the source (will leave a screenshot).. meaning it seems complex (to me at least).. you want me to copy/paste the code for this tokens from the jsx parser or should I just import the parser and reuse those tokens?

image

Thanks for your time again!

@MDeiml
Copy link
Collaborator

MDeiml commented Mar 2, 2023

  1. Yes, sorry I should have specified that. You are correct, most of the parsing sadly has to be done in the external scanners, so what you are describing is what I meant. (Be careful that the external tokens in grammar.js and scanner.cc have the same order.)

  2. Actually you don't have to implement the "finer" structure of import / export statements in this parser. Instead you can capture the whole statement and then leave it to the js grammar to figure out the structure through "language injections" (the import statement as a whole is valid js syntax). For this you will need to edit tree-sitter-markdown/queries/injection.scm, in there you can see for example how this parser similarly does not actually care about the finer structure of html and leaves it to the html grammar to figure it out.

    Language injections are not really part of tree-sitter, but rather a best practice that's implemented in probably all editors that use tree-sitter, see also this section in the docs.

@olrtg
Copy link

olrtg commented Apr 3, 2023

Hey! Sorry I took some time off from personal projects. Yesterday I came back to it but I think is just hard for me to learn all at once (treesitter and specially c) from an existing project. I'm going to create a new project and try to extend this parser with the grammar.js file. And if I'm successful I will try to add the mdx support in this repo directly in c. Thanks again for your time!

@olrtg
Copy link

olrtg commented Apr 8, 2023

Did you consider publishing this as an npm package? I see that some parsers do this and then other parsers use them (with npm install) to extend the parser. It'll be great for my use case 😄

@MDeiml
Copy link
Collaborator

MDeiml commented May 12, 2023

Sure, I could try, though I'm not familiar. Could you give me an example for a parser that does that?

@bennypowers
Copy link

bennypowers commented May 19, 2023

Please consider doing this as a separate parser. MDX is a separate language with different syntax and semantics. We do a lot of html-in-markdown at work, and I'm concerned that layering mdx on top of the existing semantics will break things in subtle ways.

@paulkre
Copy link

paulkre commented Nov 5, 2023

Neovim's default Markdown syntax highlighting is currently better at handling MDX content than the Treesitter version

@MDeiml MDeiml mentioned this issue Feb 25, 2024
@datner
Copy link

datner commented Jul 22, 2024

@bennypowers considering that jsx exists fine alongside html, and considering that astros mdx-like-kinda-but-not has a functioning tree-sitter despite supporting more complex semantics I can't think of what can go wrong. Do you have some patterns you think would be extra fragile?

UPDATE: There seems to be a pretty thorough spec for mdx (retrieved from the link) and it has explicit deviations from markdown. Meaning that not all valid markdown is valid mdx and not all valid mdx is valid markdown. This makes it impossible to avoid a different parser for MDX and MD since they are in essence different syntaxes.

I'm getting de-ja-vu from the same bikeshed about typescript and tsx

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

7 participants