Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for newlines inside tables #10

Open
sa3dany opened this issue Feb 24, 2020 · 8 comments
Open

Support for newlines inside tables #10

sa3dany opened this issue Feb 24, 2020 · 8 comments

Comments

@sa3dany
Copy link

sa3dany commented Feb 24, 2020

Thanks for creating this tool. I found that the only way to have newlines in markdown tables is to use the <br> html tag. I haven't looked at the code yet but if you think this can be done easily just point me in the right direction and I can try to implement this and open a pull request.

@sa3dany sa3dany changed the title Support fot newlines inside tables Support for newlines inside tables Feb 24, 2020
@Mr0grog
Copy link
Owner

Mr0grog commented Feb 26, 2020

Oh, interesting! I haven’t really spent much work on tables at all as part of this. This one might be slightly complicated. Here’s a quick breakdown of how things work:

  1. Read in the HTML from the left-hand pane and parse it into a Rehype tree (kind of like a DOM).

  2. Clean up that tree so it makes sense when converting to Markdown (GDocs makes some funny HTML). This happens in lib/fix-google-html.js.

  3. Convert the resulting tree into a Markdown AST for Remark using rehype-remark. (We do a little extra to clean up an issue with multiple redundant spaces in lib/rehype-to-remark-with-spaces.js.)

  4. Convert that AST into actual Markdown code using remark-stringify.

The vast majority of everything here happens in step 2. I took a quick look at this, though, and the problem in this case is really in step 4 — that is, this is an issue with remark-stringify.

For example, this doc:

Screen Shot 2020-02-26 at 9 29 25 AM

Results in a Markdown AST with two paragraphs in the table cells after step 3:

Resulting mdast tree

The markdown AST syntax has a break node type and I tried throwing in a quick hack between steps 3 and 4 that converted the two paragraphs to two text nodes with a break in between them, but that resulted in two spaces between the text. 🙁

| Heading 1 | Heading 2 | Heading 3   |
| --------- | --------- | ----------- |
| Abc   def | One line  | Two   lines |
| Yes       | No        | Maybe       |

SO! I think there are two options here…

  1. Add custom visitors to remark-stringify to change how paragraphs are an other things are handled in table cells. See the docs here: https://www.npmjs.com/package/remark-stringify#extending-the-compiler

  2. Since what remark-stringify is doing here is clearly wrong, it might make sense to file an issue and implement a fix directly in remark-stringify.

Doing (1) would fix things pretty quickly here, but (2) would be a little more stable and long-term. (You could also do both!)

@wooorm
Copy link

wooorm commented Mar 1, 2020

@Mr0grog In markdown, you can’t have a newline in a table cell, so break nodes in table cells are replaced with spaces.

The question then is whether you’d want <br> elements in Markdown. If you’d really want that, see https://github.com/syntax-tree/hast-util-to-mdast#html-in-markdown

@Mr0grog
Copy link
Owner

Mr0grog commented Mar 1, 2020

@wooorm right! I definitely get that, sorry if I wasn’t clear.

I guess my thinking was that it seems wrong for stringifying the AST to be a [silently] lossy operation — if there’s a break node in the Markdown tree (as in this case), I’d expect the outputted Markdown to represent a line break somehow (e.g. with a <br>), or otherwise throw an error saying the tree is invalid (e.g. I should have had an html node or extended the compiler).

@wooorm
Copy link

wooorm commented Mar 2, 2020

I added a note that the break node should not appear in the content model of table cells.

I’d expect the outputted Markdown to represent a line break somehow (e.g. with a <br>)
or otherwise throw an error saying the tree is invalid (e.g. I should have had an html node or extended the compiler).

The goal of mdast is to be a lossy (it’s an ast, not a cst) and easy to handle format for markdown that handles user-provided data. “Silently” lossy is what it is: that’s the tradeoff made against complexity and crashing.
The break element in mdast represents a markdown break that can occur in a paragraphs with two spaces at the end of a line, or an “escaped” newline in CM; not a <br> element (that’s where html nodes are for). I think throwing an error would be bad.

hast-util-to-mdast favours readable non-html markdown. To illustrate, see how HTML forms are mapped from HTML to markdown. This isn’t always what you want, and that’s where the “HTML in Markdown” section comes in.

How important is the newline? People probably add enters accidentally to google docs. I’d say that if you’re transforming GDocs -> Markdown, you can only do so while losing info 🤷‍♂️

@Mr0grog
Copy link
Owner

Mr0grog commented Mar 2, 2020

The break element in mdast represents a markdown break that can occur in a paragraphs with two spaces at the end of a line, or an “escaped” newline in CM; not a <br> element (that’s where html nodes are for).

Oh! I thought that was the Markdown equivalent of a <br>, just like *whatever* is the Markdown equivalent of <em>whatever</em>. Is that a bad way to conceptualize it? (i.e. this represents a line break, however that has to be written in actual Markdown, based on context and Markdown syntax family. I definitely understand it’s a Markdown node and not an HTML one.)

How important is the newline? People probably add enters accidentally to google docs.

Hmmm, I definitely feel like I’ve seen people do formatting intentionally in tables like this, though. It seems hard to make a judgement on it in the abstract. @sa3dany do you have some use cases that you needed this for?

if you’re transforming GDocs -> Markdown, you can only do so while losing info 🤷‍♂️

Fair point! I naïvely assumed where I was losing info and where I should be worried about it would be in the hast-util-to-mdast step, rather than in the stringify step, and was a bit thrown by that. (e.g. I figured that if hast-util-to-mdast generated a tree with two paragraph nodes inside a tableCell node as above, I should expect remark-stringify to support it, and that most of my problems would be the hast-util-to-mdast step losing some data from the HTML, not the stringify step losing data from the AST.)

But I hear what you’re saying about this being an abstract tree, and that because the mdast universe is meant to be loose rather than strict, it should be on this program to handle the situation, rather than on mdast to figure out how to represent something that Markdown doesn’t handle well to begin with. 👍


@sa3dany are you still interested in working on this? If this is an important feature to you, I think the right answer here is one of:

  1. Use the handlers option to do some custom conversion for paragraphs in table cells when calling rehype2remark() in lib/rehype-to-remark-with-spaces.js:

    const convert = rehype2remark.apply(this, arguments);

  2. Create a new file in lib to wrap remark-stringify that extends the compiler to provide custom serialization for paragraph or break nodes inside tableCell nodes. It would replace this line in index.js:

    .use(stringify, {listItemIndent: '1'});

@wooorm
Copy link

wooorm commented Mar 6, 2020

Oh! I thought that was the Markdown equivalent of a <br>, just like *whatever* is the Markdown equivalent of <em>whatever</em>.

They are somewhat equivalent, but the reason of mdast’s existence is exactly because markdown does not always map directly to HTML. It’s a nuanced (perhaps pedantic) difference, but I think of it as: mdast represents only markdown, hast represent only HTML, and in different scenarios, you’d want to transform between them differently as well!

I naïvely assumed where I was losing info and where I should be worried about it would be in the hast-util-to-mdast step

I think your assumption is correct: hast-util-to-mdast should not but a break node in table cells. I don’t think it should use raw HTML either. So the only thing that makes sense is to use a whitespace, which then is the same as the current behaviour 🤷‍♂️
We could add a hook to handle brInTable and brAnywhereElse or so, but that’s very different than how options currently work.
hast-util-to-mdast should also turn the HTML content such as block quotes into phrasing content I guess.

@Mr0grog
Copy link
Owner

Mr0grog commented Mar 6, 2020

We could add a hook to handle brInTable and brAnywhereElse or so, but that’s very different than how options currently work.

What if there were a boolean setting for supportComplexMarkupInTables (or something along those lines) that output html nodes if there is content in table cells that can’t be represented in Markdown? I feel like the fundamental issue here is about more than just line breaks/paragraphs.

Or, more broad: a boolean strict setting (false by default) that causes hast-util-to-mdast to output html nodes in situations where it can’t come up with a reasonable Markdown alternative?

@Mr0grog
Copy link
Owner

Mr0grog commented Mar 6, 2020

FWIW, I’m not sure a special hook for brInTable and brAnywhereElse is necessary — I should be able to do the same by setting custom handlers for td, th, and br right now. (On the flip side, providing some contextual or parent-related information to the handlers might be nice, so I’d only have to provide a handler for br.)

Mr0grog added a commit that referenced this issue Mar 6, 2020
Plain Markdown syntax doesn't support more than one line in a table cell, but Google Docs lets you have many paragraphs or lines in each cell. This update supports multiple lines by adding a custom handler for table cell nodes which outputs MDast `html` nodes representing `<br>` tags to use in the table.

Fixes #10.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants