Support for newlines inside tables #10

sa3dany · 2020-02-24T16:25:23Z

Thanks for creating this tool. I found that the only way to have newlines in markdown tables is to use the   html tag. I haven't looked at the code yet but if you think this can be done easily just point me in the right direction and I can try to implement this and open a pull request.

The text was updated successfully, but these errors were encountered:

Mr0grog · 2020-02-26T18:05:45Z

Oh, interesting! I haven’t really spent much work on tables at all as part of this. This one might be slightly complicated. Here’s a quick breakdown of how things work:

Read in the HTML from the left-hand pane and parse it into a Rehype tree (kind of like a DOM).
Clean up that tree so it makes sense when converting to Markdown (GDocs makes some funny HTML). This happens in lib/fix-google-html.js.
Convert the resulting tree into a Markdown AST for Remark using rehype-remark. (We do a little extra to clean up an issue with multiple redundant spaces in lib/rehype-to-remark-with-spaces.js.)
Convert that AST into actual Markdown code using remark-stringify.

The vast majority of everything here happens in step 2. I took a quick look at this, though, and the problem in this case is really in step 4 — that is, this is an issue with remark-stringify.

For example, this doc:

Results in a Markdown AST with two paragraphs in the table cells after step 3:

The markdown AST syntax has a break node type and I tried throwing in a quick hack between steps 3 and 4 that converted the two paragraphs to two text nodes with a break in between them, but that resulted in two spaces between the text. 🙁

| Heading 1 | Heading 2 | Heading 3   |
| --------- | --------- | ----------- |
| Abc   def | One line  | Two   lines |
| Yes       | No        | Maybe       |

SO! I think there are two options here…

Add custom visitors to remark-stringify to change how paragraphs are an other things are handled in table cells. See the docs here: https://www.npmjs.com/package/remark-stringify#extending-the-compiler
Since what remark-stringify is doing here is clearly wrong, it might make sense to file an issue and implement a fix directly in remark-stringify.

Doing (1) would fix things pretty quickly here, but (2) would be a little more stable and long-term. (You could also do both!)

wooorm · 2020-03-01T11:24:52Z

@Mr0grog In markdown, you can’t have a newline in a table cell, so break nodes in table cells are replaced with spaces.

The question then is whether you’d want   elements in Markdown. If you’d really want that, see https://github.com/syntax-tree/hast-util-to-mdast#html-in-markdown

Mr0grog · 2020-03-01T18:41:29Z

@wooorm right! I definitely get that, sorry if I wasn’t clear.

I guess my thinking was that it seems wrong for stringifying the AST to be a [silently] lossy operation — if there’s a break node in the Markdown tree (as in this case), I’d expect the outputted Markdown to represent a line break somehow (e.g. with a  ), or otherwise throw an error saying the tree is invalid (e.g. I should have had an html node or extended the compiler).

wooorm · 2020-03-02T14:30:40Z

I added a note that the break node should not appear in the content model of table cells.

I’d expect the outputted Markdown to represent a line break somehow (e.g. with a  )
or otherwise throw an error saying the tree is invalid (e.g. I should have had an html node or extended the compiler).

The goal of mdast is to be a lossy (it’s an ast, not a cst) and easy to handle format for markdown that handles user-provided data. “Silently” lossy is what it is: that’s the tradeoff made against complexity and crashing.
The break element in mdast represents a markdown break that can occur in a paragraphs with two spaces at the end of a line, or an “escaped” newline in CM; not a   element (that’s where html nodes are for). I think throwing an error would be bad.

hast-util-to-mdast favours readable non-html markdown. To illustrate, see how HTML forms are mapped from HTML to markdown. This isn’t always what you want, and that’s where the “HTML in Markdown” section comes in.

How important is the newline? People probably add enters accidentally to google docs. I’d say that if you’re transforming GDocs -> Markdown, you can only do so while losing info 🤷‍♂️

Mr0grog · 2020-03-02T17:32:43Z

The break element in mdast represents a markdown break that can occur in a paragraphs with two spaces at the end of a line, or an “escaped” newline in CM; not a   element (that’s where html nodes are for).

Oh! I thought that was the Markdown equivalent of a  , just like *whatever* is the Markdown equivalent of whatever. Is that a bad way to conceptualize it? (i.e. this represents a line break, however that has to be written in actual Markdown, based on context and Markdown syntax family. I definitely understand it’s a Markdown node and not an HTML one.)

How important is the newline? People probably add enters accidentally to google docs.

Hmmm, I definitely feel like I’ve seen people do formatting intentionally in tables like this, though. It seems hard to make a judgement on it in the abstract. @sa3dany do you have some use cases that you needed this for?

if you’re transforming GDocs -> Markdown, you can only do so while losing info 🤷‍♂️

Fair point! I naïvely assumed where I was losing info and where I should be worried about it would be in the hast-util-to-mdast step, rather than in the stringify step, and was a bit thrown by that. (e.g. I figured that if hast-util-to-mdast generated a tree with two paragraph nodes inside a tableCell node as above, I should expect remark-stringify to support it, and that most of my problems would be the hast-util-to-mdast step losing some data from the HTML, not the stringify step losing data from the AST.)

But I hear what you’re saying about this being an abstract tree, and that because the mdast universe is meant to be loose rather than strict, it should be on this program to handle the situation, rather than on mdast to figure out how to represent something that Markdown doesn’t handle well to begin with. 👍

@sa3dany are you still interested in working on this? If this is an important feature to you, I think the right answer here is one of:

Use the handlers option to do some custom conversion for paragraphs in table cells when calling rehype2remark() in lib/rehype-to-remark-with-spaces.js:

google-docs-to-markdown/lib/rehype-to-remark-with-spaces.js

Line 36 in 47a3fd3

const convert = rehype2remark.apply(this, arguments);
Create a new file in lib to wrap remark-stringify that extends the compiler to provide custom serialization for paragraph or break nodes inside tableCell nodes. It would replace this line in index.js:

google-docs-to-markdown/index.js

Line 16 in 47a3fd3

.use(stringify, {listItemIndent: '1'});

wooorm · 2020-03-06T08:52:07Z

Oh! I thought that was the Markdown equivalent of a  , just like *whatever* is the Markdown equivalent of whatever.

They are somewhat equivalent, but the reason of mdast’s existence is exactly because markdown does not always map directly to HTML. It’s a nuanced (perhaps pedantic) difference, but I think of it as: mdast represents only markdown, hast represent only HTML, and in different scenarios, you’d want to transform between them differently as well!

I naïvely assumed where I was losing info and where I should be worried about it would be in the hast-util-to-mdast step

I think your assumption is correct: hast-util-to-mdast should not but a break node in table cells. I don’t think it should use raw HTML either. So the only thing that makes sense is to use a whitespace, which then is the same as the current behaviour 🤷‍♂️
We could add a hook to handle brInTable and brAnywhereElse or so, but that’s very different than how options currently work.
hast-util-to-mdast should also turn the HTML content such as block quotes into phrasing content I guess.

Mr0grog · 2020-03-06T22:08:25Z

We could add a hook to handle brInTable and brAnywhereElse or so, but that’s very different than how options currently work.

What if there were a boolean setting for supportComplexMarkupInTables (or something along those lines) that output html nodes if there is content in table cells that can’t be represented in Markdown? I feel like the fundamental issue here is about more than just line breaks/paragraphs.

Or, more broad: a boolean strict setting (false by default) that causes hast-util-to-mdast to output html nodes in situations where it can’t come up with a reasonable Markdown alternative?

Mr0grog · 2020-03-06T22:16:35Z

FWIW, I’m not sure a special hook for brInTable and brAnywhereElse is necessary — I should be able to do the same by setting custom handlers for td, th, and br right now. (On the flip side, providing some contextual or parent-related information to the handlers might be nice, so I’d only have to provide a handler for br.)

Plain Markdown syntax doesn't support more than one line in a table cell, but Google Docs lets you have many paragraphs or lines in each cell. This update supports multiple lines by adding a custom handler for table cell nodes which outputs MDast `html` nodes representing ` ` tags to use in the table. Fixes #10.

sa3dany changed the title ~~Support fot newlines inside tables~~ Support for newlines inside tables Feb 24, 2020

Mr0grog mentioned this issue Mar 6, 2020

Support multiple lines in table cells #11

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for newlines inside tables #10

Support for newlines inside tables #10

sa3dany commented Feb 24, 2020

Mr0grog commented Feb 26, 2020

wooorm commented Mar 1, 2020

Mr0grog commented Mar 1, 2020

wooorm commented Mar 2, 2020

Mr0grog commented Mar 2, 2020 •

edited

Loading

wooorm commented Mar 6, 2020

Mr0grog commented Mar 6, 2020

Mr0grog commented Mar 6, 2020

Support for newlines inside tables #10

Support for newlines inside tables #10

Comments

sa3dany commented Feb 24, 2020

Mr0grog commented Feb 26, 2020

wooorm commented Mar 1, 2020

Mr0grog commented Mar 1, 2020

wooorm commented Mar 2, 2020

Mr0grog commented Mar 2, 2020 • edited Loading

wooorm commented Mar 6, 2020

Mr0grog commented Mar 6, 2020

Mr0grog commented Mar 6, 2020

Mr0grog commented Mar 2, 2020 •

edited

Loading