Feat(athena): Improve DDL query support #4099
Conversation
sqlglot/dialects/athena.py
first_token = raw_tokens[0]
if first_token.token_type == TokenType.CREATE:
I'm not sure this approach is robust. For example, what if you have leading CTEs? The first token would then be WITH. You could also have (redundant) leading semicolons, etc.
Yeah, I was anticipating needing to tweak this. I forgot about CTEs; I'll add support for that.
How would you deal with redundant semicolons? Would you first filter the token stream down to just the tokens we care about and then do the checks?
So if you have leading CTEs, the existing logic works and returns False. It doesn't need to check for them, because only a SELECT query would have leading CTEs. All SELECT queries should use the Trino tokenizer, so returning False triggers the use of the Trino tokenizer.
If you have a CREATE TABLE .. AS WITH (...) SELECT query, a SELECT still appears in the tokens, so this is still correctly detected as a CTAS and returns False, which triggers the use of the Trino tokenizer.
How do you test redundant semicolons? Trying to parse something like `; CREATE SCHEMA FOO;` just throws a parse error.
So if you have leading CTEs, the existing logic works and returns False. It doesn't need to check for them, because only a SELECT query would have leading CTEs. All SELECT queries should use the Trino tokenizer, so returning False triggers the use of the Trino tokenizer.
Ah, interesting, I didn't realize that - that simplifies the problem then.
How do you test redundant semicolons?
Using parse:
>>> import sqlglot
>>> sqlglot.parse("; create schema foo ;;")
[None, Create(
this=Table(
db=Identifier(this=foo, quoted=False)),
kind=SCHEMA), None]
Tbh this is an edge case, but it's good to handle anyway; you can skip the leading tokens until you find the first non-semicolon, or something like that.
I struggled to write a test for this and eventually realized that the issues I had, which I thought were parsing issues, were really tokenization issues. It appears the Trino parser is capable of handling all of the Hive DDL as long as it is tokenized correctly.
So I was able to remove the delegation to the Hive parser entirely and just worry about the generation side. I may need to revisit this in the future, but I wasn't able to find a query to add to the tests that uses Hive syntax and also causes the Trino parser to fail.
LGTM - happy to merge and address semicolons later if we wanna get this in ASAP. Should be a low lift though.
Force-pushed from a95ca19 to 6038a60.
Athena is really a split between two different engines: one handles most DDL (Hive) and the other handles the remaining DDL and all of the DML (Trino).
Since the Athena dialect already extends the Trino dialect, this PR:
Note that I was unable to find an example of a query that the Athena parser couldn't handle (as long as the Hive tokens were available at the tokenization stage), so only the Generator step contains the branching logic.
This approach maximizes code reuse, since we already know how to deal with 99% of this syntax, just not within the same dialect.