
Feat(athena): Improve DDL query support #4099

Merged 3 commits into main on Sep 12, 2024

Conversation

@erindru (Collaborator) commented Sep 10, 2024

Athena is really a split between two different engines. One handles most DDL (Hive) and the other handles the remaining DDL and all of the DML (Trino).

Since the Athena dialect already extends the Trino dialect, this PR:

  • Adds Hive tokens to the tokenizer
  • Tries to figure out (at generate time) if an Expression should be handled by Hive
  • If it should, the Hive Generator is used
  • Otherwise, the Athena Generator is used

Note that I was unable to find an example of a query that the Athena parser couldn't handle (as long as the Hive tokens were available at the tokenization stage), so only the Generator step contains the branching logic.

This approach maximizes code re-use, since we already know how to deal with 99% of this syntax, just not within the same dialect.
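The dispatch described in the bullet points above can be sketched as follows. This is a minimal, self-contained illustration of the idea, not sqlglot's actual API; the `Expression` class and the function names are hypothetical stand-ins.

```python
# Illustrative sketch of generate-time dispatch between Hive and Trino
# generators. All names here are hypothetical, not sqlglot's real API.

class Expression:
    """Toy stand-in for a parsed SQL expression."""
    def __init__(self, kind: str, is_ctas: bool = False):
        self.kind = kind        # e.g. "CREATE", "DROP", "SELECT", "INSERT"
        self.is_ctas = is_ctas  # CREATE TABLE ... AS SELECT ...

def uses_hive_generator(expression: Expression) -> bool:
    # On Athena, plain DDL (CREATE/ALTER/DROP without a query body)
    # follows Hive syntax; CTAS and all DML follow Trino syntax.
    if expression.kind in {"CREATE", "ALTER", "DROP"}:
        return not expression.is_ctas
    return False

def generate(expression: Expression) -> str:
    # Pick a generator per expression, mirroring the branching above.
    dialect = "hive" if uses_hive_generator(expression) else "trino"
    return f"<{dialect} SQL for {expression.kind}>"
```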

Comment on lines 13 to 14
first_token = raw_tokens[0]
if first_token.token_type == TokenType.CREATE:
@georgesittas (Collaborator) commented Sep 10, 2024


I'm not sure this approach is robust. For example, what if you have leading CTEs? The first token would then be WITH. You could also have (redundant) leading semicolons, etc.
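The fragility being flagged here can be shown with toy token lists (plain strings, not sqlglot's real Token objects):

```python
# Toy illustration of the concern above: a naive first-token check,
# applied to simplified string tokens rather than sqlglot Token objects.

def starts_with_create(tokens: list[str]) -> bool:
    return bool(tokens) and tokens[0] == "CREATE"

plain = ["CREATE", "SCHEMA", "foo"]                      # first token CREATE
with_cte = ["WITH", "c", "AS", "(", "SELECT", "1", ")"]  # first token WITH
leading_semi = [";", "CREATE", "SCHEMA", "foo"]          # first token ";"
```

Only the first list passes the check, even though the third is still a CREATE statement.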

@erindru (Author)


Yeah, I was anticipating needing to tweak this. I forgot about CTEs; I'll add support for that.

How would you deal with redundant semicolons? Would you first filter the token stream to just the ones we care about and then do the checks?

@erindru (Author) commented Sep 10, 2024


So if you have leading CTEs, the existing logic works and returns False. It doesn't need to check for them, because only a SELECT query would have leading CTEs. All SELECT queries should use the Trino tokenizer, so returning False triggers the use of the Trino tokenizer.

If you have a CREATE TABLE .. AS WITH (...) SELECT query, a SELECT still appears in the tokens, so this is still correctly detected as a CTAS and returns False, which triggers the use of the Trino tokenizer.
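The reasoning in this comment can be captured in a few lines. This is a self-contained sketch over string tokens, with a hypothetical `is_hive_ddl` helper; it is not the actual code in the PR.

```python
# Sketch of the logic described above: a statement is Hive DDL only when
# it starts with CREATE and contains no SELECT. A SELECT anywhere means
# CTAS (or a plain query, possibly with leading CTEs), which Trino
# should handle. Tokens are simplified strings, not sqlglot Tokens.

def is_hive_ddl(tokens: list[str]) -> bool:
    if not tokens or tokens[0] != "CREATE":
        return False  # plain SELECTs and leading CTEs -> Trino
    return "SELECT" not in tokens  # CTAS contains SELECT -> Trino
```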

@erindru (Author)


How do you test redundant semicolons?

Trying to parse something like ; CREATE SCHEMA FOO; just throws a parse error

@georgesittas (Collaborator)


> So if you have leading CTEs, the existing logic works and returns False. It doesn't need to check for them, because only a SELECT query would have leading CTEs. All SELECT queries should use the Trino tokenizer, so returning False triggers the use of the Trino tokenizer.

Ah, interesting, I didn't realize that - that simplifies the problem then.

> How do you test redundant semicolons?

Using parse:

>>> import sqlglot
>>> sqlglot.parse("; create schema foo ;;")
[None, Create(
  this=Table(
    db=Identifier(this=foo, quoted=False)),
  kind=SCHEMA), None]

Tbh this is an edge case, but it's good to handle anyway; you can skip the leading tokens until you find the first non-semicolon or something.
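One way to implement the suggestion above is to drop leading semicolons before inspecting the first meaningful token. This is an illustrative sketch over string tokens, not the code that landed in the PR:

```python
# Skip redundant leading semicolons, then apply the first-token check,
# so "; CREATE SCHEMA foo ;;" is still recognized as a CREATE.

def skip_leading_semicolons(tokens: list[str]) -> list[str]:
    i = 0
    while i < len(tokens) and tokens[i] == ";":
        i += 1
    return tokens[i:]

def starts_with_create(tokens: list[str]) -> bool:
    tokens = skip_leading_semicolons(tokens)
    return bool(tokens) and tokens[0] == "CREATE"
```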

@erindru (Author)


I struggled to write a test for this and eventually realized that the issues I thought were parsing issues were really tokenization issues. It appears the Trino parser is capable of handling all the Hive DDL, as long as it is tokenized correctly.

So I was able to remove the delegation to the Hive parser entirely and just worry about the generation side. I may need to revisit this in the future, but I wasn't able to find a query to add to the tests that uses Hive syntax and also causes the Trino parser to fail.
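The final shape described here can be sketched as one keyword-superset tokenizer feeding a single Trino-style parser, with branching confined to generation. The keyword sets and the crude whitespace tokenizer below are illustrative assumptions, not sqlglot's real classes or keyword lists:

```python
# Hedged sketch of the final design: the tokenizer knows both Trino and
# Hive keywords, one parser handles everything, and only the generator
# branches. All names and keyword sets here are illustrative.

TRINO_KEYWORDS = {"SELECT", "CREATE", "TABLE", "SCHEMA", "WITH", "AS"}
HIVE_EXTRA_KEYWORDS = {"EXTERNAL", "PARTITIONED", "STORED", "MSCK"}

ATHENA_KEYWORDS = TRINO_KEYWORDS | HIVE_EXTRA_KEYWORDS  # superset

def tokenize(sql: str) -> list[str]:
    # Crude whitespace tokenizer: uppercase anything that is a keyword,
    # leave everything else (identifiers, etc.) untouched.
    words = sql.replace(";", " ; ").split()
    return [w.upper() if w.upper() in ATHENA_KEYWORDS else w for w in words]
```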

@georgesittas (Collaborator) left a comment


LGTM - happy to merge and address semicolons later if we wanna get this in ASAP. Should be a low lift though.

@erindru merged commit 7ac21a0 into main on Sep 12, 2024
6 checks passed
@erindru deleted the erin/athena-create-table-ddl branch on September 12, 2024 at 03:43