Feat(athena): Improve DDL query support #4099
Conversation
sqlglot/dialects/athena.py
first_token = raw_tokens[0]
if first_token.token_type == TokenType.CREATE:
I'm not sure this approach is robust. For example, what if you have leading CTEs? The first token would then be WITH. You could also have (redundant) leading semicolons, etc.
Yeah, I was anticipating needing to tweak this. I forgot about CTEs; I'll add support for that.
How would you deal with redundant semicolons? Would you first filter the token stream down to just the tokens we care about and then do the checks?
So if you have leading CTEs, the existing logic works and returns False. It doesn't need to check for them, because only a SELECT query would have leading CTEs. All SELECT queries should use the Trino tokenizer, so returning False triggers the use of the Trino tokenizer.
If you have a CREATE TABLE .. AS WITH (...) SELECT query, a SELECT still appears in the tokens, so this is still correctly detected as a CTAS and returns False, which triggers the use of the Trino tokenizer.
How do you test redundant semicolons? Trying to parse something like `; CREATE SCHEMA FOO;` just throws a parse error.
So if you have leading CTEs, the existing logic works and returns False. It doesn't need to check for them, because only a SELECT query would have leading CTEs. All SELECT queries should use the Trino tokenizer, so returning False triggers the use of the Trino tokenizer.
Ah, interesting, I didn't realize that - that simplifies the problem then.
How do you test redundant semicolons?
Using parse:
>>> import sqlglot
>>> sqlglot.parse("; create schema foo ;;")
[None, Create(
this=Table(
db=Identifier(this=foo, quoted=False)),
kind=SCHEMA), None]
Tbh this is an edge case, but it's good to handle anyway; you can skip the leading tokens until you find the first non-semicolon, or something like that.
I struggled to write a test for this and eventually realized that the issues I had, which I thought were parsing issues, were really tokenization issues. It appears the Trino parser is capable of handling all of the Hive DDL as long as it is tokenized correctly.
So I was able to remove the delegation to the Hive parser entirely and just worry about the generation side. I may need to revisit this in the future, but I wasn't able to find a query to add to the tests that uses Hive syntax and also causes the Trino parser to fail.
LGTM - happy to merge and address semicolons later if we wanna get this in ASAP. Should be a low lift though.
Force-pushed from a95ca19 to 6038a60.
Athena is really a split between two different engines: one handles most DDL (Hive) and the other handles the remaining DDL and all of the DML (Trino).
Since the Athena dialect already extends the Trino dialect, this PR:
Note that I was unable to find an example of a query that the Athena parser couldn't handle (as long as the Hive tokens were available at the tokenization stage), so only the Generator step contains the branching logic.
This approach maximizes code reuse, since we already know how to deal with 99% of this syntax, just not within the same dialect.