Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow skipping utf8 converison in Python3 #548

Merged
merged 14 commits into from
Aug 29, 2024
Merged

Conversation

mszjat
Copy link
Contributor

@mszjat mszjat commented Aug 12, 2024

Copy link
Collaborator

@csringhofer csringhofer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in general, left a few comments.
It would be also nice to have some tests about getting the expected Python types for the different values of this parameter. There are already some similar tests in test_data_types.py

impala/hiveserver2.py Outdated Show resolved Hide resolved
impala/hiveserver2.py Outdated Show resolved Hide resolved
@@ -1040,7 +1049,7 @@ def __init__(self, trowset, expect_more_rows, schema, convert_types=True):

# STRING columns are read as binary and decoded here to be able to handle
# non-valid utf-8 strings in Python 3.
if six.PY3:
if six.PY3 and convert_strings_to_unicode:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case when convert_strings_to_unicode is False but convert_types is True also needs to be consider - the code in _convert_values probably expects Unicode when converting DATE/TIMESTAMP/DECIMAL types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, please check out new design.

@@ -96,6 +97,9 @@ def cursor(self, user=None, configuration=None, convert_types=True,
When `False`, timestamps and decimal values will not be converted
to Python `datetime` and `Decimal` values. (These conversions are
expensive.) Only applies when using HS2 protocol versions > 6.
convert_strings_to_unicode : bool, optional
When `False`, string values will not be converted to unicode.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return type could be mentioned both for the True/False case.

  • It should be mentioned that this has effect only in Python3.

@@ -96,6 +97,12 @@ def cursor(self, user=None, configuration=None, convert_types=True,
When `False`, timestamps and decimal values will not be converted
to Python `datetime` and `Decimal` values. (These conversions are
expensive.) Only applies when using HS2 protocol versions > 6.
convert_strings_to_unicode : bool, optional
When `True`, the following types will be converted to unicode:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be mentioned that these are the types that are transmitted as strings in HS2 protocol

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

self._convert_strings_to_unicode(type_, is_null, values)
if convert_strings_to_unicode:
self._convert_strings_to_unicode(type_, is_null, values,
types=["STRING", "LIST", "MAP", "STRUCT", "UNIONTYPE", "NULL", "VARCHAR", "CHAR", "TIMESTAMP", "DECIMAL", "DATE"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These type lists could be moved to a constant,e.g HS2_STRING_TYPES and CONVERTED_TYPES

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely, fixed


@fixture(scope='session')
def cur_noconv(con):
cur = con.cursor(convert_types=True, convert_strings_to_unicode=False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here and in test_data_types.py: is there a reason for removing the final new lines?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's by mistake, fixed

setup_test_date_basic(cur, date_table)

@pytest.mark.connect
def test_date_basic_noconv(cur_noconv, date_table):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add tests more for types - date is a special case where string conversion is done regardless of convert_strings_to_unicode

@@ -113,3 +113,9 @@ def cur(con):
cur = con.cursor()
yield cur
cur.close()

@fixture(scope='session')
def cur_noconv(con):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convert_types is True here, so I think that cur_no_string_conversion or something similar would be better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree.


@pytest.mark.connect
def test_date_basic(cur, date_table):
def setup_test_date(cur, date_table):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming doesn't make sense to me - this function does a select and checks the result, so it is not a setup function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed

try:
yield table_name
finally:
cur.execute("DROP TABLE {0}".format(table_name))


@pytest.mark.connect
def test_timestamp_basic(cur, timestamp_table):
def setup_test_timestamp(cur, timestamp_table):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as for dates, at this point this function is not about setup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed

setup_test_date(cur, date_table)

@pytest.mark.connect
def test_date_no_string_conv(cur_no_string_conv, date_table):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add similar test for decimals?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

@pytest.mark.connect
def test_utf8_strings_no_string_conv(cur_no_string_conv):
cur = cur_no_string_conv
"""Use STRING/VARCHAR/CHAR values with multi byte unicode code points in a query."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that this complex unicode handling is needed for the "no conversion" case. The main point should be to check that with unicode conversion the type returned is unicode, while without it it is bytes (in Python 3)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it

@@ -49,6 +51,47 @@ def test_cursor_description_precision_scale(cur, decimal_table):
for (exp, obs) in zip(expected, observed):
assert exp == obs


@fixture(scope='module')
def decimal_table2(cur):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was the existing decimal_table not enough for the tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it more consistent with other tests this way, but it's really minor to change, if you insist.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

assert result.decode("UTF-8", "replace") == u"�"

if sys.version_info[0] < 3:
assert result[0] == u"Test string"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this fail if 'u' was not added? Python does automatic conversion of strings in many cases. We should check the exact type with isinstance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

"""Read back a few decimal values in a wide range."""
cur.execute('select val from {0} order by val'.format(decimal_table))
results = cur.fetchall()
assert results == [(Decimal('-999999999.999999999'),),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice to have: it would increase the coverage of the tests if there would be also NULL values in these test tables

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, added

result = cur.fetchone()

if sys.version_info[0] < 3:
assert not isinstance(result[0], unicode)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isinstance with str would be clearer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -49,6 +51,47 @@ def test_cursor_description_precision_scale(cur, decimal_table):
for (exp, obs) in zip(expected, observed):
assert exp == obs


@fixture(scope='module')
def decimal_table2(cur):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

@csringhofer csringhofer merged commit 6e2ffa8 into cloudera:master Aug 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants