-
Notifications
You must be signed in to change notification settings - Fork 248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow skipping utf8 converison in Python3 #548
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good in general, left a few comments.
It would be also nice to have some tests about getting the expected Python types for the different values of this parameter. There are already some similar tests in test_data_types.py
impala/hiveserver2.py
Outdated
@@ -1040,7 +1049,7 @@ def __init__(self, trowset, expect_more_rows, schema, convert_types=True): | |||
|
|||
# STRING columns are read as binary and decoded here to be able to handle | |||
# non-valid utf-8 strings in Python 3. | |||
if six.PY3: | |||
if six.PY3 and convert_strings_to_unicode: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The case when convert_strings_to_unicode is False but convert_types is True also needs to be consider - the code in _convert_values probably expects Unicode when converting DATE/TIMESTAMP/DECIMAL types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, please check out new design.
impala/hiveserver2.py
Outdated
@@ -96,6 +97,9 @@ def cursor(self, user=None, configuration=None, convert_types=True, | |||
When `False`, timestamps and decimal values will not be converted | |||
to Python `datetime` and `Decimal` values. (These conversions are | |||
expensive.) Only applies when using HS2 protocol versions > 6. | |||
convert_strings_to_unicode : bool, optional | |||
When `False`, string values will not be converted to unicode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return type could be mentioned both for the True/False case.
- It should be mentioned that this has effect only in Python3.
impala/hiveserver2.py
Outdated
@@ -96,6 +97,12 @@ def cursor(self, user=None, configuration=None, convert_types=True, | |||
When `False`, timestamps and decimal values will not be converted | |||
to Python `datetime` and `Decimal` values. (These conversions are | |||
expensive.) Only applies when using HS2 protocol versions > 6. | |||
convert_strings_to_unicode : bool, optional | |||
When `True`, the following types will be converted to unicode: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be mentioned that these are the types that are transmitted as strings in HS2 protocol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
impala/hiveserver2.py
Outdated
self._convert_strings_to_unicode(type_, is_null, values) | ||
if convert_strings_to_unicode: | ||
self._convert_strings_to_unicode(type_, is_null, values, | ||
types=["STRING", "LIST", "MAP", "STRUCT", "UNIONTYPE", "NULL", "VARCHAR", "CHAR", "TIMESTAMP", "DECIMAL", "DATE"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These type lists could be moved to a constant,e.g HS2_STRING_TYPES and CONVERTED_TYPES
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely, fixed
|
||
@fixture(scope='session') | ||
def cur_noconv(con): | ||
cur = con.cursor(convert_types=True, convert_strings_to_unicode=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here and in test_data_types.py: is there a reason for removing the final new lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's by mistake, fixed
impala/tests/test_data_types.py
Outdated
setup_test_date_basic(cur, date_table) | ||
|
||
@pytest.mark.connect | ||
def test_date_basic_noconv(cur_noconv, date_table): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add tests more for types - date is a special case where string conversion is done regardless of convert_strings_to_unicode
impala/tests/conftest.py
Outdated
@@ -113,3 +113,9 @@ def cur(con): | |||
cur = con.cursor() | |||
yield cur | |||
cur.close() | |||
|
|||
@fixture(scope='session') | |||
def cur_noconv(con): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
convert_types is True here, so I think that cur_no_string_conversion or something similar would be better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree.
impala/tests/test_data_types.py
Outdated
|
||
@pytest.mark.connect | ||
def test_date_basic(cur, date_table): | ||
def setup_test_date(cur, date_table): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming doesn't make sense to me - this function does a select and checks the result, so it is not a setup function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed
impala/tests/test_data_types.py
Outdated
try: | ||
yield table_name | ||
finally: | ||
cur.execute("DROP TABLE {0}".format(table_name)) | ||
|
||
|
||
@pytest.mark.connect | ||
def test_timestamp_basic(cur, timestamp_table): | ||
def setup_test_timestamp(cur, timestamp_table): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as for dates, at this point this function is not about setup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed
setup_test_date(cur, date_table) | ||
|
||
@pytest.mark.connect | ||
def test_date_no_string_conv(cur_no_string_conv, date_table): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add similar test for decimals?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
impala/tests/test_data_types.py
Outdated
@pytest.mark.connect | ||
def test_utf8_strings_no_string_conv(cur_no_string_conv): | ||
cur = cur_no_string_conv | ||
"""Use STRING/VARCHAR/CHAR values with multi byte unicode code points in a query.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that this complex unicode handling is needed for the "no conversion" case. The main point should be to check that with unicode conversion the type returned is unicode, while without it it is bytes (in Python 3)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it
@@ -49,6 +51,47 @@ def test_cursor_description_precision_scale(cur, decimal_table): | |||
for (exp, obs) in zip(expected, observed): | |||
assert exp == obs | |||
|
|||
|
|||
@fixture(scope='module') | |||
def decimal_table2(cur): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was the existing decimal_table not enough for the tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found it more consistent with other tests this way, but it's really minor to change, if you insist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
impala/tests/test_data_types.py
Outdated
assert result.decode("UTF-8", "replace") == u"�" | ||
|
||
if sys.version_info[0] < 3: | ||
assert result[0] == u"Test string" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this fail if 'u' was not added? Python does automatic conversion of strings in many cases. We should check the exact type with isinstance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
"""Read back a few decimal values in a wide range.""" | ||
cur.execute('select val from {0} order by val'.format(decimal_table)) | ||
results = cur.fetchall() | ||
assert results == [(Decimal('-999999999.999999999'),), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice to have: it would increase the coverage of the tests if there would be also NULL values in these test tables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, added
impala/tests/test_data_types.py
Outdated
result = cur.fetchone() | ||
|
||
if sys.version_info[0] < 3: | ||
assert not isinstance(result[0], unicode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isinstance with str would be clearer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -49,6 +51,47 @@ def test_cursor_description_precision_scale(cur, decimal_table): | |||
for (exp, obs) in zip(expected, observed): | |||
assert exp == obs | |||
|
|||
|
|||
@fixture(scope='module') | |||
def decimal_table2(cur): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
#543