implement where api #298

Open
wants to merge 15 commits into
base: master

Liyuan-Chen-1024
Contributor

fix #224

TODO:
(1) Accept the types Field, np.ndarray, list, and tuple for the where function and methods
(2) where always returns a MemField; when inplace is set, return self after changing its internals to match the result type if necessary, otherwise return a fresh field
(3) Categorical fields are typed down to integers; timestamp fields are treated as float64 fields
To check for multiple types at once: isinstance(cond, (list, tuple, np.ndarray))
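
A minimal sketch of how the array-like condition types from the TODO might be normalized into a boolean mask. The helper name `normalize_cond` is hypothetical and not part of ExeTera; `Field` handling is left out so the snippet runs with NumPy alone.

```python
import numpy as np

def normalize_cond(cond):
    """Return cond as a boolean ndarray (hypothetical helper, not ExeTera API)."""
    # A single isinstance call can check several types at once.
    if isinstance(cond, (list, tuple, np.ndarray)):
        return np.asarray(cond).astype(bool)
    raise TypeError(f"unsupported cond type: {type(cond).__name__}")

# Non-boolean values are interpreted by truthiness: 0 is False, anything else True.
mask = normalize_cond([1, 0, 2])
result = np.where(mask, np.array([10, 20, 30]), np.array([-1, -2, -3]))
```

Converting to bool up front is one way to accept "any sort of ndarray" rather than only boolean arrays.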

@codecov-commenter

codecov-commenter commented May 17, 2022

Codecov Report

Merging #298 (6da082d) into master (1267885) will increase coverage by 0.01%.
The diff coverage is 75.00%.

@@            Coverage Diff             @@
##           master     #298      +/-   ##
==========================================
+ Coverage   83.24%   83.26%   +0.01%     
==========================================
  Files          22       22              
  Lines        6149     6287     +138     
  Branches     1247     1273      +26     
==========================================
+ Hits         5119     5235     +116     
- Misses        734      749      +15     
- Partials      296      303       +7     
| Impacted Files | Coverage Δ |
| --- | --- |
| exetera/core/abstract_types.py | 63.35% <50.00%> (-0.10%) ⬇️ |
| exetera/core/fields.py | 90.08% <73.56%> (-0.57%) ⬇️ |
| exetera/core/operations.py | 87.00% <100.00%> (+0.05%) ⬆️ |


@Liyuan-Chen-1024
Contributor Author

Liyuan-Chen-1024 commented May 17, 2022

Here is the matrix that represents the resulting field type of where for different pairs of input fields. Each (column, row) pair represents (a, b) in where(cond, a, b).

| b \ a | NumericField | CategoricalField | IndexedStringField | FixedStringField |
| --- | --- | --- | --- | --- |
| NumericField | NumericMemF | NumericMemF | IndexedStringMemF | IndexedStringMemF |
| CategoricalField | NumericMemF | NumericMemF | IndexedStringMemF | IndexedStringMemF |
| IndexedStringField | IndexedStringMemF | IndexedStringMemF | IndexedStringMemF | IndexedStringMemF |
| FixedStringField | IndexedStringMemF | IndexedStringMemF | IndexedStringMemF | FixedStringMemF |

@ericspod
Member

The first part of the TODO was to accept list, tuple, and any sort of ndarray, not just bool arrays. Can we make that change?

        raise NotImplementedError("Where does not support condition on indexed string fields at present")
    cond = cond.data[:]
elif callable(cond):
    raise NotImplementedError("module method `fields.where` doesn't support callable cond, please use instance mehthod `where` for callable cond.")
Member

typo: mehthod -> method

    a = a.data[:]
if isinstance(b, Field):
    b = b.data[:]
return np.where(cond, a, b)
Member

This is still returning a numpy array rather than a field

Contributor Author

> This is still returning a numpy array rather than a field

The logic of the module-level where API will be almost the same as that of the instance-level where API. I think we can focus on one first, e.g. the instance-level where API.

@@ -143,6 +161,41 @@ def _ensure_valid(self):
        if not self._valid_reference:
            raise ValueError("This field no longer refers to a valid underlying field object")

    def where(self, cond:Union[list, tuple, np.ndarray, Field], b, inplace=False):
Member

Please add the callable signature to cond's type information
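
One possible shape for the requested annotation, as a sketch only. The alias name `CondType` and the helper `apply_cond` are hypothetical, and `Field` is omitted so the snippet is self-contained; the callable is assumed to take the field's data array and return a mask.

```python
from typing import Callable, Union

import numpy as np

# Hypothetical alias; the real signature would also include Field.
CondType = Union[list, tuple, np.ndarray, Callable[[np.ndarray], np.ndarray]]

def apply_cond(data: np.ndarray, cond: CondType) -> np.ndarray:
    """Resolve cond to a boolean mask over data."""
    if callable(cond):
        # A callable receives the data and returns an array-like mask.
        cond = cond(data)
    return np.asarray(cond, dtype=bool)

mask = apply_cond(np.array([1, 6, 9]), lambda d: d > 5)
```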

    result_mem_field.data.write(result_ndarray)

elif isinstance(self, (IndexedStringField, FixedStringField)) or isinstance(b, (IndexedStringField, FixedStringField)):
    result_mem_field = IndexedStringMemField(self._session)
Member

This doesn't seem right. Why are we causing an operation with fixed string field to output an indexed string field?
It doesn't make the logic much more complicated. Also, I would make that a separate method probably, because I can imagine us needing it elsewhere in the future.

Contributor Author

For FixedStringField, you can refer to the matrix I listed above. A FixedStringField is produced only when both inputs are FixedStringFields; otherwise the result is an IndexedStringField.

@atbenmurray
Member

@atbenmurray atbenmurray reopened this May 30, 2022
@atbenmurray
Member

Sorry, accidental close

@atbenmurray
Member

atbenmurray commented May 30, 2022

Fixed string type promotion should not result in indexed strings, I think. Here is a revised version of the table below:

@atbenmurray
Member

| a | b | result | notes |
| --- | --- | --- | --- |
| numeric | numeric | numeric | |
| categorical | numeric | numeric | |
| categorical | categorical | numeric | we could support categorical of the same dictionary if the categorical types are identical |
| fixed string | numeric | fixed string | type is 'S' where max is max of longest numeric representation and fixed string length |
| fixed string | categorical | fixed string | see above (and treat categorical like numeric) |
| fixed string | fixed string | fixed string | longest fixed string |
| indexed string | numeric | indexed string | |
| indexed string | categorical | indexed string | |
| indexed string | fixed string | indexed string | |
| indexed string | indexed string | indexed string | |
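
The revised promotion rules could be sketched roughly as follows. The function names and the string labels for field kinds are hypothetical, and using `str()` for the "longest numeric representation" is an assumption; this is not the code in the PR.

```python
import numpy as np

# Kinds ranked from weakest to strongest, following the revised table:
# categorical is treated like numeric, and strings win over numbers.
_RANK = {"categorical": 0, "numeric": 0, "fixed string": 1, "indexed string": 2}

def result_kind(a_kind, b_kind):
    """Return the result kind for where(cond, a, b) (illustrative only)."""
    winner = max(a_kind, b_kind, key=_RANK.__getitem__)
    return "numeric" if _RANK[winner] == 0 else winner

def fixed_string_length(fixed_len, numeric_values):
    """Max of the fixed string length and the longest numeric representation."""
    # Assumption: str() gives the representation that ends up in the string.
    longest_numeric = max(len(str(v)) for v in np.asarray(numeric_values).tolist())
    return max(fixed_len, longest_numeric)

length = fixed_string_length(3, [7, -12345])  # len("-12345") == 6
a = np.array([b"abc", b"de"], dtype="S3").astype(f"S{length}")
b = np.array([7, -12345]).astype(f"S{length}")  # integers cast to byte strings
merged = np.where([True, False], a, b)
```

Casting both operands to the same `S` dtype before calling `np.where` keeps the result a fixed string instead of relying on NumPy's own number/string promotion.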


result_mem_field = None

if isinstance(self, IndexedStringField) and isinstance(b, IndexedStringField):
Member

When doing the type checking, we need to check that it's one of two types: isinstance(self, (IndexedStringField, IndexedStringMemField)).
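
A small sketch of the check described above. The classes here are empty stand-ins declared only so the snippet runs; the real ExeTera class hierarchy may differ, and the tuple name `INDEXED_STRING_TYPES` is hypothetical.

```python
class Field: ...
class IndexedStringField(Field): ...     # stand-in for the HDF5-backed type
class IndexedStringMemField(Field): ...  # stand-in for the in-memory type
class NumericField(Field): ...

# Grouping both variants in one tuple keeps every check consistent.
INDEXED_STRING_TYPES = (IndexedStringField, IndexedStringMemField)

def is_indexed_string(field):
    """True for either the HDF5-backed or the in-memory indexed string type."""
    return isinstance(field, INDEXED_STRING_TYPES)
```

Defining the tuple once avoids the situation where one branch checks only the HDF5-backed type and silently misses mem fields.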

    cond = cond(self.data[:])
else:
    raise TypeError("'cond' parameter needs to be either callable lambda function, or array like, or NumericMemField")

Member

Here we could just do return where(cond, self, b) and then the rest of the body of this method can be put into the global where function.
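
The delegation suggested here might look like the sketch below. `ToyField` is a hypothetical stand-in that only wraps an array, and the module-level function returns a plain ndarray for brevity, whereas the PR's where is meant to return a MemField.

```python
import numpy as np

def where(cond, a, b):
    """Module-level function holding the shared logic (simplified stand-in)."""
    if callable(cond):
        raise NotImplementedError("use the instance method for a callable cond")
    return np.where(np.asarray(cond, dtype=bool), a.data, b.data)

class ToyField:
    """Toy stand-in for a field; it only wraps an ndarray."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def where(self, cond, b):
        # Only the callable case is resolved here; everything else is delegated.
        if callable(cond):
            cond = cond(self.data)
        return where(cond, self, b)

out = ToyField([1, 6, 9]).where(lambda d: d > 5, ToyField([0, 0, 0]))
```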

other_field_row_count = len(other_field.data[:])
data_converted_to_str = np.where([True]*other_field_row_count, other_field.data[:], [""]*other_field_row_count)
maxLength = 0
re_match = re.findall(r"<U(\d+)|S(\d+)", str(data_converted_to_str.dtype))
Member

U can be <U or >U
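
One way to sidestep the byte-order prefix entirely is to read the length from the dtype instead of parsing its string form. This is a sketch of an alternative, not the approach taken in the PR, and the function name is hypothetical.

```python
import numpy as np

def string_dtype_length(dtype):
    """Character length of a NumPy 'U' or 'S' dtype, independent of byte order."""
    dtype = np.dtype(dtype)
    if dtype.kind == "U":
        return dtype.itemsize // 4  # NumPy stores 4 bytes per unicode character
    if dtype.kind == "S":
        return dtype.itemsize       # one byte per character
    raise ValueError(f"not a fixed-width string dtype: {dtype}")
```

Because `dtype.kind` ignores the `<`, `>`, and `|` prefixes, no regular expression is needed.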

@@ -2169,6 +2169,167 @@ def test_indexed_string_isin(self, data, isin_data, expected):
np.testing.assert_array_equal(expected, result)


WHERE_BOOLEAN_COND = RAND_STATE.randint(0, 2, 20).tolist()
Member

Are we missing tests for when cond is a field?

Contributor Author

Yes, a unit test for the case where cond is a field is currently missing. I'm adding one.
For an IndexedStringField, we will raise an exception.
How should we deal with a FixedStringField? We can't use a string as a boolean value directly, so which values should be considered True for a FixedStringField, and which False?

WHERE_FIXED_STRING_TESTS = [
    (lambda f: f > 5, "create_numeric", {"nformat": "int8"}, WHERE_NUMERIC_FIELD_DATA, "create_fixed_string", {"length": 3}, WHERE_FIXED_STRING_FIELD_DATA),
    (lambda f: f > 2, "create_categorical", {"nformat": "int32", "key": {"a": 1, "b": 2, "c": 3}}, WHERE_CATEGORICAL_FIELD_DATA, "create_fixed_string", {"length": 3}, WHERE_FIXED_STRING_FIELD_DATA),
    (WHERE_BOOLEAN_COND, "create_fixed_string", {"length": 3}, WHERE_FIXED_STRING_FIELD_DATA, "create_categorical", {"nformat": "int32", "key": {"a": 1, "b": 2, "c": 3}}, WHERE_CATEGORICAL_FIELD_DATA),
Member

2300: can we also do this for float32?

np.testing.assert_array_equal(result.data[:], expected_result)

# reload to test FixedStringMemField
a_mem_field, b_mem_field = a_field, b_field
Member

Move this to before the first subtest


expected_result = where_oracle(cond, a_field_data, b_field_data)

with self.subTest(f"Test instance where method: a is {type(a_field)}, b is {type(b_field)}"):
Member

Move this to after the mem fields are created


# reload to test FixedStringMemField
a_mem_field, b_mem_field = a_field, b_field
if isinstance(a_field, fields.FixedStringField):
Member

condition can be removed

    a_mem_field = fields.FixedStringMemField(self.s, a_kwarg["length"])
    a_mem_field.data.write(np.array(a_field_data))

if isinstance(b_field, fields.FixedStringField):
Member

condition can be removed

    b_mem_field = fields.FixedStringMemField(self.s, b_kwarg["length"])
    b_mem_field.data.write(np.array(b_field_data))

with self.subTest(f"Test instance where method: a is {type(a_mem_field)}, b is {type(b_mem_field)}"):
Member

Do all four combinations:
a_field, b_field
a_field, b_mem_field
a_mem_field, b_field
a_mem_field, b_mem_field
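
The four combinations could be generated rather than written out by hand, for example with `itertools.product`. In this sketch the field objects are replaced by placeholder names, and the class and constant names are hypothetical.

```python
import itertools
import unittest

# Every pairing of an "a" variant with a "b" variant, in the order listed above.
COMBINATIONS = list(itertools.product(("a_field", "a_mem_field"),
                                      ("b_field", "b_mem_field")))

class WhereCombinationTest(unittest.TestCase):
    def test_all_combinations(self):
        seen = []
        for a_name, b_name in COMBINATIONS:
            with self.subTest(a=a_name, b=b_name):
                # The real test would call a.where(cond, b) here
                # and compare the result with the oracle.
                seen.append((a_name, b_name))
        self.assertEqual(len(seen), 4)
```

Each `subTest` reports its own failure, so one failing pairing does not hide the others.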



@parameterized.expand(WHERE_INDEXED_STRING_TESTS)
def test_instance_field_where_return_indexed_string_mem_field(self, cond, a_creator, a_kwarg, a_field_data, b_creator, b_kwarg, b_field_data):
Member

Same here with combinations of hdf5 and mem fields

if isinstance(cond, (list, tuple, np.ndarray)):
    cond = cond
elif isinstance(cond, Field):
    if isinstance(cond, (NumericField, CategoricalField)):
Member

still not checking for both hdf5 and mem field types

if isinstance(cond, (list, tuple, np.ndarray)):
    cond = cond
elif isinstance(cond, Field):
    if isinstance(cond, (NumericField, CategoricalField)):
Member

still not checking both hdf5 and mem field types

if isinstance(cond, (list, tuple, np.ndarray)):
    cond = cond
elif isinstance(cond, Field):
    if isinstance(cond, (NumericField, CategoricalField)):
Member

still not checking hdf5 and mem field types

Member

@atbenmurray atbenmurray left a comment

If these exception handling messages are fixed, I think we are good to go

    else:
        raise NotImplementedError("Where only support condition on numeric field and categorical field at present.")
elif callable(cond):
    raise NotImplementedError("module method `fields.where` doesn't support callable cond, please use instance mehthod `where` for callable cond.")
Member

Typo, please replace with:

"module method fields.where doesn't support callable cond parameter, please use the instance method where if you need to use a callable cond parameter"

if isinstance(cond, (NumericField, NumericMemField, CategoricalField, CategoricalMemField)):
    cond = cond.data[:]
else:
    raise NotImplementedError("Where only support condition on numeric field and categorical field at present.")
Member

Typo, please replace with:

"where only supports python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."

if l:
    maxLength = int(l)
else:
    raise ValueError("The return dtype of instance method `where` doesn't match '<U(\d+)' or 'S(\d+)' when one of the field is FixedStringField")
Member

Typo, please replace with:

"The return dtype of instance method where doesn't match '<U(\d+)' or 'S(\d+)' when one of the fields is a fixed string field"

if isinstance(cond, (NumericField, NumericMemField, CategoricalField, CategoricalMemField)):
    cond = cond.data[:]
else:
    raise NotImplementedError("Where only support condition on numeric field and categorical field at present.")
Member

Typo, please replace with:

"where only supports python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."

if isinstance(cond, (NumericField, NumericMemField, CategoricalField, CategoricalMemField)):
    cond = cond.data[:]
else:
    raise NotImplementedError("Where only support condition on numeric field and categorical field at present.")
Member

Typo, please replace with:

"where only supports callables, python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."

elif callable(cond):
    cond = cond(self.data[:])
else:
    raise TypeError("'cond' parameter needs to be either callable lambda function, or array like, or NumericMemField.")
Member

Typo, please replace with:

"where only supports callables, python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."

if isinstance(cond, (NumericField, NumericMemField, CategoricalField, CategoricalMemField)):
    cond = cond.data[:]
else:
    raise NotImplementedError("Where only support condition on numeric field and categorical field at present.")
Member

Typo, please replace with:

"where only supports callables, python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."

elif callable(cond):
    cond = cond(self.data[:])
else:
    raise TypeError("'cond' parameter needs to be either callable lambda function, or array like, or NumericMemField.")
Member

Typo, please replace with:

"where only supports callables, python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."

Successfully merging this pull request may close these issues.

field.where functionality
4 participants