Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

subdtype #247

Open
wants to merge 23 commits into
base: master
Choose a base branch
from
Open

subdtype #247

wants to merge 23 commits into from

Conversation

andrewgsavage
Copy link
Collaborator

@andrewgsavage andrewgsavage commented Aug 4, 2024

@andrewgsavage andrewgsavage marked this pull request as draft August 4, 2024 10:47
@andrewgsavage andrewgsavage marked this pull request as ready for review August 5, 2024 09:51
@andrewgsavage
Copy link
Collaborator Author

This is ready for looking at now.
@MichaelTiemannOSC I think this is necessary to close the issues listed at the top. There aren't as many changes to the code as I'd expected. Also most initialisation paths are unaffected (only change is when constructing through Series or DataFrame, the subdtype needs to be specified or it'll default to Float64) see https://github.com/hgrecco/pint-pandas/pull/247/files#diff-ec91e8b7897fd22a3f890861665f6e7035e8a8a2dadd216dfeb998fa85290249

@andrewgsavage andrewgsavage changed the title [WIP] subdtype subdtype Aug 5, 2024
if isinstance(values, _Quantity):
values = values.to(dtype.units).magnitude
elif isinstance(values, PintArray):
values = values._data
if isinstance(values, np.ndarray):
dtype = values.dtype
if dtype in ddtypemap:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned after the fact with respect to #137, this particular type of dtype handling caused me some grief. Now that I've sorted out what sort of grief it caused me, I'll be ready to review these changes more properly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it may be possible to remove ddtypemap with this change. the only other place it's used is in to_numpy which could defer to the underlying ExtensionArray

@MichaelTiemannOSC
Copy link
Collaborator

There are two ways that these changes mess with my world. The first is when I craft an ndarray of things that contain np.nan values and they get converted to NA values (because Float64). The second is when I try to construct arrays with uncertainty values (uncertainty Variables cannot be converted to Float64).

In both cases, I run into this when I use astype("pint[units]") and I need to use either astype("pint[units][float64]") or astype("pint[units][object]") when dealing with pure floats or uncertainties, respectively.

@andrewgsavage
Copy link
Collaborator Author

note you can change the default from Float64 to float64 or any other dtype pandas can parse by changing `pint_pandas.pint_array.DEFAULT_SUBDTYPE = "Float64"

pandas devs were surprised we are using float64 and not Float64- which is because it hasn't been changed since early on!

I think those changes to construction are good as they'll make it more unamibgious what the underlying data is.

The astype point is something I hadn't considered. That's more verbose and means code must change depending on the underlying dtype. I think DataFrame.pint.to("units") could be added as an alternative (and is 5 fewer characters, a double win!) That's been asked for elsewhere. I'll do that as a seperate PR.

@andrewgsavage
Copy link
Collaborator Author

There's a few things on the testing side to change/look at:

  • check tests run with other dtypes aren't getting converted to floats
  • add tests mixing dtypes

@MichaelTiemannOSC
Copy link
Collaborator

note you can change the default from Float64 to float64 or any other dtype pandas can parse by changing `pint_pandas.pint_array.DEFAULT_SUBDTYPE = "Float64"

I'm an age-old C programmer, and this is how I think. I but I know it's become more fashionable to use setters and getters to affect what might be a private variable. But yes, I can change the default to "float64" for my world, and then I can simplify my try statements to use the default and fallback to [object] when using uncertainties are present.

@@ -859,10 +890,17 @@ def _create_comparison_method(cls, op):
return cls._create_method(op, coerce_to_dtype=False)

@classmethod
def from_1darray_quantity(cls, quantity):
def from_1darray_quantity(cls, quantity, subdtype=None):
Copy link
Collaborator

@MichaelTiemannOSC MichaelTiemannOSC Aug 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The addition of subdtype here is noted, but none of the callers of from_1darray_quantity make use of it. When I have two PintArrays (of 'EUR') for example, and I construct them as pint[EUR][float64] and then divide them to get a dimensionless ratio, the [float64] nature goes away and I get back a pint[dimensionless][Float64], which I don't want. The question is: should we fix all the callers of this function to pass the subdtype of the quantity as the subdtype value, and should that override the call on line 900 to pass the incoming subdtype of the pd.array call? Or should we fix this function to preserve the subdtype of the quantity.magnitude if the incoming subdtype is None and pass that on line 900 rather than rederiving it on line 901?

@MichaelTiemannOSC
Copy link
Collaborator

The latest changes still cause problems. At line 875 (function _binop, with coerce_to_dtype being True), we call cls.from_1darray_quantity(res). When res is a ndarray of float64, this takes us to line 900, which the call to pd.array returns as Float64 (pandas's preferred subdtype) rather than float64 (my program's preferred subdtype). A hack would be to test the special case in which we have an ndarray of float64 and if that is the preferred subdtype, to preserve that. Otherwise, let pd.array do what it does.

@andrewgsavage
Copy link
Collaborator Author

Ah I meant to go back and check that line. Does it behave correctly for you now?

@@ -872,7 +872,7 @@ def convert_values(param):

if coerce_to_dtype:
try:
res = cls.from_1darray_quantity(res)
res = cls.from_1darray_quantity(res, self.dtype.subdtype)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing self.dtype.subdtype here doesn't fix the problem that at line 901 pd.array can convert a float64 array into a Float64 array and line 902 will then take the subdtype from that, ignoring the subdtype passed in.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you give a worked example? sounds like you've got a Quantity( NumpyExtensionArray([1.0, ...]) and I'm not sure how that's happened

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a worked example:

import pandas as pd
import numpy as np
import pint_pandas

xx = pd.Series([1.0, 2.0, np.nan], dtype="pint[km][float64]")
yy = pd.Series([2.0, 3.0, 4.0], dtype="pint[km][float64]")

print(xx.fillna(123.456))

print(xx.fillna(xx + yy))

The conversion of xx+yy from pint[km][float64] to pint[km][Float64] makes it unsuitable as a further source to a fillna operation.

@@ -864,15 +864,22 @@ def convert_values(param):
# a TypeError should be raised
res = op(lvalues, rvalues)

subdtype = self.dtype.subdtype
if "truediv" in op.__name__ and subdtype.kind == "i":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case self can be:

<PintArray>
[1581.2004963851048, 2322.7768972177428, 2065.2759837278822,
 2108.0321074762774,  1922.714515192993]
Length: 5, dtype: pint[USD * percent][float64]

self.dtype is pint[USD * percent][float64] and self.dtype.subdtype is 'float64'. But the string 'float64' does not have a kind attribute.

@scanzy
Copy link

scanzy commented Sep 16, 2024

Hey thank you for all your work on #247 !

So Float64 will be the new default sub-dtype?
However I find it really useful to have the ability to specify "pint[MW][Float64]" or "pint[MW][float64]" :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Values dtype
3 participants