DataFrame.reset_index puts units to dataframe index cells #231

Open
szaiserb opened this issue May 17, 2024 · 4 comments

szaiserb commented May 17, 2024

Bug description

DataFrame.set_index puts units into the cells of the dataframe index. This surprised me, and I currently have to work around it. For the actual data cells of a dataframe, this behavior is clearly not intended (quote from the docs):

If you ever see units in the cells of the DataFrame, something isn’t right.

Minimum example

import pandas as pd
import pint_pandas

# Use the unit registry that pint_pandas itself is configured with.
ureg = pint_pandas.PintType.ureg

df = pd.DataFrame({'a': [1.0, 2.0]})
df['a'] = df['a'].astype(pint_pandas.PintType(ureg.second))
print(df['a'])
print(df.set_index('a').index)

pint_pandas.show_versions()

Output:

0    1.0
1    2.0
Name: a, dtype: pint[second]
Index([1.0 second, 2.0 second], dtype='pint[second]', name='a')

{'numpy': '1.26.4', 'pandas': '2.2.1', 'pint': '0.23', 'pint_pandas': '0.5'}

mflova commented May 23, 2024

Why isn't this supposed to be the desired behaviour? This is the way pandas works. When you perform set_index on a column, not only the values are used as the index but also its dtype.

@andrewgsavage
Collaborator

Seeing the units in the cells means the data is stored as an array of quantities inside the PintArray, as opposed to an array of units or floats.

It looks like one of the PintArray init paths doesn't behave as expected.

Author

szaiserb commented May 23, 2024

When you perform set_index over a column, not only the values are used as index but also its dtype

Using the column dtype for the index on .set_index() is perfect. However, my expectation is to have type(df.index[0]) = float and df.index.dtype = pint[<unit>]. Then df.index behaves largely like df[<column_name>]. Having type(df.index[0]) = pint[<unit>] would only be required for a mixed-type index (which I do not see any use case for).

@andrewgsavage
Collaborator

Looks like it is a bug in pandas: Index doesn't use the formatting function of the data's dtype.
https://github.com/pandas-dev/pandas/blob/3b48b17e52f3f3837b9ba8551c932f44633b5ff8/pandas/core/indexes/base.py#L1411

This is as expected:

df = df.set_index('a',drop=False)
i = df.index
i.values

<PintArray>
[1.0, 2.0]
Length: 2, dtype: pint[second]
