Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX-#431: Moving and adding sampling to backend calculations #438

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

westernguy2
Copy link
Contributor

Overview

This is a branch that builds off of the work done in #432. This moves the sampling to after the Filter and also adds sampling for metadata computation.

Changes

Changes the execute function to move sampling after the Filtering is done, and so the sampling is done on each of the data visualizations. It also edits compute_data to sample the data before computing the metadata. All metadata is the metadata associated with the sample, not the full dataset.

Example Output

N/A

lux/executor/PandasExecutor.py Outdated Show resolved Hide resolved
@@ -443,16 +436,16 @@ def compute_data_type(self, ldf: LuxDataFrame):
ldf._data_type[attr] = "geographical"
elif pd.api.types.is_float_dtype(ldf.dtypes[attr]):

if ldf.cardinality[attr] != len(ldf) and (ldf.cardinality[attr] < 20):
if ldf.cardinality[attr] != ldf._length and (ldf.cardinality[attr] < 20):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between _length and len(df)? It is probably more general to use the latter since the _length might not be maintained correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed _length in the metadata for a LuxDataFrame, and found that it was not being used anywhere in the code base (as far as I could tell). On Line 544 of this file, I changed it to be the length of the sampled DataFrame. This is necessary since we don't save the sampled DataFrame after the metadata is computed, but the length of the sampled DataFrame is necessary for future calculations, especially ones related to cardinality, like the one here.

The name of the attribute is probably not the best, so I could maybe change it to _sampled_length instead?

@@ -538,11 +531,17 @@ def _is_datetime_number(series):
return False

def compute_stats(self, ldf: LuxDataFrame):
# use sample to compute statistics
if ldf._sampled is None:
ldf_sampled = PandasExecutor.execute_sampling(ldf)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the config parameters that we are using for sampling for metadata and the visualization be the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, currently they are the same (sampling_thresh). Should we maybe use different parameters?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants