FIX-#431: Moving and adding sampling to backend calculations #438

westernguy2 · 2021-12-01T17:05:19Z

Overview

This branch builds on the work done in #432. It moves sampling to after the filter step and adds sampling to the metadata computation.

Changes

Changes the execute function so that sampling happens after filtering, which means the data for each visualization is sampled separately. Also edits compute_data to sample the data before computing the metadata, so all metadata describes the sample, not the full dataset.
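To illustrate why the order matters, here is a minimal pandas sketch. It is not the Lux implementation: `SAMPLING_THRESH`, `sample_if_large`, `filter_then_sample`, and `sample_then_filter` are hypothetical names made up for this example, and the threshold value is arbitrary.

```python
import pandas as pd

# Hypothetical threshold, standing in for the sampling config discussed in this PR.
SAMPLING_THRESH = 1000


def sample_if_large(df: pd.DataFrame, thresh: int = SAMPLING_THRESH, seed: int = 1) -> pd.DataFrame:
    """Return df unchanged when it is small, else a random sample of `thresh` rows."""
    if len(df) > thresh:
        return df.sample(n=thresh, random_state=seed)
    return df


def filter_then_sample(df: pd.DataFrame, column: str, value, thresh: int = SAMPLING_THRESH) -> pd.DataFrame:
    """Order proposed in this PR: apply the filter first, then sample what is left."""
    filtered = df[df[column] == value]
    return sample_if_large(filtered, thresh)


def sample_then_filter(df: pd.DataFrame, column: str, value, thresh: int = SAMPLING_THRESH) -> pd.DataFrame:
    """Previous order: sample the full frame, then filter the sample."""
    sampled = sample_if_large(df, thresh)
    return sampled[sampled[column] == value]


# A frame where one category is rare: 50 "rare" rows out of 10,000.
df = pd.DataFrame({"group": ["rare"] * 50 + ["common"] * 9950, "x": range(10000)})

after = filter_then_sample(df, "group", "rare")
before = sample_then_filter(df, "group", "rare")

print(len(after))   # 50: the filtered subset is below the threshold, so no rows are dropped
print(len(before))  # at most 50: only the rare rows that happened to land in the sample
```

With filtering first, a small filtered subset is never thinned out by a sample that was drawn from the much larger unfiltered frame.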

Example Output

N/A

Signed-off-by: Kunal Agarwal <[email protected]>

dorisjlee · 2021-12-01T22:36:22Z

lux/executor/PandasExecutor.py

@@ -443,16 +436,16 @@ def compute_data_type(self, ldf: LuxDataFrame):
                    ldf._data_type[attr] = "geographical"
                elif pd.api.types.is_float_dtype(ldf.dtypes[attr]):

-                    if ldf.cardinality[attr] != len(ldf) and (ldf.cardinality[attr] < 20):
+                    if ldf.cardinality[attr] != ldf._length and (ldf.cardinality[attr] < 20):


What is the difference between _length and len(df)? It is probably more general to use the latter, since _length might not be maintained correctly.

westernguy2 · 2021-12-02

I noticed _length in the metadata for a LuxDataFrame and, as far as I could tell, it was not being used anywhere in the code base. On Line 544 of this file, I changed it to hold the length of the sampled DataFrame. We don't save the sampled DataFrame after the metadata is computed, but later calculations still need its length, especially ones related to cardinality, like the one here.

The name of the attribute is probably not the best, so I could change it to _sampled_length instead?
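The general point can be sketched outside Lux: when cardinality is computed on a sample, any comparison against a row count has to use the sample's row count. The snippet below is an illustration only; `all_unique` and the `record_id` column are hypothetical, and the check is a simplified uniqueness test rather than the exact condition in the diff.

```python
import pandas as pd

# A float column in which every value is distinct.
full = pd.DataFrame({"record_id": [float(i) for i in range(10_000)]})
sample = full.sample(n=1_000, random_state=0)

# Cardinality as it would be computed on the sample, not on the full frame.
cardinality = sample["record_id"].nunique()


def all_unique(cardinality: int, n_rows: int) -> bool:
    """Hypothetical check: does the column have one distinct value per row?"""
    return cardinality == n_rows


print(all_unique(cardinality, len(full)))    # False: sample cardinality vs. full length
print(all_unique(cardinality, len(sample)))  # True: both numbers describe the same rows
```

Storing the sampled length alongside the metadata keeps the two numbers consistent even after the sampled frame itself has been discarded.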

dorisjlee · 2021-12-01T22:37:35Z

lux/executor/PandasExecutor.py

@@ -538,11 +531,17 @@ def _is_datetime_number(series):
        return False

    def compute_stats(self, ldf: LuxDataFrame):
+        # use sample to compute statistics
+        if ldf._sampled is None:
+            ldf_sampled = PandasExecutor.execute_sampling(ldf)


Will sampling for the metadata and sampling for the visualizations use the same config parameters?

westernguy2 · 2021-12-02

Yes, currently they are the same (sampling_thresh). Should we maybe use different parameters?
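If separate parameters were wanted, one possible shape is sketched below. Everything here is an assumption for discussion: `SamplingConfig`, `metadata_sampling_thresh`, `threshold_for`, and the default value are hypothetical and are not part of the Lux config. Only the name `sampling_thresh` comes from this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SamplingConfig:
    """Hypothetical config holder, not the Lux config object."""

    # Shared threshold, as in the current state of this PR (default value is arbitrary).
    sampling_thresh: int = 1_000_000
    # Optional override used only for metadata; None means "use the shared threshold".
    metadata_sampling_thresh: Optional[int] = None

    def threshold_for(self, purpose: str) -> int:
        """Return the row threshold for 'metadata' or 'visualization' sampling."""
        if purpose == "metadata" and self.metadata_sampling_thresh is not None:
            return self.metadata_sampling_thresh
        return self.sampling_thresh


shared = SamplingConfig(sampling_thresh=500_000)
split = SamplingConfig(sampling_thresh=500_000, metadata_sampling_thresh=50_000)

print(shared.threshold_for("metadata"))      # 500000: falls back to the shared value
print(split.threshold_for("metadata"))       # 50000: metadata-specific override
print(split.threshold_for("visualization"))  # 500000
```

The fallback keeps today's behavior (one parameter for both) as the default, so a separate metadata threshold would be opt-in.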

Signed-off-by: Kunal Agarwal <[email protected]>

westernguy2 added 7 commits November 10, 2021 00:31

FIX-lux-org#431: implement sampling threshold and edit tests and docs

f4441c5

Signed-off-by: Kunal Agarwal <[email protected]>

FIX-lux-org#431: small cleanup changes

6c82a88

Signed-off-by: Kunal Agarwal <[email protected]>

FIX-lux-org#431: move sampling to after the filtering

75ab18f

Signed-off-by: Kunal Agarwal <[email protected]>

FIX-lux-org#431: add sampling for metadata statistics computation

6b224a0

Signed-off-by: Kunal Agarwal <[email protected]>

FIX-lux-org#431: fix cardinality bug

6ea34a7

Signed-off-by: Kunal Agarwal <[email protected]>

FIX-lux-org#431: remove print statement

c26295d

Signed-off-by: Kunal Agarwal <[email protected]>

FIX-lux-org#431: fix bug in check_id_like

caf1fe1

Signed-off-by: Kunal Agarwal <[email protected]>

dorisjlee requested changes Dec 1, 2021

westernguy2 added 2 commits December 1, 2021 19:42

FIX-lux-org#431: remove filter_executed dictionary implementation

222037a

Signed-off-by: Kunal Agarwal <[email protected]>

FIX-lux-org#431: fix bug with caching sampled df

c696b2b

Signed-off-by: Kunal Agarwal <[email protected]>

westernguy2 force-pushed the move-sampling branch from 75e5f01 to c696b2b Compare January 31, 2022 20:03

westernguy2 added 2 commits February 1, 2022 21:24

FIX-lux-org#431: Move column filtering after sampling

35bd2d2

Signed-off-by: Kunal Agarwal <[email protected]>

FIX-lux-org#431: changed sampling method to removing rows

5252d91

Signed-off-by: Kunal Agarwal <[email protected]>
