
Unexpected behaviour missing value treatment "most_risky" / "least_risky" #117

Open
dlaprins opened this issue Apr 15, 2024 · 0 comments

When using a BucketingProcess, the treatment of missing values is determined by specifying missing_treatment for both the prebucketer and the bucketer. Let's consider using OptimalBucketer as the bucketer.

The desired functionality is for BucketingProcess to place missing values in the most risky bucket. This is currently not guaranteed: even when missing_treatment = "most_risky" is set for both the prebucketer and OptimalBucketer, the BucketingProcess as a whole does not necessarily place missing values in the most risky bucket.

Consider the following situation:

  • Let X be a numerical feature with a non-monotonic relation to target y.
  • Let N be the number of prebuckets. Let the riskiness of the final buckets be descending in bucket index (i.e., the riskiest bucket after OptimalBucketer is bucket 0).

Then what can happen is the following:

  • The prebucketer places missing values in some prebucket i with 0 < i < N
  • OptimalBucketer sees no missing values, since X has already been prebucketed. When it merges prebuckets, prebucket i may not be merged into bucket 0. As a result, missing values do not end up in bucket 0, which is the riskiest bucket.
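
The scenario above can be reproduced with a small toy calculation. This is only a sketch in plain Python and does not call skorecard: the prebucket counts, the merge map, and the helper names (event_rate, most_risky, pool) are all made up for illustration, and the merge is simply assumed rather than computed by an actual OptimalBucketer.

```python
# Toy prebucket statistics (made-up numbers): prebucket -> (n_observations, n_events).
# Prebucket 2 is small but has the highest event rate, so a "most_risky"
# prebucketer would send missing values there.
prebucket_stats = {0: (100, 30), 1: (100, 28), 2: (10, 4), 3: (100, 5), 4: (100, 6)}

# Assumed outcome of the second-stage merge: prebucket -> final bucket.
# Prebucket 2 is merged with its low-risk neighbours, not with final bucket 0.
merge_map = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}


def event_rate(n_obs, n_events):
    """Fraction of observations in a bucket that are events."""
    return n_events / n_obs


def most_risky(stats):
    """Return the bucket key with the highest event rate."""
    return max(stats, key=lambda b: event_rate(*stats[b]))


def pool(stats, mapping):
    """Sum prebucket counts into the final buckets given by `mapping`."""
    pooled = {}
    for pre, (n_obs, n_events) in stats.items():
        final = mapping[pre]
        tot_obs, tot_events = pooled.get(final, (0, 0))
        pooled[final] = (tot_obs + n_obs, tot_events + n_events)
    return pooled


# Stage 1: the prebucketer assigns missing values to the riskiest prebucket.
missing_prebucket = most_risky(prebucket_stats)

# Stage 2: the bucketer only sees prebucket indices, so missing values
# simply follow their prebucket through the merge.
final_stats = pool(prebucket_stats, merge_map)
missing_final_bucket = merge_map[missing_prebucket]
riskiest_final_bucket = most_risky(final_stats)

print("missing values end up in final bucket:", missing_final_bucket)
print("riskiest final bucket:", riskiest_final_bucket)
```

With these numbers the missing values follow prebucket 2 into final bucket 1, while the riskiest final bucket is bucket 0, which is the mismatch described above.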

It sounds a bit hypothetical, but it has actually occurred for me on two separate occasions now. It is both unintuitive and undesirable.

Suggested solution: add a missing_treatment parameter to BucketingProcess which allows missing values to be reassigned after the prebucketer and bucketer have been applied.
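
A minimal sketch of what such a post-hoc reassignment could look like, written in plain Python rather than against skorecard internals. The function name reassign_missing, its signature, and the choice to compute event rates from non-missing rows only are assumptions for illustration, not existing skorecard API.

```python
import math
from collections import defaultdict


def is_missing(x):
    """Treat None and float NaN as missing."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def reassign_missing(raw_values, final_buckets, targets, treatment="most_risky"):
    """Move rows with a missing raw value into the most or least risky final bucket.

    Hypothetical helper: event rates per final bucket are computed from
    non-missing rows only, after prebucketing and bucketing have been applied.
    """
    if treatment not in ("most_risky", "least_risky"):
        raise ValueError(f"unsupported treatment: {treatment!r}")

    # final bucket -> [n_observations, n_events], non-missing rows only
    counts = defaultdict(lambda: [0, 0])
    for x, bucket, y in zip(raw_values, final_buckets, targets):
        if not is_missing(x):
            counts[bucket][0] += 1
            counts[bucket][1] += y

    if not counts:
        # Nothing to compare against; leave the assignment unchanged.
        return list(final_buckets)

    rates = {bucket: n_events / n_obs for bucket, (n_obs, n_events) in counts.items()}
    pick = max if treatment == "most_risky" else min
    target_bucket = pick(rates, key=rates.get)

    return [
        target_bucket if is_missing(x) else bucket
        for x, bucket in zip(raw_values, final_buckets)
    ]


# Usage on toy data: two rows are missing and were left in final bucket 1.
raw = [1.0, 2.0, None, 8.0, 9.0, float("nan")]
buckets = [0, 0, 1, 1, 1, 1]
y = [1, 1, 0, 0, 0, 1]

most = reassign_missing(raw, buckets, y, treatment="most_risky")
least = reassign_missing(raw, buckets, y, treatment="least_risky")
print(most)
print(least)
```

Doing the reassignment on the final buckets, rather than inside either stage, is what makes the result independent of how the bucketer happens to merge prebuckets.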
