
Original data stored in interpret explainer classes #368

Open
epetrovski opened this issue Jan 20, 2021 · 4 comments

@epetrovski

interpret is a very useful package for explaining ML using SHAP, thanks. But I have a legal issue that prohibits me from using this in a professional context.

It seems that the explainer classes contain original datasets in obscure places. For instance, if I fit explainer = TabularExplainer(model, data) I end up with all my original data in explainer.explainer.initialization_examples.original_dataset.

This is a fact that I think most users are simply unaware of, and a big issue for professionals, like me, working under a GDPR regime. If asked, I need to be able to tell regulators exactly where my customer's data is stored, and that answer should always be a centralized and protected database, not some Python object that ends up getting uploaded to an Azure ML Workspace or pickled and saved to disk.

So my question is whether it is strictly necessary for interpret's explainer models to store the original data they were initialized on? If not, could you commit to stripping original data from explainer classes?

@interpret-ml

Hi @epetrovski -- It seems you are using the interpret-community package because TabularExplainer is a class that only exists there. Transferring the issue to them for further response.

-InterpretML team

@interpret-ml interpret-ml transferred this issue from interpretml/interpret Jan 20, 2021
@gaugup
Collaborator

gaugup commented Jan 21, 2021

@epetrovski thanks for raising the privacy concern here. I don't see where in the interpret-community code the customer's data is being cached in TabularExplainer. Could you provide a code sample where we can see the caching of the raw dataset?

I looked at the code for TabularExplainer. My hunch is that perhaps the shap explainers cache the raw dataset, which is something we don't control. Just a hunch; more may become clear once you supply the code sample.

Regards

@imatiach-msft
Collaborator

@gaugup it is cached in the individual explainers (e.g. the mimic explainer, see:

self.initialization_examples = initialization_examples

), and it is used to put the data on the explanation object (e.g. see

kwargs[ExplainParams.INIT_DATA] = self.initialization_examples

).
Maybe we can add an option to remove it. However, without some data the visualization dashboard won't be useful at all, so I'm not sure what @epetrovski is suggesting we should do: without the original dataset the explanation isn't very useful to the user. This is more of a PM question; maybe our PMs could take a look at this issue?

@epetrovski
Author

Maybe we can add an option to remove it. However, without some data the visualization dashboard won't be useful at all. So I'm not sure what @epetrovski is suggesting we should do - since without the original dataset the explanation isn't very useful to the user. This is more of a PM question - maybe our PMs could take a look at this issue?

Couldn't you simply ask users to supply the entire dataset at the initialization of the dashboard, instead of caching all the data upfront before you even know whether the user is going to use a dashboard at all?
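The suggestion above can be sketched as a deferred-data design. Everything here is hypothetical (toy classes, not the actual interpret-community or dashboard API): the explainer retains no dataset, and the data is handed over only at visualization time, if at all:

```python
# Sketch of a deferred-data design: the explainer keeps no copy of the
# dataset; the dashboard receives it only when the user builds one.
# All names and the toy "importance" metric are illustrative.

class StatelessExplainer:
    """Computes explanations without retaining the dataset."""
    def __init__(self, model):
        self.model = model  # note: no initialization_examples attribute

    def explain(self, data):
        # Toy importance score: mean absolute value per feature column.
        n_rows = len(data)
        n_cols = len(data[0])
        return [sum(abs(row[j]) for row in data) / n_rows for j in range(n_cols)]

class Dashboard:
    """The user supplies the dataset here, at visualization time."""
    def __init__(self, explanation, dataset):
        self.explanation = explanation
        self.dataset = dataset  # lives only as long as the dashboard

data = [[1.0, -2.0], [3.0, 4.0]]
explainer = StatelessExplainer(model=None)
explanation = explainer.explain(data)
print(explanation)  # [2.0, 3.0]

# Only a user who actually wants the dashboard passes the data again:
dash = Dashboard(explanation, dataset=data)
```

Under this design, pickling the explainer or the explanation alone would carry no raw data; the trade-off is that the caller must keep the dataset available for as long as they want to visualize against it.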
