MWEM Updates (#481)

* Testing with marginals
* Make marginal queries
* MWEM cuboids
* Unit test for MWEM
* Updates to MWEM
* MWEM documentation
* Prep for discrete histogramdd
* Linter fixes
* Doc tweaks

Co-authored-by: Joshua <[email protected]>
opendp · Aug 10, 2022 · f8de737
1 parent 17030ab

Showing 8 changed files with 444 additions and 256 deletions.
diff --git a/synth/.gitignore b/synth/.gitignore
@@ -1 +1,2 @@
 poetry.lock
+run.py
diff --git a/synth/HISTORY.md b/synth/HISTORY.md
@@ -1,3 +1,15 @@
+# SmartNoise Synth v0.2.7 Release Notes
+
+## MWEM Updates
+
+* Support for measuring Cuboids. Cuboids include multiple disjoint queries that can be measured under a single iteration.
+* Default iterations and query count adapt based on dimensionality of source data
+* Support for measure-only MWEM, for small cubes with optimal query workloads
+* Basic accountant keeps track of spent epsilon
+* Removed bin edge support, since we delegate to preprocessor now
+* Better handling of cases where the exponential mechanism can't find a query; it should now always find a query to measure
+* Debug flag prints trace information
+
 # SmartNoise Synth v0.2.6 Release Notes
 
 * Support for MST synthesizer.

diff --git a/synth/VERSION b/synth/VERSION
@@ -1 +1 @@
-0.2.6
+0.2.7
diff --git a/synth/docs/source/index.rst b/synth/docs/source/index.rst
@@ -32,12 +32,27 @@ Multiplicative Weights Exponential Mechanism.  From "`A Simple and Practical Alg
   pums = pums.drop(['income'], axis=1)
   nf = pums.to_numpy().astype(int)
 
-  synth = snsynth.MWEMSynthesizer(epsilon=1.0, split_factor=nf.shape[1]) 
+  synth = snsynth.MWEMSynthesizer(debug=True)
   synth.fit(nf)
 
-  sample = synth.sample(10)
+  print(f"MWEM spent epsilon {synth.spent}")
+  sample = synth.sample(100)
   print(sample)
 
+MWEM maintains an in-memory copy of the full joint distribution, initialized to a uniform distribution, and updated with each iteration.  The size of the joint distribution in memory is the product of the cardinalities of the columns of the input data, which may be much larger than the number of rows.  For example, in the PUMS dataset with the income column dropped, the size of the in-memory histogram is 29,184 cells.  The size of the histogram can explode rapidly when several columns have high cardinality.  You can provide splits to divide the columns into independent subsets, which may dramatically reduce the memory requirement.
+
+The MWEM algorithm alternates between two steps: selecting a poorly-performing query via the exponential mechanism, and measuring that query via the Laplace mechanism, then updating the estimated joint distribution to perform better on it.  Over multiple iterations, the estimated joint distribution will become better at answering the selected workload of queries.
+
+Because of this, the performance of MWEM is highly dependent on the quality of the candidate query workload.  The implementation tries to generate a query workload that will perform well.  You can provide some hints to influence the candidate queries.  By default, MWEM will generate workloads with all one-way and two-way marginals.  If you want to ensure three-way or higher marginals are candidates, you can use ``marginal_width``.  In cases where columns contain ordinal data, particularly in columns that are binned from continuous values, you can use ``add_ranges`` to ensure that candidate queries include range queries.  If range queries are more important than two-way marginals, for example, you can combine ``add_ranges`` with ``marginal_width=1`` to suppress two-way marginals.
+
+Each iteration spends a fraction of the budgeted epsilon.  Very large numbers of iterations may divide the epsilon too finely, resulting in large error on each measurement.  Conversely, using too few iterations will reduce the error of individual measurements, but may interfere with the algorithm converging.  For example, if your data set has 15 total one-way and two-way cuboids, and you use ``iterations=15``, every cuboid will be measured with the maximum possible epsilon, but the fit may be poor.
+
+By default, the implementation chooses a number of iterations based on the size of the data.  The number of candidate queries generated will be based on the number of iterations, and will focus on marginals, unless ``add_ranges`` is specified.  You can also specify a ``q_count`` to ensure that a certain number of candidate queries is always generated, regardless of the number of iterations.  This can be useful when you care more about range queries than marginals, because range queries each touch a single slice, while marginal cuboids may update hundreds or thousands of slices.
+
+If you are confident that the candidate query workload is high quality, you can use ``measure_only`` to skip the query selection step, and just sample uniformly from the workload.  This can double the budget available to measure the queries, but typically is not useful for anything but the smallest datasets.
+
+The ``debug`` flag prints detailed diagnostics about the query workload generated and convergence status.
+
 DP-CTGAN
 --------
 
@@ -92,6 +107,14 @@ MST
 
 MST achieves state of the art results for marginals over categorical data, and does well even with small source data.  From McKenna et al. "`Winning the NIST Contest: A scalable and general approach to differentially private synthetic data <https://arxiv.org/abs/2108.04978>`_"
 
+Before using MST, install `Private-PGM <https://github.com/ryan112358/private-pgm>`_:
+
+.. code-block:: bash
+  
+  pip install git+https://github.com/ryan112358/private-pgm.git
+
+Then call it like this:
+
 .. code-block:: python
 
   import snsynth
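As a rough illustration of the sizing and budgeting points in the MWEM documentation added to synth/docs/source/index.rst above, here is a minimal sketch. It is not taken from the snsynth source. It assumes that a column's cardinality is its count of distinct values, that splits are given as lists of column indices, and that each iteration's share of epsilon is divided evenly between query selection and measurement unless ``measure_only`` is used (an assumption suggested by the note that skipping selection can double the measurement budget). All helper names are hypothetical.

```python
import itertools
import math

import numpy as np


def joint_histogram_cells(data):
    """Cells in the full joint histogram: product of per-column cardinalities.

    Assumption: cardinality is the number of distinct values in each column.
    """
    cardinalities = [len(np.unique(data[:, j])) for j in range(data.shape[1])]
    return math.prod(cardinalities)


def cells_with_splits(data, splits):
    """Total cells when each column subset gets its own independent histogram."""
    return sum(joint_histogram_cells(data[:, cols]) for cols in splits)


def per_measurement_epsilon(epsilon, iterations, measure_only=False):
    """Epsilon available to each measurement.

    Assumption: the budget is spread evenly over iterations, and half of each
    iteration's share goes to query selection unless selection is skipped.
    """
    per_iteration = epsilon / iterations
    return per_iteration if measure_only else per_iteration / 2


# Toy data containing every combination of three columns
# with cardinalities 2, 3 and 4 (24 rows, 3 columns).
data = np.array(list(itertools.product(range(2), range(3), range(4))))

print(joint_histogram_cells(data))               # 2 * 3 * 4 = 24
print(cells_with_splits(data, [[0, 1], [2]]))    # 2 * 3 + 4 = 10
print(per_measurement_epsilon(1.0, 10))          # 0.05
print(per_measurement_epsilon(1.0, 10, True))    # 0.1
```

The product grows multiplicatively with each added column while the split total grows additively across subsets, which is why splitting can reduce memory so sharply. The cost is that columns in different subsets are modeled as independent.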

diff --git a/synth/pyproject.toml b/synth/pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "smartnoise-synth"
-version = "0.2.6"
+version = "0.2.7"
 description = "Differentially Private Synthetic Data"
 authors = ["SmartNoise Team <[email protected]>"]
 license = "MIT"

diff --git a/synth/setup.py b/synth/setup.py
@@ -4,6 +4,7 @@
 packages = \
 ['snsynth',
  'snsynth.models',
+ 'snsynth.mst',
  'snsynth.preprocessors',
  'snsynth.pytorch',
  'snsynth.pytorch.nn']
@@ -12,13 +13,13 @@
 {'': ['*']}
 
 install_requires = \
-['ctgan>=0.4.3,<0.5.0', 'opacus>=0.14.0,<0.15.0', 'opendp>=0.3.0,<0.4.0']
+['ctgan>=0.4.3,<0.5.0', 'opacus>=0.14.0,<0.15.0', 'opendp>=0.4.0,<0.5.0']
 
 setup_kwargs = {
     'name': 'smartnoise-synth',
-    'version': '0.2.5',
+    'version': '0.2.7',
     'description': 'Differentially Private Synthetic Data',
-    'long_description': '[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Python](https://img.shields.io/badge/python-3.7%20%7C%203.8-blue)](https://www.python.org/)\n\n<a href="https://smartnoise.org"><img src="https://github.com/opendp/smartnoise-sdk/raw/main/images/SmartNoise/SVG/Logo%20Mark_grey.svg" align="left" height="65" vspace="8" hspace="18"></a>\n\n# SmartNoise Synthesizers\n\nDifferentially private synthesizers for tabular data.  Package includes:\n* MWEM\n* QUAIL\n* DP-CTGAN\n* PATE-CTGAN\n* PATE-GAN\n\n## Installation\n\n```\npip install smartnoise-synth\n```\n\n## Using\n\n### MWEM\n\n```python\nimport pandas as pd\nimport numpy as np\n\npums = pd.read_csv(pums_csv_path, index_col=None) # in datasets/\npums = pums.drop([\'income\'], axis=1)\nnf = pums.to_numpy().astype(int)\n\nsynth = snsynth.MWEMSynthesizer(epsilon=1.0, split_factor=nf.shape[1]) \nsynth.fit(nf)\n\nsample = synth.sample(10)\nprint(sample)\n```\n### DP-CTGAN\n\n```python\nimport pandas as pd\nimport numpy as np\nfrom snsynth.pytorch.nn import DPCTGAN\nfrom snsynth.pytorch import PytorchDPSynthesizer\n\npums = pd.read_csv(pums_csv_path, index_col=None) # in datasets/\npums = pums.drop([\'income\'], axis=1)\n\nsynth = PytorchDPSynthesizer(1.0, DPCTGAN(), None)\nsynth.fit(pums, categorical_columns=pums.columns)\n\nsample = synth.sample(10) # synthesize 10 rows\nprint(sample)\n```\n\n### PATE-CTGAN\n\n```python\nimport pandas as pd\nimport numpy as np\nfrom snsynth.pytorch.nn import PATECTGAN\nfrom snsynth.pytorch import PytorchDPSynthesizer\n\npums = pd.read_csv(pums_csv_path, index_col=None) # in datasets/\npums = pums.drop([\'income\'], axis=1)\n\nsynth = PytorchDPSynthesizer(1.0, PATECTGAN(regularization=\'dragan\'), None)\nsynth.fit(pums, categorical_columns=pums.columns)\n\nsample = synth.sample(10) # synthesize 10 rows\nprint(sample)\n```\n\n## Note on Inputs\n\nMWEM, DP-CTGAN, and PATE-CTGAN require columns to be 
categorical. If you have columns with continuous values, you should discretize them before fitting.  Take care to discretize in a way that does not reveal information about the distribution of the data.\n\n## Communication\n\n- You are encouraged to join us on [GitHub Discussions](https://github.com/opendp/opendp/discussions/categories/smartnoise)\n- Please use [GitHub Issues](https://github.com/opendp/smartnoise-sdk/issues) for bug reports and feature requests.\n- For other requests, including security issues, please contact us at [[email protected]](mailto:[email protected]).\n\n## Releases and Contributing\n\nPlease let us know if you encounter a bug by [creating an issue](https://github.com/opendp/smartnoise-sdk/issues).\n\nWe appreciate all contributions. Please review the [contributors guide](../contributing.rst). We welcome pull requests with bug-fixes without prior discussion.\n\nIf you plan to contribute new features, utility functions or extensions to this system, please first open an issue and discuss the feature with us.',
+    'long_description': '[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Python](https://img.shields.io/badge/python-3.7%20%7C%203.8-blue)](https://www.python.org/)\n\n<a href="https://smartnoise.org"><img src="https://github.com/opendp/smartnoise-sdk/raw/main/images/SmartNoise/SVG/Logo%20Mark_grey.svg" align="left" height="65" vspace="8" hspace="18"></a>\n\n# SmartNoise Synthesizers\n\nDifferentially private synthesizers for tabular data.  Package includes:\n* MWEM\n* MST\n* QUAIL\n* DP-CTGAN\n* PATE-CTGAN\n* PATE-GAN\n\n## Installation\n\n```\npip install smartnoise-synth\n```\n\n## Using\n\nPlease see the [SmartNoise synthesizers documentation](https://docs.smartnoise.org/synth/index.html) for usage examples.\n\n## Note on Inputs\n\nMWEM and MST require columns to be categorical. If you have columns with continuous values, you should discretize them before fitting.  Take care to discretize in a way that does not reveal information about the distribution of the data.\n\n## Communication\n\n- You are encouraged to join us on [GitHub Discussions](https://github.com/opendp/opendp/discussions/categories/smartnoise)\n- Please use [GitHub Issues](https://github.com/opendp/smartnoise-sdk/issues) for bug reports and feature requests.\n- For other requests, including security issues, please contact us at [[email protected]](mailto:[email protected]).\n\n## Releases and Contributing\n\nPlease let us know if you encounter a bug by [creating an issue](https://github.com/opendp/smartnoise-sdk/issues).\n\nWe appreciate all contributions. Please review the [contributors guide](../contributing.rst). We welcome pull requests with bug-fixes without prior discussion.\n\nIf you plan to contribute new features, utility functions or extensions to this system, please first open an issue and discuss the feature with us.\n',
     'author': 'SmartNoise Team',
     'author_email': '[email protected]',
     'maintainer': None,