Benchmark Datasets
When developing privacy-preserving data synthesis workflows, you might face these challenges:
- Unsure if your data characteristics suit specific synthesis algorithms
- Uncertain about the appropriate synthesis parameters
- Need a reliable reference standard for evaluation
Using benchmark datasets for testing is a good practice. Benchmark datasets have well-known characteristics and are widely used in academic research, allowing you to:
- Test your synthesis workflow on benchmark data first
- Verify results meet expectations
- Apply the same workflow to your data
---
Loader:
  data:
    filepath: 'benchmark/adult-income.csv'
  benchmark:
    filepath: 'benchmark://adult-income'
Preprocessor:
  demo:
    method: 'default'
Synthesizer:
  demo:
    method: 'default'
Postprocessor:
  demo:
    method: 'default'
Evaluator:
  demo-quality:
    method: 'sdmetrics-qualityreport'
Reporter:
  save_report_global:
    method: 'save_report'
    granularity: 'global'
...
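Once the benchmark run behaves as expected, the same workflow can be pointed at your own data by changing only the Loader entry. The following is a minimal sketch of that idea in Python, treating the configuration above as a plain dictionary. The helper name `with_own_data`, the experiment name `data`, and the path `my-data.csv` are hypothetical and used only for illustration; this is not PETsARD's own API.

```python
import copy

# Benchmark configuration, mirroring the YAML example as a plain dictionary.
BENCHMARK_CONFIG = {
    "Loader": {"benchmark": {"filepath": "benchmark://adult-income"}},
    "Preprocessor": {"demo": {"method": "default"}},
    "Synthesizer": {"demo": {"method": "default"}},
    "Postprocessor": {"demo": {"method": "default"}},
    "Evaluator": {"demo-quality": {"method": "sdmetrics-qualityreport"}},
    "Reporter": {
        "save_report_global": {"method": "save_report", "granularity": "global"}
    },
}


def with_own_data(config: dict, filepath: str, experiment: str = "data") -> dict:
    """Return a copy of config whose Loader reads filepath.

    Hypothetical helper: every module other than Loader is left unchanged,
    so the benchmark run and the real run share the same settings.
    """
    updated = copy.deepcopy(config)
    updated["Loader"] = {experiment: {"filepath": filepath}}
    return updated


# "my-data.csv" is a placeholder path for your own dataset.
own_config = with_own_data(BENCHMARK_CONFIG, "my-data.csv")
```

Because the copy is deep, the benchmark configuration stays untouched and can be re-run later as a reference.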
Appendix: Available Benchmark Datasets
Currently, PETsARD provides the Adult Income Dataset as a benchmark:
- Name: adult-income
- Source: U.S. Census Bureau
- Scale: 48,842 records, 15 columns
- Characteristics:
  - Mixed numerical and categorical features
  - Contains sensitive information (income)
  - Suitable for testing privacy protection in data synthesis
Benchmark Datasets Usage
- Use benchmark:// in filepath to specify the benchmark dataset
- PETsARD will automatically download and verify the dataset
- Subsequent synthesis and evaluation processes remain the same as with regular data
For detailed implementation of benchmark datasets, please refer to Benchmark Dataset Maintenance in the Developer Guide.
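As a rough illustration of the "download and verify" step, the sketch below shows how a filepath with the benchmark:// prefix could be recognized and how a cached file could be checked against a SHA-256 digest. It is a sketch under stated assumptions, not PETsARD's actual implementation: the function names, the case-sensitive prefix match, and the use of SHA-256 are all assumptions made for this example.

```python
import hashlib
from pathlib import Path
from typing import Optional

BENCHMARK_SCHEME = "benchmark://"


def parse_benchmark_name(filepath: str) -> Optional[str]:
    """Return the dataset name if filepath uses the benchmark scheme, else None.

    Hypothetical helper; the prefix match is assumed to be case-sensitive.
    """
    if filepath.startswith(BENCHMARK_SCHEME):
        name = filepath[len(BENCHMARK_SCHEME):]
        return name or None
    return None


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_verified(path: Path, expected_sha256: str) -> bool:
    """Return True when the file exists and its digest matches the expected value."""
    return path.is_file() and sha256_of(path) == expected_sha256.lower()
```

With these helpers, `parse_benchmark_name("benchmark://adult-income")` returns `"adult-income"`, while a regular path such as `"benchmark/adult-income.csv"` returns `None` and would be loaded as an ordinary file.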