Benchmark Datasets

When developing privacy-preserving data synthesis workflows, you might face these challenges:

- You are unsure whether your data's characteristics suit a specific synthesis algorithm
- You are uncertain which synthesis parameters are appropriate
- You need a reliable reference standard for evaluation

Testing with benchmark datasets is good practice. Benchmark datasets have well-known characteristics and are widely used in academic research, which allows you to:

- Test your synthesis workflow on benchmark data first
- Verify that the results meet expectations
- Apply the same workflow to your own data

The following configuration loads the benchmark dataset and runs a default synthesis and evaluation workflow:

---
Loader:
  data:
    filepath: 'benchmark/adult-income.csv'
  benchmark:
    filepath: 'benchmark://adult-income'
Preprocessor:
  demo:
    method: 'default'
Synthesizer:
  demo:
    method: 'default'
Postprocessor:
  demo:
    method: 'default'
Evaluator:
  demo-quality:
    method: 'sdmetrics-qualityreport'
Reporter:
  save_report_global:
    method: 'save_report'
    granularity: 'global'
...
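
Once the benchmark run looks reasonable, the same workflow can in principle be pointed at your own data by changing only the Loader entry. The sketch below illustrates the idea with a plain Python dictionary that mirrors part of the configuration above. The helper `with_loader_filepath` and the path `data/my-data.csv` are hypothetical names used for illustration and are not part of PETsARD.

```python
import copy

# Plain-dict mirror of part of the YAML configuration above (abbreviated).
benchmark_config = {
    "Loader": {"benchmark": {"filepath": "benchmark://adult-income"}},
    "Preprocessor": {"demo": {"method": "default"}},
    "Synthesizer": {"demo": {"method": "default"}},
    "Postprocessor": {"demo": {"method": "default"}},
    "Evaluator": {"demo-quality": {"method": "sdmetrics-qualityreport"}},
}


def with_loader_filepath(config: dict, filepath: str, experiment: str = "data") -> dict:
    """Return a copy of `config` whose Loader reads a single experiment from `filepath`.

    Hypothetical helper: every module other than Loader is left unchanged.
    """
    new_config = copy.deepcopy(config)  # leave the benchmark config untouched
    new_config["Loader"] = {experiment: {"filepath": filepath}}
    return new_config


# 'data/my-data.csv' is a placeholder path for your own dataset.
own_config = with_loader_filepath(benchmark_config, "data/my-data.csv")
print(own_config["Loader"])
```

Keeping every module except Loader identical means any difference in the evaluation results can be attributed to the data rather than to the workflow.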

Appendix: Available Benchmark Datasets

Currently, PETsARD provides the Adult Income Dataset as a benchmark:

Name: adult-income
Source: U.S. Census Bureau
Scale: 48,842 records, 15 columns
Characteristics:
- Mixed numerical and categorical features
- Contains sensitive information (income)
- Suitable for testing privacy protection in data synthesis

Benchmark Dataset Usage

- Prefix the filepath with benchmark:// followed by the dataset name (for example, benchmark://adult-income) to select a benchmark dataset
- PETsARD automatically downloads and verifies the dataset
- Subsequent synthesis and evaluation steps are the same as with regular data
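
The steps above can be pictured with a small sketch of how a `benchmark://` filepath might be recognized and how a downloaded file might be verified. This is not PETsARD's implementation: the function names are hypothetical, and the sketch assumes that verification means comparing a SHA-256 checksum, which the text does not specify.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional

BENCHMARK_SCHEME = "benchmark://"


def parse_benchmark_name(filepath: str) -> Optional[str]:
    """Return the dataset name if `filepath` uses the benchmark:// scheme, else None."""
    if filepath.lower().startswith(BENCHMARK_SCHEME):
        return filepath[len(BENCHMARK_SCHEME):]
    return None


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: Path, expected_sha256: str) -> bool:
    """Assumed verification step: compare the file's digest with an expected value."""
    return sha256_of(path) == expected_sha256.lower()


# A benchmark path yields a dataset name; a regular path does not.
print(parse_benchmark_name("benchmark://adult-income"))  # adult-income
print(parse_benchmark_name("data/my-data.csv"))          # None

# Verify a stand-in file against its own digest (no real download happens here).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(b"age,income\n39,<=50K\n")
    expected = hashlib.sha256(b"age,income\n39,<=50K\n").hexdigest()
    print(verify_file(sample, expected))  # True
```

A checksum comparison of this kind guards against truncated or corrupted downloads, so later synthesis and evaluation steps start from a known copy of the dataset.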

For detailed implementation of benchmark datasets, please refer to Benchmark Dataset Maintenance in the Developer Guide.
