mpUCCs Experimental Feature
Overview
mpUCCs (Maximal Partial Unique Column Combinations) is an experimental privacy risk assessment tool in the PETsARD system. It is built on the theory of maximal partial unique column combinations and provides more accurate and efficient privacy risk assessment than traditional singling-out attack methods.
```yaml
---
Loader:
  data:
    filepath: 'benchmark/adult-income.csv'
Splitter:
  demo:
    num_samples: 1
    train_split_ratio: 0.8
Preprocessor:
  demo:
    method: 'default'
Synthesizer:
  demo:
    method: 'default'
Postprocessor:
  demo:
    method: 'default'
Evaluator:
  demo-mpuccs:
    method: 'mpuccs'
    n_cols:
      - 1
      - 2
      - 3
      - 4
      - 5
Reporter:
  output:
    method: 'save_data'
    source: 'Synthesizer'
  save_report_global:
    method: 'save_report'
    granularity: 'global'
...
```
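Assuming the configuration above is saved as `config.yaml` (an illustrative filename) and PETsARD's top-level `Executor` API, the pipeline can be run end to end:

```python
# Sketch: run the YAML pipeline above end to end.
# Assumes PETsARD's Executor API; 'config.yaml' is an illustrative path.
from petsard import Executor

executor = Executor(config='config.yaml')
executor.run()
```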
Theoretical Foundation
Core Concepts
UCC (Unique Column Combinations)
UCC refers to field combinations where all values are unique across all records in the dataset, with no duplicates.
Example: For address data, the address is a unique value. The address can also be viewed as a unique combination of county, township, road, and house number.
pUCC (Partial Unique Column Combinations)
pUCC refers to combinations that are unique only under specific conditions or specific values, not unique across the entire dataset.
Example: In most cases, a street name plus house number is not unique, because many townships have streets with the same name. Such a combination is unique only for (1) unusual road names or (2) unusual house numbers.
mpUCCs (Maximal Partial Unique Column Combinations)
mpUCCs refer to pUCCs in maximal form, meaning there is no smaller subset that can achieve the same identification effect.
Example: For “Zhongxiao East Road” “Section 1” “No. 1”, other counties also have a Zhongxiao East Road, so removing any one field breaks unique identification; such a combination is an mpUCC.
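A minimal sketch with made-up address rows (all values illustrative) shows the three notions side by side:

```python
from collections import Counter

# Hypothetical address records: (county, road, section, number)
rows = [
    ('Taipei',  'Zhongxiao E. Rd.', 'Sec. 1', 'No. 1'),
    ('Taipei',  'Zhongxiao E. Rd.', 'Sec. 1', 'No. 3'),
    ('Keelung', 'Zhongxiao E. Rd.', 'Sec. 1', 'No. 1'),
    ('Taipei',  "Ren'ai Rd.",       'Sec. 1', 'No. 1'),
    ('Taipei',  'Zhongxiao E. Rd.', 'Sec. 2', 'No. 1'),
]

def singletons(cols):
    """Value combinations over `cols` that occur exactly once (pUCCs)."""
    counts = Counter(tuple(r[i] for i in cols) for r in rows)
    return {v for v, n in counts.items() if n == 1}

# UCC: all four fields together identify every row uniquely.
print(len(singletons((0, 1, 2, 3))))   # 5

# pUCC: (road, section, number) is unique only for some values;
# ('Zhongxiao E. Rd.', 'Sec. 1', 'No. 1') occurs in two counties.
print(len(singletons((1, 2, 3))))      # 3

# mpUCC: for the first row, dropping any one of the four fields
# destroys uniqueness, so the full combination is maximal for it.
```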
Key Theoretical Insights
mpUCCs = QIDs (Quasi-identifiers)
The essence of singling-out attacks is:
- Identify a unique field combination in synthetic data
- This combination also corresponds to a unique record in original data
This is essentially equivalent to finding pUCCs and then checking for collisions.
Self-contained Anonymity
When a dataset has no feature combinations (IDs + QIDs) that can uniquely identify original entities, the dataset is considered anonymized.
Finding QIDs (Find-QIDs problem) is equivalent to discovering mpUCCs!
Repeatedly counting non-maximal field combinations overestimates risk; in set-theoretic terms, only maximal combinations capture the true singling-out risk.
Algorithm Implementation
Challenges of Find-QIDs Problem
- For k attributes, there are 2^k - 1 candidate QID combinations
- Proven to be a W[2]-complete problem (Bläsius et al., 2017)
- The problem lacks optimal substructure, so dynamic programming cannot be applied
Example: Knowing that {A, B} and {B, C} have no pUCCs does not mean {A, B, C} has none.
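A four-row counterexample in Python makes this concrete:

```python
from collections import Counter

# Minimal counterexample: neither {A, B} nor {B, C} has any unique
# value pair, yet every (A, B, C) triple is unique.
rows = [('a1', 'b1', 'c1'),
        ('a1', 'b1', 'c2'),
        ('a2', 'b1', 'c1'),
        ('a2', 'b1', 'c2')]

for cols in [(0, 1), (1, 2), (0, 1, 2)]:
    counts = Counter(tuple(r[i] for i in cols) for r in rows)
    n_unique = sum(1 for c in counts.values() if c == 1)
    print(cols, '->', n_unique)   # (0, 1) -> 0, (1, 2) -> 0, (0, 1, 2) -> 4
```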
Our Solution: Heuristic Greedy Cardinality-Prioritized Algorithm
1. High-Cardinality Field Priority
- Calculate cardinality of all fields
- Round numeric fields to their lowest (coarsest) detected precision
- Process field combinations breadth-first, from fewer fields to more, visiting high-cardinality fields first
2. Set Operations on Field and Value Domain Combinations
- Use `collections.Counter` to capture value combinations that occur exactly once in the synthetic data
- Compare against the original data to find the same value combinations that also occur exactly once there
- Record the corresponding original and synthetic data indices
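A minimal sketch of this step, assuming pandas DataFrames `ori` and `syn` with aligned columns (the production implementation layers masking and precision handling on top):

```python
from collections import Counter
import pandas as pd

def find_collisions(ori: pd.DataFrame, syn: pd.DataFrame, combo: list):
    """Value combinations unique in BOTH datasets: singling-out collisions."""
    syn_tuples = syn[combo].apply(tuple, axis=1)
    ori_tuples = ori[combo].apply(tuple, axis=1)

    # pUCCs: value combinations occurring exactly once in each dataset
    syn_once = {v for v, n in Counter(syn_tuples).items() if n == 1}
    ori_once = {v for v, n in Counter(ori_tuples).items() if n == 1}
    collisions = syn_once & ori_once

    # Corresponding row indices on both sides (for the details report)
    syn_idx = syn.index[syn_tuples.isin(collisions)].tolist()
    ori_idx = ori.index[ori_tuples.isin(collisions)].tolist()
    return collisions, syn_idx, ori_idx
```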
3. Pruning Strategy
If every value combination of a field combination is unique and collides with the original data, skip its supersets, since they cannot add new findings.
4. Masking Mechanism
Synthetic rows already identified by a smaller, high-cardinality field combination are masked, so they are not counted again as collisions for larger combinations.
5. Early Stopping Based on Conditional Entropy
We propose an early-stopping rule based on research in entropy-based functional dependency discovery (Mandros et al., 2020). For field combinations with k ≥ 2:
- Field combination entropy: H(XY) = entropy(Counter(syn_data[XY]) / syn_n_rows)
- Conditional entropy: H(Y|X) = Σ p(X = x) · H(Y | X = x), where x ∈ {pUCC, ¬pUCC}
- Mutual information: I(X; Y) = H(Y) - H(Y|X)
Early stopping: if the mutual information is negative, field combinations inheriting from this one are not evaluated further.
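One plausible reading of this rule as code (a sketch using Shannon entropy for clarity; the production metric uses Rényi entropy as described in the next subsection, where the analogous gain can dip below zero):

```python
import math
from collections import Counter

def shannon_entropy(counts, total):
    """Shannon entropy of a distribution given raw counts and their total."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def early_stop_signal(syn_values, min_entropy_delta=0.0):
    """Sketch of the early-stopping test for one field combination.

    X partitions rows into {pUCC, not-pUCC} by whether their value
    combination occurs exactly once; Y is the value combination itself.
    """
    counts = Counter(syn_values)
    n = len(syn_values)
    h_y = shannon_entropy(counts.values(), n)

    # Conditional entropy H(Y|X) = sum over x of p(X=x) * H(Y | X=x)
    h_y_given_x = 0.0
    for part in ([c for c in counts.values() if c == 1],    # pUCC rows
                 [c for c in counts.values() if c > 1]):    # not-pUCC rows
        n_x = sum(part)
        if n_x:
            h_y_given_x += (n_x / n) * shannon_entropy(part, n_x)

    mutual_information = h_y - h_y_given_x   # I(X; Y) = H(Y) - H(Y|X)
    return mutual_information < min_entropy_delta
```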
6. Rényi Entropy (α=2, Collision Entropy)
We use Rényi entropy instead of Shannon entropy for better collision-probability analysis:
- Theoretical maximum entropy = log(n_rows)
- Synthetic data maximum entropy = `scipy.stats.entropy(Counter(syn_data))`
- Field combination entropy = `scipy.stats.entropy(Counter(syn_data[column_combos]))`
- Normalization = synthetic data maximum entropy - field combination entropy
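For reference, collision entropy can be computed directly from value counts. The sketch below uses an illustrative normalization against log(n_rows); the exact normalization used by mpUCCs follows the bullets above:

```python
import numpy as np
from collections import Counter

def renyi_entropy(values, alpha=2.0):
    """Rényi entropy of the empirical distribution; alpha=2 is collision entropy."""
    counts = np.asarray(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    if alpha == 1.0:                     # limiting case: Shannon entropy
        return float(-(p * np.log(p)).sum())
    return float(np.log((p ** alpha).sum()) / (1.0 - alpha))

# Hypothetical value combinations of one field combo in synthetic data
syn_combo = [('a', 1), ('a', 1), ('b', 2), ('c', 3)]
h2 = renyi_entropy(syn_combo)            # collision entropy of the combination
h_max = np.log(len(syn_combo))           # theoretical maximum: log(n_rows)
print(h2, h_max, h2 / h_max)             # one illustrative normalization
```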
Key Improvements Over Anonymeter
1. Theoretical Foundation
- Clear theoretical basis: mpUCCs = QIDs provides solid mathematical foundation
- Avoids risk overestimation: Focuses on maximal combinations only
- Set-theoretic meaning: Proper understanding of singling-out risk
2. Algorithm Optimization
- Progressive tree-based search: Efficient field combination exploration
- Entropy-based pruning: Intelligent early stopping mechanism
- Cardinality-prioritized processing: High-cardinality fields processed first
- Collision-focused analysis: Direct focus on actual privacy risks
3. Precision Handling
- Automatic numeric precision detection: Handles floating-point comparisons
- Datetime precision support: Handles temporal data appropriately
- Manual precision override: Allows custom precision settings
4. Performance Improvements
- Faster execution: about 5 minutes versus 12+ minutes on the adult-income dataset (44 seconds when restricted to n_cols=3; see the comparison table below)
- Better scalability: Efficient handling of high-dimensional data
- Memory optimization: Counter-based uniqueness detection
5. Comprehensive Progress Tracking
- Dual-layer progress bars: Field-level and combination-level progress
- Detailed execution tree: Complete audit trail of algorithm decisions
- Pruning statistics: Transparency in optimization decisions
Configuration Parameters
```python
config = {
    'eval_method': 'mpuccs',
    'n_cols': None,              # Target combination sizes (None/int/list)
    'min_entropy_delta': 0.0,    # Minimum entropy gain threshold
    'field_decay_factor': 0.5,   # Field decay factor for weighting
    'renyi_alpha': 2.0,          # Rényi entropy parameter (collision entropy)
    'numeric_precision': None,   # Numeric precision (auto-detect or manual)
    'datetime_precision': None   # Datetime precision (auto-detect or manual)
}
```
Parameter Details
n_cols
- `None`: evaluate all combination sizes, from 1 up to the number of fields
- `int`: evaluate only the specified combination size
- `list`: evaluate the specified combination sizes (supports skip patterns such as `[1, 3]`)
min_entropy_delta
- Minimum entropy gain required to continue exploring a branch
- A branch is pruned when its entropy gain falls below this threshold, so any positive value enables entropy-based pruning
- Higher values lead to more aggressive pruning
- Default: 0.0 (no entropy-based pruning)
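Read together with the `combo_entropy`, `base_entropy`, and `entropy_gain` fields of the tree output below, the test presumably reduces to a comparison like this (a hypothetical sketch, not the exact implementation):

```python
# Hypothetical pruning check, consistent with the tree output fields:
# a branch is pruned when adding fields yields too little entropy gain.
def should_prune(combo_entropy, base_entropy, min_entropy_delta=0.0):
    entropy_gain = combo_entropy - base_entropy
    return entropy_gain < min_entropy_delta
```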
field_decay_factor
- Weighting factor for larger field combinations
- Reflects the practical difficulty of using more fields in attacks
- Default: 0.5 (each additional field halves the weight)
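The exact weighting scheme is not spelled out here; assuming weight = `field_decay_factor ** (k - 1)`, which is consistent with the Tree Results example below (a 2-field combination has weight 0.5, so 3 collisions yield a weighted count of 1.5):

```python
# Assumed weighting: each additional field halves the weight.
field_decay_factor = 0.5
for k in range(1, 5):
    print(k, field_decay_factor ** (k - 1))   # 1.0, 0.5, 0.25, 0.125
```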
renyi_alpha
- Alpha parameter for Rényi entropy calculation
- α=2 corresponds to collision entropy, suitable for privacy analysis
- Default: 2.0
Usage Examples
Basic Usage
```python
from petsard.evaluator import Evaluator

# Initialize evaluator
evaluator = Evaluator('mpuccs')
evaluator.create()

# Evaluate privacy risk
results = evaluator.eval({
    'ori': original_data,
    'syn': synthetic_data
})

# Access results
global_stats = results['global']
detailed_results = results['details']
tree_analysis = results['tree']
```
Advanced Configuration
```python
# Custom configuration
evaluator = Evaluator('mpuccs',
                      n_cols=[1, 2, 3],        # Only 1-, 2-, and 3-field combinations
                      min_entropy_delta=0.1,   # Aggressive pruning
                      field_decay_factor=0.3,  # Strong decay for large combinations
                      numeric_precision=2)     # 2 decimal places for numbers
evaluator.create()

results = evaluator.eval({
    'ori': original_data,
    'syn': synthetic_data
})
```
Skip Pattern Configuration
```python
# Skip 2-field combinations; only evaluate 1- and 3-field combinations
evaluator = Evaluator('mpuccs', n_cols=[1, 3])
evaluator.create()

results = evaluator.eval({
    'ori': original_data,
    'syn': synthetic_data
})
```
Output Results
Global Results
```python
{
    'total_syn_records': 1000,               # Total synthetic records
    'total_ori_records': 1000,               # Total original records
    'total_identified': 150,                 # Successfully identified records
    'identification_rate': 0.15,             # Basic identification rate
    'weighted_identification_rate': 0.12,    # Weighted identification rate
    'total_combinations_checked': 45,        # Total combinations evaluated
    'total_combinations_pruned': 12,         # Combinations pruned by algorithm
    'config_n_cols': '[1, 2, 3]',            # Configuration used
    'config_min_entropy_delta': 0.1,         # Entropy threshold used
    'config_field_decay_factor': 0.5,        # Decay factor used
    'config_renyi_alpha': 2.0,               # Rényi alpha parameter
    'config_numeric_precision': 2,           # Numeric precision applied
    'config_datetime_precision': 'D'         # Datetime precision applied
}
```
Details Results
```python
[
    {
        'combo_size': 2,                        # Number of fields in combination
        'syn_idx': 42,                          # Synthetic data index
        'field_combo': "('age', 'income')",     # Field combination used
        'value_combo': "(25, 50000)",           # Values that caused collision
        'ori_idx': 123                          # Corresponding original data index
    },
    # ... more collision records
]
```
Tree Results
```python
[
    {
        'check_order': 1,                       # Processing order
        'combo_size': 2,                        # Combination size
        'field_combo': "('age', 'income')",     # Field combination
        'base_combo': "('age',)",               # Base combination for entropy calculation
        'base_is_pruned': False,                # Whether base was pruned
        'combo_entropy': 0.85,                  # Normalized Rényi entropy
        'base_entropy': 0.72,                   # Base combination entropy
        'entropy_gain': 0.13,                   # Entropy gain from base
        'is_pruned': False,                     # Whether this combination was pruned
        'mpuccs_cnt': 5,                        # Number of unique combinations found
        'mpuccs_collision_cnt': 3,              # Number of successful collisions
        'field_weighted': 0.5,                  # Field-based weighting
        'total_weighted': 0.5,                  # Total weighting applied
        'weighted_mpuccs_collision_cnt': 1.5    # Weighted collision count
    },
    # ... more tree nodes
]
```
Performance Characteristics
Computational Complexity
- Time Complexity: O(2^k) in worst case, but with significant pruning
- Space Complexity: O(n*k) where n is number of records, k is number of fields
- Practical Performance: Linear to sub-quadratic on real datasets due to pruning
Scalability
- Field Scalability: Highly scalable with pruning - can handle datasets with many fields efficiently
- Record Scalability: Tested on datasets with 100K+ records
- Memory Efficiency: Counter-based operations minimize memory usage
Comparison with Anonymeter
| Metric | Anonymeter | mpUCCs | Improvement |
|---|---|---|---|
| Execution Time (adult-income, n_cols=3) | 12+ minutes | 44 seconds | 16x faster |
| Singling-out Detection | ~1,000-2,000 (random sampling) | 7,999 (complete evaluation) | Complete coverage |
| Theoretical Foundation | Heuristic | Mathematical | Solid theory |
| Risk Overestimation | High | Low | Accurate assessment |
| Progress Visibility | Not supported | Comprehensive | Full transparency |
| Precision Handling | Not supported | Automatic | Better usability |
Best Practices
1. Configuration Selection
- The default settings are a recommended starting point for most datasets
2. Data Preprocessing
- Ensure consistent data types between original and synthetic data
- Consider appropriate precision for numeric and datetime fields
- Remove or handle missing values consistently
3. Result Interpretation
- Focus on `weighted_identification_rate` for practical risk assessment
- Examine the `details` results to understand specific vulnerabilities
- Use the `tree` results to understand algorithm decisions and optimization
4. Performance Optimization
- Use skip patterns (`n_cols=[1, 3]`) to focus on specific combination sizes
- Consider field selection to reduce dimensionality if needed
Limitations and Future Work
Current Limitations
- Experimental Status: Still under active development and validation
- Memory Usage: Can be memory-intensive for very high-dimensional data
- Risk Weighting: Theoretically sound risk weighting methods are under research, currently using field_decay_factor = 0.5
Future Enhancements
- Distributed Computing: Support for parallel processing of large datasets (nice-to-have)
References
Abedjan, Z., & Naumann, F. (2011). Advancing the discovery of unique column combinations. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (pp. 1565-1570).
Bläsius, T., Friedrich, T., Lischeid, J., Meeks, K., & Schirneck, M. (2017). Efficiently enumerating hitting sets of hypergraphs arising in data profiling. In Proceedings of the 16th International Symposium on Experimental Algorithms (pp. 130-145).
Mandros, P., Kaltenpoth, D., Boley, M., & Vreeken, J. (2020). Discovering functional dependencies from mixed-type data. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1404-1414).
Support and Feedback
As an experimental feature, mpUCCs is actively being developed and improved. We welcome feedback, bug reports, and suggestions for enhancement. Please refer to the project’s issue tracker for reporting problems or requesting features.