Data Constraining
Constrain synthetic data through field value rules, field combinations, field proportions, and NA handling strategies. The current implementation supports four constraint types: field constraints, field combinations, field proportions, and NaN groups.
---
Loader:
  data:
    filepath: 'benchmark/adult-income.csv'
Preprocessor:
  demo:
    method: 'default'
Synthesizer:
  demo:
    method: 'default'
Postprocessor:
  demo:
    method: 'default'
Constrainer:
  demo:
    nan_groups:
      # Delete entire row when workclass is NA
      workclass: 'delete'
      # Set income to NA if occupation is NA
      occupation:
        'erase':
          - 'income'
      # Copy educational-num value to age when educational-num exists but age is NA
      age:
        'copy': 'educational-num'
    field_constraints:
      - "age >= 18 & age <= 65"
      - "hours-per-week >= 20 & hours-per-week <= 60"
    field_combinations:
      -
        - education: income
        - Doctorate: ['>50K']
          Masters: ['>50K', '<=50K']
    field_proportions:
      field_proportions:
        # Maintain education distribution with 10% tolerance
        - education:
            mode: 'all'
            tolerance: 0.1
        # Maintain income distribution with 5% tolerance
        - income:
            mode: 'all'
            tolerance: 0.05
        # Maintain workclass missing value proportions with 3% tolerance
        - workclass:
            mode: 'missing'
            tolerance: 0.03
        # Maintain education-income combination proportions with 15% tolerance
        # Note: Complex keys with tuples are not yet supported in YAML format;
        # this will be added in a future update
Reporter:
  output:
    method: 'save_data'
    source: 'Constrainer'
...
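This configuration can be run from Python. The sketch below assumes the standard PETsARD entry point (an Executor that loads a YAML configuration file); the file name config.yaml stands for wherever the configuration above is saved.

# Minimal sketch: run the YAML configuration above with PETsARD.
# Assumes the Executor entry point that accepts a YAML config path;
# 'config.yaml' is illustrative.
from petsard import Executor

executor = Executor(config='config.yaml')
executor.run()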
Data Constraint Methods
Data constraining is a refined mechanism for controlling the quality and consistency of synthetic data, allowing users to define acceptable data ranges through multi-layered rules. PETsARD provides four primary constraint types: NaN group constraints, field constraints, field combination constraints, and field proportion constraints. These constraints collectively ensure that the generated synthetic data is not only statistically faithful to the original data but also complies with specific domain logic and business regulations.
Notes:
- All constraint conditions are combined using a strict “all must be satisfied” logic, meaning a single data point must simultaneously satisfy all defined constraint conditions to be retained. In other words, only data records that fully comply with each constraint rule will pass the filter.
- Field combination rules use a positive list approach, affecting only specified values, and do not impact values not mentioned in the fields.
- When using NA values in YAML, always use the string "pd.NA".
- Users are strongly recommended to thoroughly examine the original data before defining constraint conditions to ensure that the designed constraint rules accurately reflect the data’s inherent characteristics.
NaN (Missing Value) Group Constraints (nan_groups)
NaN group constraints allow customized handling of missing data:
- delete: Delete the entire row when the specified field is NA
- erase: Set other fields to NA when the primary field is NA
- copy: Copy values to other fields when the primary field has a value
Data constraints do not conflict with the missing value handling (missing) in data preprocessing, because the constraint mechanism filters and validates data after synthesis and data restoration. The two steps play complementary roles: the preprocessing stage handles basic missing value issues to assist synthesis, while the constraint mechanism further ensures that synthetic data complies with specific domain logic and statistical specifications.

Constrainer:
  demo:
    nan_groups:
      # Delete entire row when workclass is NA
      workclass: 'delete'
      # Set income to NA when occupation is NA
      occupation:
        'erase':
          - 'income'
      # Copy educational-num value to age when age is NA and educational-num has a value
      age:
        'copy': 'educational-num'
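The three modes can be pictured as plain pandas operations on the synthetic data. The sketch below only illustrates the intended row-level behavior on a toy DataFrame; it is not PETsARD's implementation.

# Illustrative pandas sketch of the nan_groups semantics above
# (not PETsARD's implementation).
import pandas as pd

df = pd.DataFrame({
    'workclass': ['Private', pd.NA, 'State-gov'],
    'occupation': [pd.NA, 'Sales', 'Tech-support'],
    'income': ['<=50K', '>50K', '>50K'],
    'age': [pd.NA, 25, 40],
    'educational-num': [13, 9, 10],
})

# delete: drop the whole row when workclass is NA
df = df[df['workclass'].notna()].copy()

# erase: when occupation is NA, set income to NA as well
df.loc[df['occupation'].isna(), 'income'] = pd.NA

# copy: when age is NA and educational-num has a value, copy it into age
mask = df['age'].isna() & df['educational-num'].notna()
df.loc[mask, 'age'] = df.loc[mask, 'educational-num']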
Field Constraints (field_constraints)
Field constraints allow setting specific value range rules for individual fields
Supported operators:
- Comparison operators: >, >=, ==, !=, <, <=
- Logical operators: & (and), | (or)
- Special checks: IS, IS NOT
- Date function: DATE() allows declaring specific dates and can be flexibly combined with other fields and logical operators
The current field constraint implementation in PETsARD uses a custom syntax parser that supports complex logical operations and field comparisons, and can handle nested boolean expressions. However, some functional limitations exist. For specific constraint needs or complex filtering logic that cannot be satisfied, it is recommended to remove extreme cases from the original data or to contact the CAPE team directly for a more customized solution.

Constrainer:
  demo:
    field_constraints:
      - "age >= 18 & age <= 65"  # Age limit between 18 and 65
      - "hours-per-week >= 20 & hours-per-week <= 60"  # Work hours limit between 20 and 60
      - "income == '<=50K' | (age > 50 & hours-per-week < 40)"  # Low income, or older with fewer work hours
      - "native-country IS NOT 'United-States'"  # Non-US nationality
      - "occupation IS pd.NA"  # Missing occupation information
      - "education == 'Doctorate' & income == '>50K'"  # Doctorate must have high income
      - "(race != 'White') == (income == '>50K')"  # Equivalence check between non-white race and high income
      - "(marital-status == 'Married-civ-spouse' & hours-per-week > 40) | (marital-status == 'Never-married' & age < 30)"  # Complex logical combination
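Each expression acts as a row filter on the synthetic data. As a rough pandas equivalent of the first two rules (illustrative only; PETsARD evaluates the expressions with its own parser):

# Rough pandas equivalent of the first two field_constraints rules above;
# illustrates the filtering effect only, not PETsARD's expression parser.
import pandas as pd

df = pd.read_csv('benchmark/adult-income.csv')  # same file as in the Loader config

keep = (
    (df['age'] >= 18) & (df['age'] <= 65)
    & (df['hours-per-week'] >= 20) & (df['hours-per-week'] <= 60)
)
df = df[keep]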
Field Combination Constraints (field_combinations)
Field combination constraints allow defining value range relationships between different fields
Supported combination types:
- Single field mapping: Constraints based on a single field’s values
- Multiple field mapping: Simultaneously considering multiple field values for more complex constraints
For the example below:
- For income: only Doctorate and Masters incomes are constrained; Bachelors income is not affected
- For salary: only Doctorates from the US, Canada, and the UK have specific allowed salary values
- Doctorates from countries other than these three, and people from these countries who are not Doctorates, are not filtered or affected
In the current implementation, field combination constraints use a positive list approach and support only explicitly listed value combinations. Numeric fields can enumerate valid values, but logical numeric comparisons with operators (as in field constraints) are not yet supported.
Constrainer:
  demo:
    field_combinations:
      -
        - {'education': 'income'}
        - {
            'Doctorate': ['>50K'],         # Doctorate only allows high income
            'Masters': ['>50K', '<=50K']   # Masters allows high and low income
          }
      -
        - {('education', 'native-country'): 'salary'}
        - {
            ('Doctorate', 'United-States'): [90000, 100000],   # Doctorate in the US: allowed salary values
            ('Doctorate', 'Canada'): [85000, 95000],           # Doctorate in Canada: allowed salary values
            ('Doctorate', 'United-Kingdom'): [80000, 90000]    # Doctorate in the UK: allowed salary values
          }
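Because field combinations are a positive list, a row is only checked when its education value appears in the mapping; everything else passes through. A small illustrative filter for the education/income rule (not PETsARD's implementation):

# Illustrative positive-list filtering for the education/income combination:
# education values not listed in `allowed` are left unconstrained.
import pandas as pd

allowed = {
    'Doctorate': ['>50K'],
    'Masters': ['>50K', '<=50K'],
}

def row_ok(row) -> bool:
    permitted = allowed.get(row['education'])
    if permitted is None:        # value not listed: row is not constrained
        return True
    return row['income'] in permitted

df = pd.DataFrame({
    'education': ['Doctorate', 'Doctorate', 'Bachelors'],
    'income': ['<=50K', '>50K', '<=50K'],
})
df = df[df.apply(row_ok, axis=1)]  # removes only the Doctorate/<=50K row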
Field Proportion Constraints (field_proportions)
Field proportion constraints maintain the original data distribution proportions during constraint filtering
Supported modes:
- all: Maintain the distribution of all values in the field
- missing: Maintain only the proportion of missing values
The tolerance parameter controls the acceptable deviation from the original proportions (0.0-1.0).
Both single fields and field combinations are supported.
The target number of rows is determined automatically during the resampling process.
Constrainer:
  demo:
    field_proportions:
      field_proportions:
        # Maintain education distribution with 10% tolerance
        - {'education': {'mode': 'all', 'tolerance': 0.1}}
        # Maintain income distribution with 5% tolerance
        - {'income': {'mode': 'all', 'tolerance': 0.05}}
        # Maintain workclass missing value proportions with 3% tolerance
        - {'workclass': {'mode': 'missing', 'tolerance': 0.03}}
        # Maintain education-income combination proportions with 15% tolerance
        - {('education', 'income'): {'mode': 'all', 'tolerance': 0.15}}
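Whether a constrained sample stays within tolerance can be pictured as comparing value shares before and after filtering. The helper below is a small illustrative check, not part of PETsARD's API; within_tolerance and its arguments are hypothetical names.

# Illustrative check: are the value proportions of a filtered column within
# `tolerance` of the original column's proportions? (Hypothetical helper,
# not part of PETsARD's API.)
import pandas as pd

def within_tolerance(original: pd.Series, filtered: pd.Series, tolerance: float) -> bool:
    orig = original.value_counts(normalize=True, dropna=False)
    filt = filtered.value_counts(normalize=True, dropna=False)
    categories = orig.index.union(filt.index)
    diff = (orig.reindex(categories, fill_value=0)
            - filt.reindex(categories, fill_value=0)).abs()
    return bool((diff <= tolerance).all())

# e.g. within_tolerance(original_df['education'], constrained_df['education'], 0.1)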
Field Proportions Notes:
- Field proportion constraints use iterative filtering to maintain data distributions while removing excess data (see the sketch after these notes)
- The constrainer protects underrepresented data groups while filtering out overrepresented ones
- Tolerance values should be set based on the acceptable deviation from original proportions
- Field combinations create complex distribution patterns that are maintained during filtering
- The target number of rows is provided automatically by the main Constrainer during resampling
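The iterative behavior described in these notes can be pictured as capping over-represented groups at their original share of the target row count while leaving under-represented groups untouched. The sketch below is a simplified illustration under that assumption, not PETsARD's resampling algorithm; trim_overrepresented is a hypothetical helper.

# Simplified illustration of the notes above: drop rows only from groups that
# exceed their original share of the target row count; smaller groups are kept
# whole. Hypothetical helper, not PETsARD's resampling algorithm.
import pandas as pd

def trim_overrepresented(original: pd.Series, synthetic: pd.DataFrame,
                         column: str, target_rows: int) -> pd.DataFrame:
    orig_prop = original.value_counts(normalize=True)
    kept = []
    for value, group in synthetic.groupby(column):
        cap = int(round(orig_prop.get(value, 0) * target_rows))
        kept.append(group.head(cap) if len(group) > cap else group)
    return pd.concat(kept, ignore_index=True)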