Constrainer
Constrainer(config)
Data constraint handler for synthetic data generation. Supports NaN handling, field-level constraints, and field combination rules.
Parameters
config
(dict): Constraint configuration dictionary containing the following keys:
nan_groups
(dict): NaN handling rules
- Key: Column name with NaN values
- Value for 'delete' action: the string 'delete'
- Value for 'erase' and 'copy' actions: Dictionary containing the action and target field(s)
  - For 'erase': {'erase': target_field}, where target_field can be a string or a list of strings
  - For 'copy': {'copy': target_field}, where target_field is a string
- Value for 'nan_if_condition' action: {'nan_if_condition': condition_dict}
  - condition_dict is a dictionary where:
    - Key: Target field name to check the condition on
    - Value: Matching value(s) in the target field (a single value or a list of values)
  - When values in the target fields match the specified conditions, the main field is set to pd.NA
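For instance, a 'nan_if_condition' rule might look like the following sketch (the column names here are illustrative, not from the library):

```python
# Hypothetical nan_groups entry: set 'bonus' to pd.NA whenever
# 'job_status' matches 'retired' or 'unemployed' (illustrative columns).
config = {
    'nan_groups': {
        'bonus': {
            'nan_if_condition': {
                'job_status': ['retired', 'unemployed']
            }
        }
    }
}
```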
field_constraints
(List[str]): Field-level constraints as string expressions
- Supported comparison operators: >, >=, ==, !=, <, <=, IS, IS NOT
- Supported logical operators: &, |
- Supports parenthesized expressions
- Special value: "pd.NA" for NULL checks
- DATE() function for date comparisons
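A few constraint strings using these operators (the column names are hypothetical, and the exact DATE() literal syntax should be checked against the library documentation):

```python
# Illustrative field_constraints entries (hypothetical column names)
field_constraints = [
    "age >= 20 & age <= 60",              # numeric range with logical AND
    "(salary > 50000) | (bonus > 5000)",  # parenthesized OR expression
    "job IS NOT pd.NA",                   # NULL check via the "pd.NA" string
    "hire_date >= DATE(2020-01-01)",      # date comparison (DATE() syntax assumed)
]
```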
field_combinations
(List[tuple]): Field combination rules
- Each tuple contains (field_map, allowed_values)
- field_map: Dict with one source-to-target field mapping
- allowed_values: Dict mapping source values to allowed target values
field_proportions
(List[dict]): Field proportion maintenance rules. Each rule is a dictionary containing:
- fields (str or List[str]): Field name(s) to maintain proportions for; a single field or a list of fields
- mode (str): Either 'all' (maintain all value distributions) or 'missing' (maintain missing-value proportions only)
- tolerance (float, optional): Allowed deviation from the original proportions (0.0-1.0); defaults to 0.1 (10%)
Note:
- All constraints are combined with AND logic. A row must satisfy all constraints to be kept in the result.
- Field combinations are positive listings that only affect specified values. For example, if education='PhD' requires performance in [4,5], this only filters PhD records. Other education values or NULL values are not affected by this rule.
- Field proportions maintain the original data distribution by iteratively removing excess data while protecting underrepresented groups.
- When handling NULL values in YAML or Python configurations, always use the string “pd.NA” (case-sensitive) instead of None, np.nan, or pd.NA objects to avoid unexpected behaviors.
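The positive-listing behavior of field combinations can be illustrated with plain pandas. This is a sketch of the semantics only, not the library's implementation:

```python
import pandas as pd

df = pd.DataFrame({
    'education': ['PhD', 'PhD', 'Master', 'HighSchool', None],
    'performance': [5, 3, 4, 1, 2],
})

allowed = {'PhD': [4, 5], 'Master': [4, 5], 'Bachelor': [3, 4, 5]}

# A row is dropped only if its education value is listed AND its
# performance value falls outside the allowed set; unlisted values
# (e.g. 'HighSchool') and NULLs pass through untouched.
mask = df.apply(
    lambda row: row['education'] not in allowed
    or row['performance'] in allowed[row['education']],
    axis=1,
)
result = df[mask]
```

Only the (PhD, 3) row is removed; the HighSchool and NULL rows survive because their education values are not listed in the rule.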
Examples
import pandas as pd

from petsard import Constrainer
# Configure constraints
config = {
# NaN handling rules - Specify how to handle NaN values and related fields
'nan_groups': {
'name': 'delete', # Delete entire row when name is NaN
'job': {
'erase': ['salary', 'bonus'] # Set salary and bonus to NaN when job is NaN
},
'salary': {
'copy': 'bonus' # Copy salary value to bonus when salary has value but bonus is NaN
}
},
# Field constraints - Specify value ranges for individual fields
# Supported operators: >, >=, ==, !=, <, <=, IS, IS NOT
# Supported logical operators: &, |
# Supports parentheses and DATE() function
'field_constraints': [
"age >= 20 & age <= 60", # Age must be between 20-60
"performance >= 4" # Performance must be >= 4
],
# Field combination rules - Specify value mappings between different fields
# Format: (field_map, allowed_value_pairs)
# Note: These are positive listings, unlisted values are not filtered, for example:
# - If education is not PhD/Master/Bachelor, it won't be filtered
# - Only filters if education is PhD but performance is not 4 or 5
'field_combinations': [
(
{'education': 'performance'}, # Education to performance mapping
{
'PhD': [4, 5], # PhD only allows scores 4 or 5
'Master': [4, 5], # Master only allows scores 4 or 5
'Bachelor': [3, 4, 5] # Bachelor allows scores 3, 4, 5
}
),
# Can configure multiple field combinations
(
{('education', 'performance'): 'salary'}, # Education + performance to salary mapping
{
('PhD', 5): [90000, 100000], # Salary range for PhD with performance 5
('Master', 4): [70000, 80000] # Salary range for Master with performance 4
}
)
],
# Field proportion rules - Maintain original data distribution proportions
'field_proportions': [
# Maintain category distribution with default tolerance (10%)
{'fields': 'category', 'mode': 'all'},
# Maintain missing value proportions for income field with custom 5% tolerance
{'fields': 'income', 'mode': 'missing', 'tolerance': 0.05},
# Maintain field combination proportions with default tolerance (10%)
{'fields': ['gender', 'age_group'], 'mode': 'all'}
]
}
cnst: Constrainer = Constrainer(config)
result: pd.DataFrame = cnst.apply(df)
Field Proportions Examples
# Example 1: Basic field proportions (using default tolerance 0.1)
config = {
'field_proportions': [
{'fields': 'category', 'mode': 'all'} # Maintain category distribution with default 10% tolerance
]
}
# Example 2: Missing value proportions (custom tolerance)
config = {
'field_proportions': [
{'fields': 'income', 'mode': 'missing', 'tolerance': 0.05} # Maintain missing ratio with 5% tolerance
]
}
# Example 3: Multiple field combinations
config = {
'field_proportions': [
{'fields': 'category', 'mode': 'all'}, # Default 10% tolerance
{'fields': 'income', 'mode': 'missing', 'tolerance': 0.05}, # Custom 5% tolerance
{'fields': ['gender', 'age_group'], 'mode': 'all', 'tolerance': 0.15} # Custom 15% tolerance
]
}
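The tolerance check can be thought of as comparing per-value proportions before and after filtering. This is a sketch of the idea (interpreting tolerance as an absolute difference in proportions), not the library's actual algorithm:

```python
import pandas as pd

original = pd.Series(['A'] * 60 + ['B'] * 40)  # 60% / 40%
filtered = pd.Series(['A'] * 55 + ['B'] * 35)  # ~61.1% / ~38.9%

orig_props = original.value_counts(normalize=True)
filt_props = filtered.value_counts(normalize=True)

# With the default 0.1 tolerance, every per-value proportion in the
# filtered sample must stay within 10 percentage points of the original.
tolerance = 0.1
within = all(
    abs(filt_props.get(value, 0.0) - prop) <= tolerance
    for value, prop in orig_props.items()
)
```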
Methods
apply()
cnst.apply(df)
Apply configured constraints to input DataFrame.
Parameters
df
(pd.DataFrame): Input DataFrame to be constrained
Returns
- pd.DataFrame: DataFrame after applying all constraints
resample_until_satisfy()
cnst.resample_until_satisfy(
data=df,
target_rows=1000,
synthesizer=synthesizer,
postprocessor=None,
max_trials=300,
sampling_ratio=10.0,
verbose_step=10
)
Resample data until meeting constraints with target number of rows.
Parameters
data
(pd.DataFrame): Input DataFrame to be constrained
target_rows
(int): Number of rows to achieve
synthesizer
: Synthesizer instance for generating synthetic data
postprocessor
(optional): Optional postprocessor for data transformation
max_trials
(int, default=300): Maximum number of trials before giving up
sampling_ratio
(float, default=10.0): Multiple of target_rows to generate in each trial
verbose_step
(int, default=10): Print progress every verbose_step trials
Returns
- pd.DataFrame: DataFrame that satisfies all constraints with target number of rows
register()
Register a new constraint type.
Parameters
name
(str): Constraint type name
constraint_class
(type): Class implementing the constraint
Returns
None
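A custom constraint might be registered as sketched below. The interface assumed here (a class whose apply() takes and returns a DataFrame) is an assumption, so check the library source for the actual contract:

```python
import pandas as pd

class PositiveSalaryConstraint:
    """Hypothetical constraint: keep only rows with salary > 0.

    The constructor and apply() signatures assumed here are illustrative;
    consult the PETsARD source for the actual constraint interface.
    """

    def __init__(self, config):
        self.config = config

    def apply(self, df: pd.DataFrame) -> pd.DataFrame:
        # Drop rows whose salary is zero or negative.
        return df[df['salary'] > 0]

# Registration (assumed usage):
# from petsard import Constrainer
# Constrainer.register('positive_salary', PositiveSalaryConstraint)
```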
Attributes
resample_trails
(int): Number of resampling trials; only created after executing resample_until_satisfy()