Constrainer
Constrainer(config)
Data constraint handler for synthetic data generation. Supports NaN handling, field-level constraints, and field combination rules.
Parameters
config
(dict): Constraint configuration dictionary containing the following keys:
nan_groups
(dict): NaN handling rules
- Key: Column name with NaN values
- Value for 'delete' action: the string 'delete'
- Value for 'erase' and 'copy' actions: a dictionary containing the action and target fields
- For 'erase':
{'erase': target_field}
where target_field can be a string or a list of strings
- For 'copy':
{'copy': target_field}
where target_field is a string
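The three nan_groups actions can be sketched in plain pandas. This is an illustration of the semantics only, not petsard's implementation; the column names are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Amy", None, "Ben"],
    "job": ["eng", "doc", None],
    "salary": [50000.0, 60000.0, 70000.0],
    "bonus": [np.nan, 6000.0, 7000.0],
})

# 'delete': drop the entire row when name is NaN
out = df[df["name"].notna()].copy()

# 'erase': when job is NaN, set the target fields salary and bonus to NaN
mask = out["job"].isna()
out.loc[mask, ["salary", "bonus"]] = np.nan

# 'copy': when salary has a value but the target bonus is NaN, copy it over
mask = out["salary"].notna() & out["bonus"].isna()
out.loc[mask, "bonus"] = out.loc[mask, "salary"]
```

Note that the actions cascade in order here: a bonus erased by the 'erase' rule is not refilled by 'copy' if its source salary was also erased.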
field_constraints
(List[str]): Field-level constraints as string expressions
- Supported operators: >, >=, ==, !=, <, <=, IS, IS NOT
- Supported logical operators: &, |
- Supports parenthesized expressions
- Special value: "pd.NA" for NULL checks
- DATE() function for date comparisons
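Each field_constraints string evaluates to a boolean row filter. As a rough sketch of the semantics (plain pandas, not the library's expression parser; column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [18, 25, 65, 40],
    "status": [None, "active", "active", None],
})

# "age >= 20 & age <= 60" expressed as an equivalent pandas filter
kept = df[(df["age"] >= 20) & (df["age"] <= 60)]

# "status IS pd.NA" keeps only rows where status is NULL
nulls = df[df["status"].isna()]
```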
field_combinations
(List[tuple]): Field combination rules
- Each tuple contains (field_map, allowed_values)
- field_map: Dict with one source-to-target field mapping
- allowed_values: Dict mapping source values to allowed target values
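The positive-listing behavior of a field combination can be sketched in plain pandas (an illustration of the rule, not the library's implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "education": ["PhD", "PhD", "Bachelor", "HighSchool"],
    "performance": [5, 2, 3, 1],
})

# allowed_values for the field_map {'education': 'performance'}
allowed = {"PhD": [4, 5], "Master": [4, 5], "Bachelor": [3, 4, 5]}

# Rows with an unlisted source value (e.g. 'HighSchool') pass through untouched;
# listed values are kept only when the target value is in the allowed list.
mask = df.apply(
    lambda row: row["education"] not in allowed
    or row["performance"] in allowed[row["education"]],
    axis=1,
)
kept = df[mask]
```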
Note:
- All constraints are combined with AND logic. A row must satisfy all constraints to be kept in the result.
- Field combinations are positive listings that only affect the specified values. For example, if education='PhD' requires performance in [4,5], this rule only filters PhD records; other education values and NULL values are not affected by it.
- When handling NULL values in YAML or Python configurations, always use the string "pd.NA" (case-sensitive) instead of None, np.nan, or pd.NA objects to avoid unexpected behavior.
Examples
import pandas as pd

from petsard import Constrainer
# Configure constraints
config = {
# NaN handling rules - Specify how to handle NaN values and related fields
'nan_groups': {
'name': 'delete', # Delete entire row when name is NaN
'job': {
'erase': ['salary', 'bonus'] # Set salary and bonus to NaN when job is NaN
},
'salary': {
'copy': 'bonus' # Copy salary value to bonus when salary has value but bonus is NaN
}
},
# Field constraints - Specify value ranges for individual fields
# Supported operators: >, >=, ==, !=, <, <=, IS, IS NOT
# Supported logical operators: &, |
# Supports parentheses and DATE() function
'field_constraints': [
"age >= 20 & age <= 60", # Age must be between 20-60
"performance >= 4" # Performance must be >= 4
],
# Field combination rules - Specify value mappings between different fields
# Format: (field_map, allowed_value_pairs)
# Note: These are positive listings, unlisted values are not filtered, for example:
# - If education is not PhD/Master/Bachelor, it won't be filtered
# - Only filters if education is PhD but performance is not 4 or 5
'field_combinations': [
(
{'education': 'performance'}, # Education to performance mapping
{
'PhD': [4, 5], # PhD only allows scores 4 or 5
'Master': [4, 5], # Master only allows scores 4 or 5
'Bachelor': [3, 4, 5] # Bachelor allows scores 3, 4, 5
}
),
# Can configure multiple field combinations
(
{('education', 'performance'): 'salary'}, # Education + performance to salary mapping
{
('PhD', 5): [90000, 100000], # Salary range for PhD with performance 5
('Master', 4): [70000, 80000] # Salary range for Master with performance 4
}
)
]
}
cnst: Constrainer = Constrainer(config)
result: pd.DataFrame = cnst.apply(df)
Methods
apply()
cnst.apply(df)
Apply configured constraints to input DataFrame.
Parameters
df
(pd.DataFrame): Input DataFrame to be constrained
Returns
- pd.DataFrame: DataFrame after applying all constraints
resample_until_satisfy()
cnst.resample_until_satisfy(
data=df,
target_rows=1000,
synthesizer=synthesizer,
postprocessor=None,
max_trials=300,
sampling_ratio=10.0,
verbose_step=10
)
Resample data until meeting constraints with target number of rows.
Parameters
data
(pd.DataFrame): Input DataFrame to be constrained
target_rows
(int): Number of rows to achieve
synthesizer
: Synthesizer instance for generating synthetic data
postprocessor
(optional): Optional postprocessor for data transformation
max_trials
(int, default=300): Maximum number of trials before giving up
sampling_ratio
(float, default=10.0): Multiple of target_rows to generate in each trial
verbose_step
(int, default=10): Print progress every verbose_step trials
Returns
- pd.DataFrame: DataFrame that satisfies all constraints with target number of rows
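The oversample-filter-accumulate loop behind this method can be sketched as follows. All names here are hypothetical stand-ins (a toy synthesizer and a plain callable constraint), not petsard's internals:

```python
import random

import pandas as pd

random.seed(0)

def toy_synthesizer(n):
    # Hypothetical stand-in for a real synthesizer's sampling call.
    return pd.DataFrame({"age": [random.randint(10, 70) for _ in range(n)]})

def resample_until_satisfy(constraint, target_rows, max_trials=300, sampling_ratio=10.0):
    kept = pd.DataFrame()
    for _ in range(max_trials):
        # Oversample by sampling_ratio, then keep only rows passing the constraint
        batch = toy_synthesizer(int(target_rows * sampling_ratio))
        kept = pd.concat([kept, batch[constraint(batch)]], ignore_index=True)
        if len(kept) >= target_rows:
            return kept.head(target_rows)
    return kept  # gave up after max_trials

result = resample_until_satisfy(
    lambda d: (d["age"] >= 20) & (d["age"] <= 60), target_rows=50
)
```

A sampling_ratio well above 1.0 trades extra generation cost for fewer trials when constraints reject many rows.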
register()
Register a new constraint type.
Parameters
name
(str): Constraint type name
constraint_class
(type): Class implementing the constraint
Returns
None
Attributes
resample_trails
(int): Number of resampling trials performed; only created after executing resample_until_satisfy()