Handling Missing Values

Most synthetic data algorithms are probabilistic models, and research by the CAPE team found that the majority cannot handle missing values (None, np.nan, pd.NA) directly. Even when an algorithm claims to support missing values, it is difficult to verify that its approach is appropriate. PETsARD therefore recommends handling every column that contains missing values before synthesis, with the following defaults:

- Numeric columns: Default to mean imputation (missing_mean)
- Categorical/text/date columns: Default to row deletion (missing_drop)
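To make the two defaults concrete, here is a minimal pandas sketch of the same idea: mean imputation for numeric columns, row deletion for everything else. It is not PETsARD's implementation; the helper name `apply_default_missing_handling` and the toy data are hypothetical, and the order of the two steps is an assumption.

```python
import numpy as np
import pandas as pd


def apply_default_missing_handling(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: mean-impute numeric columns, drop rows with gaps elsewhere."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    other_cols = out.columns.difference(numeric_cols)

    # Numeric columns: fill each gap with that column's mean.
    if len(numeric_cols) > 0:
        out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].mean())

    # Categorical/text/date columns: drop any row that has a gap.
    if len(other_cols) > 0:
        out = out.dropna(subset=list(other_cols))

    return out.reset_index(drop=True)


# Toy data (made up for illustration).
df = pd.DataFrame(
    {
        "age": [20.0, np.nan, 40.0],
        "workclass": ["Private", "State-gov", None],
    }
)
cleaned = apply_default_missing_handling(df)
print(cleaned)
```

In this sketch the missing age becomes 30.0 (the mean of 20 and 40), and the row with no workclass is removed.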

PETsARD offers several methods for handling missing values.

---
Loader:
  data:
    filepath: 'benchmark/adult-income.csv'
    na_values: '?' # treat every '?' in the dataset as a missing value
Preprocessor:
  missing-only:
    # Run only the missing-value handler and the encoder, each with its
    #   default settings; all other preprocessing steps are skipped.
    # The encoder is kept because the dataset has categorical features.
    sequence:
      - 'missing'
      - 'encoder'
Synthesizer:
  demo:
    method: 'default'
Postprocessor:
  demo:
    method: 'default'
Evaluator:
  demo-quality:
    method: 'sdmetrics-qualityreport'
Reporter:
  output:
    method: 'save_data'
    source: 'Synthesizer'
  save_report_global:
    method: 'save_report'
    granularity: 'global'
...
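The effect of the Loader's na_values option can be illustrated with plain pandas, whose `read_csv` accepts a parameter of the same name. Whether PETsARD's Loader passes the option straight through to pandas is an assumption here, and the inline CSV is made-up sample data rather than the real adult-income file.

```python
import io

import pandas as pd

# Made-up sample rows; '?' marks an unknown value.
raw = io.StringIO(
    "age,workclass,occupation\n"
    "39,State-gov,Adm-clerical\n"
    "54,?,?\n"
    "28,Private,Sales\n"
)

# With na_values='?', every '?' cell is parsed as a missing value (NaN).
df = pd.read_csv(raw, na_values="?")

# Count the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Without the na_values setting, '?' would be read as an ordinary string and no missing-value handler would ever see it.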

Customized Settings

This configuration customizes missing value handling. Setting method: 'default' means that any field without an explicit setting uses the default processing method.

In the missing section, three fields are customized: rows with a missing workclass are dropped, missing occupation values are imputed with the mode, and missing native-country values are filled with the specified value 'Galactic Empire'.

Preprocessor:
  missing-custom:
    missing:
      workclass: 'missing_drop'
      occupation: 'missing_mode'
      native-country:
        method: 'missing_simple'
        value: 'Galactic Empire'
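As a rough illustration of what this per-column configuration asks for, the same three rules can be written out in pandas. This is a sketch of the intended effect only, not how PETsARD applies the settings internally; the toy data is invented.

```python
import pandas as pd

# Toy data (made up); None marks a missing value.
df = pd.DataFrame(
    {
        "workclass": ["Private", None, "Private", "State-gov", "Private"],
        "occupation": ["Sales", "Sales", None, "Tech-support", "Sales"],
        "native-country": ["Cuba", "India", "Cuba", None, "Cuba"],
    }
)

# workclass -> 'missing_drop': remove rows where workclass is missing.
out = df.dropna(subset=["workclass"]).reset_index(drop=True)

# occupation -> 'missing_mode': fill with the most frequent value.
most_frequent = out["occupation"].mode().iloc[0]
out["occupation"] = out["occupation"].fillna(most_frequent)

# native-country -> 'missing_simple': fill with the configured value.
out["native-country"] = out["native-country"].fillna("Galactic Empire")

print(out)
```

Each column gets its own rule, so one row is lost to the workclass rule while the gaps in the other two columns are filled in place.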

Missing Value Handling Methods

Drop Missing Values (missing_drop)

- Removes rows containing missing values
- Suitable when missing values are rare
- Note: May lose important information
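One way to judge whether missing values are rare enough for dropping is to measure how many rows would be lost first. The pandas sketch below does that on invented data; it is an illustration, not part of PETsARD.

```python
import numpy as np
import pandas as pd

# Toy data (made up); one of four rows has a gap.
df = pd.DataFrame(
    {
        "age": [25, 38, 52, 41],
        "occupation": ["Sales", np.nan, "Craft-repair", "Sales"],
    }
)

# Flag rows with at least one missing value, then take the share of such rows.
rows_with_gaps = df.isna().any(axis=1)
fraction_lost = rows_with_gaps.mean()

# Drop those rows, as a drop-based handler would.
dropped = df.dropna().reset_index(drop=True)

print(f"{fraction_lost:.0%} of rows would be dropped")
```

If the fraction is large, an imputation method is likely a safer choice than deletion.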

Statistical Imputation

- Mean imputation (missing_mean): Fill with column mean
- Median imputation (missing_median): Fill with column median
- Mode imputation (missing_mode): Fill with most frequent value
Suitable for different data types:
- Use mean or median for numerical data
- Use mode for categorical data
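The three statistics can give quite different fill values on the same column. The short pandas sketch below shows each one on invented data; it illustrates the statistics themselves and makes no claim about PETsARD's internal code.

```python
import numpy as np
import pandas as pd

# Toy columns (made up); the numeric one is skewed by a single large value.
hours = pd.Series([40.0, 40.0, np.nan, 100.0, 20.0])
education = pd.Series(["HS-grad", None, "Bachelors", "HS-grad"])

# Mean imputation: sensitive to outliers such as the 100.0 above.
mean_filled = hours.fillna(hours.mean())

# Median imputation: more robust when the distribution is skewed.
median_filled = hours.fillna(hours.median())

# Mode imputation: works for categorical data, where a mean is undefined.
mode_filled = education.fillna(education.mode().iloc[0])

print(mean_filled[2], median_filled[2], mode_filled[1])
```

Here the mean fills the gap with 50.0 while the median fills it with 40.0, which is why the median is often preferred for skewed numeric columns.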

Custom Imputation (missing_simple)

- Fill missing values with a specified value
- Requires setting the value parameter
- Suitable when specific business logic applies

You can use different methods for different columns by specifying the appropriate configuration in your settings file.
