Encoding Categorical Variables

Most synthetic data algorithms support only numerical fields. Even those that accept categorical fields directly usually rely on the synthesizer’s own built-in preprocessing and post-processing transformations to convert and restore the values. The CAPE team designed PETsARD to keep this hard-to-predict behavior of third-party packages under control, and therefore recommends explicitly encoding every field that contains categorical variables:

Categorical variables: Uniform Encoding is applied by default; see Uniform Encoding in the developer manual for technical details.

---
Loader:
  data:
    filepath: 'benchmark/adult-income.csv'
Preprocessor:
  encoding-only:
    # run only the encoding step, using the default encoder for each field
    sequence:
      - 'encoder'
Synthesizer:
  demo:
    method: 'default'
Postprocessor:
  demo:
    method: 'default'
Evaluator:
  demo-quality:
    method: 'sdmetrics-qualityreport'
Reporter:
  output:
    method: 'save_data'
    source: 'Synthesizer'
  save_report_global:
    method: 'save_report'
    granularity: 'global'
...

Custom Configuration

The following configuration customizes how categorical fields are encoded. Setting method: 'default' means that every field without an explicit setting uses the default processing method.

In the encoder block, three fields get different encoding strategies. workclass uses uniform encoding, the default for categorical values. occupation uses label encoding, on the assumption that the alphabetical order of the occupation categories reflects a meaningful hierarchy. native-country uses one-hot encoding, which expands the field into k binary columns (one per category), so each country stays a distinct category and no artificial ordering is imposed.

Preprocessor:
  encoding-custom:
    sequence:
      - 'encoder'
    encoder:
      workclass: 'encoding_uniform'
      occupation: 'encoding_label'
      native-country: 'encoding_onehot'
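
The fallback rule described above (explicit per-field settings win, everything else gets the default) can be sketched in a few lines of Python. This is only an illustration of the rule, not PETsARD's internal code: the function `resolve_encoders` and the constant `DEFAULT_ENCODER` are hypothetical names, and the extra column `education` is used purely as an example of an unconfigured field.

```python
# Hypothetical helper, not part of PETsARD: shows how per-field overrides
# combine with a default encoder.
DEFAULT_ENCODER = "encoding_uniform"  # assumed default, as stated in this tutorial


def resolve_encoders(categorical_columns, overrides, default=DEFAULT_ENCODER):
    """Return the encoder name to use for each categorical column."""
    return {col: overrides.get(col, default) for col in categorical_columns}


overrides = {
    "workclass": "encoding_uniform",
    "occupation": "encoding_label",
    "native-country": "encoding_onehot",
}
columns = ["workclass", "occupation", "native-country", "education"]
plan = resolve_encoders(columns, overrides)
# "education" has no explicit setting, so it falls back to the default.
```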

Encoding Methods

Uniform Encoding (encoding_uniform)

- Converts categorical values to uniformly distributed numbers
- Suitable for general categorical variables
- Default encoding method
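
To make "uniformly distributed numbers" concrete, here is a minimal sketch of one common way to do this. It assumes each category owns a sub-interval of [0, 1) whose width equals its relative frequency, and that a value is encoded by drawing a random number from its category's interval. The function names are hypothetical and the details may differ from PETsARD's implementation; the developer manual is the authoritative description.

```python
import random
from collections import Counter


def fit_uniform_intervals(values):
    """Assign each category a sub-interval of [0, 1) sized by its frequency."""
    counts = Counter(values)
    total = len(values)
    intervals, lower = {}, 0.0
    # Most frequent category first; ties are broken by name for a stable layout.
    for category, count in sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        upper = lower + count / total
        intervals[category] = (lower, upper)
        lower = upper
    return intervals


def encode_uniform(values, intervals, rng):
    """Replace each category with a random number drawn from its interval."""
    return [rng.uniform(*intervals[v]) for v in values]


def decode_uniform(numbers, intervals):
    """Map each number back to the category whose interval contains it."""
    categories = list(intervals)
    decoded = []
    for x in numbers:
        match = categories[-1]  # fallback for values outside every interval
        for category in categories:
            lower, upper = intervals[category]
            if lower <= x < upper:
                match = category
                break
        decoded.append(match)
    return decoded
```

Because the encoded column is continuous on [0, 1), a numerical synthesizer can model it directly, and any synthetic number still decodes to a valid category.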

Label Encoding (encoding_label)

- Converts categorical values to consecutive integers
- Suitable for ordinal categorical variables
- Preserves order relationships between categories
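
As a rough illustration, label encoding can be sketched as below. The sketch assumes categories are numbered in sorted (alphabetical) order, matching the occupation example earlier, and that synthetic values are rounded and clamped before decoding; both choices, and the function names, are assumptions for illustration rather than PETsARD's documented behavior.

```python
def fit_label_mapping(values):
    """Map each distinct category to an integer, following sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def encode_label(values, mapping):
    """Replace each category with its integer code."""
    return [mapping[v] for v in values]


def decode_label(codes, mapping):
    """Turn (possibly non-integer) codes back into categories."""
    inverse = {code: category for category, code in mapping.items()}
    top = len(inverse) - 1
    # Synthetic values may be fractional or out of range: round, then clamp.
    return [inverse[min(max(round(c), 0), top)] for c in codes]
```

Note that the integer codes imply an order and a distance between categories, which is why this method fits ordinal variables better than nominal ones.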

One-Hot Encoding (encoding_onehot)

- Transforms each category into an independent feature column, where each column represents the presence or absence of a category
- Categorical data is processed as independent features during synthesis and recombined afterward
- Suitable for variables with fewer categories, as each additional category increases feature dimensionality
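
The expand-then-recombine idea can be sketched as follows. The recombination step here picks the column with the largest value, which is one plausible way to handle synthetic rows that are not exactly 0/1; that choice and the function names are assumptions for illustration, not a description of PETsARD's internals.

```python
def fit_onehot_categories(values):
    """Fix the column order: one column per distinct category."""
    return sorted(set(values))


def encode_onehot(values, categories):
    """Expand each value into k binary indicators, one per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]


def decode_onehot(rows, categories):
    """Recombine indicator columns by taking the largest entry in each row."""
    return [
        categories[max(range(len(categories)), key=row.__getitem__)]
        for row in rows
    ]
```

A field with k categories becomes k columns, so a high-cardinality field such as native-country can widen the table considerably.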

Date Encoding (encoder_date)
- Converts datetime values into numerical format for synthesis
- Supports multiple output formats:
  - Date only: Basic date information
  - Datetime: Full date and time information
  - Datetime with timezone: Complete temporal information
- Provides special features:
  - Custom calendar support (e.g., Minguo calendar)
  - Flexible date parsing with or without format strings
  - Invalid date handling strategies
  - Timezone awareness
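
The features listed above can be illustrated with a small standard-library sketch. It assumes dates are encoded as seconds since the Unix epoch in UTC, that invalid dates are either turned into None or raised, and that Minguo years convert to Gregorian years by adding 1911. The function names and the `invalid` parameter are hypothetical and do not reflect the actual options of PETsARD's date encoder.

```python
from datetime import datetime, timezone


def parse_minguo_date(text):
    """Parse a Minguo (ROC) date such as '112-05-20' into a Gregorian datetime."""
    year, month, day = (int(part) for part in text.split("-"))
    return datetime(year + 1911, month, day, tzinfo=timezone.utc)


def encode_date(text, fmt="%Y-%m-%d", invalid="null"):
    """Convert a date string into seconds since the Unix epoch (UTC)."""
    try:
        parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    except ValueError:
        if invalid == "raise":
            raise
        return None  # treat unparseable or impossible dates as missing
    return parsed.timestamp()


def decode_date(seconds, fmt="%Y-%m-%d"):
    """Convert a numeric value back into a formatted date string."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(fmt)
```

Once a date is a single number, it can be synthesized like any other numerical field and formatted back afterward.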

You can use different encoding methods for different columns by specifying the appropriate configuration in your settings file.
