Data Preprocessing

Ensuring the quality of source data before synthesis is crucial. High-quality input data not only improves the synthesis results but also reduces potential technical issues during the synthesis process. PETsARD provides comprehensive data preprocessing tools to help you enhance data quality:

Important Note: CAPE’s default preprocessing pipeline performs missing value handling and outlier processing before encoding and scaling operations. Users are advised to modify this default processing order only for experimental purposes and when fully familiar with their technical process details and objectives. PETsARD does not guarantee the effectiveness of data preprocessing when the default order is altered.
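
The effect of the processing order can be illustrated with a minimal standard-library sketch. It does not use PETsARD's API: the stage functions, the fixed clipping bounds, and `run_pipeline` are hypothetical names chosen for illustration, and encoding is omitted because the sample column is numeric. It shows that scaling before outlier handling lets a single extreme value compress the rest of the range.

```python
from statistics import median

def fill_missing(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def clip_outliers(values, low=0.0, high=100.0):
    """Clamp values into [low, high]; the bounds are assumed for this example."""
    return [min(max(v, low), high) for v in values]

def scale_min_max(values):
    """Remap values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def run_pipeline(values, stages):
    """Apply each stage in the given order."""
    for stage in stages:
        values = stage(values)
    return values

raw = [10.0, None, 12.0, 500.0]

# Order described above: missing values, then outliers, then scaling.
default_order = run_pipeline(raw, [fill_missing, clip_outliers, scale_min_max])

# Altered order: scaling without prior outlier handling.
scaled_first = run_pipeline(raw, [fill_missing, scale_min_max])

print(default_order)
print(scaled_first)
```

In the altered order, the value 500 defines the top of the range, so the ordinary values end up much closer to zero than they do in the default order.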

Information Modification

Handling Missing Values

Handle missing and incomplete values in data
Ensure data completeness through deletion, statistical imputation, and custom imputation methods
Provide customized options for different data fields and types
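
As a rough illustration of the three approaches listed above (deletion, statistical imputation, custom imputation), the following sketch uses only the Python standard library. The functions `drop_missing` and `impute`, the strategy names, and the sample rows are assumptions made for this example; they are not PETsARD's implementation.

```python
from statistics import mean, median, mode

def drop_missing(rows, column):
    """Deletion: keep only rows where the column has a value."""
    return [r for r in rows if r[column] is not None]

def impute(rows, column, strategy="mean", fill_value=None):
    """Fill None in one column using a statistic or a custom value."""
    observed = [r[column] for r in rows if r[column] is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "mode":
        fill = mode(observed)
    elif strategy == "custom":
        fill = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

rows = [
    {"age": 30, "city": "Taipei"},
    {"age": None, "city": "Tainan"},
    {"age": 50, "city": None},
    {"age": 40, "city": "Taipei"},
]

# A different strategy per column, chosen to suit each data type.
strategies = {"age": "median", "city": "mode"}
cleaned = rows
for column, strategy in strategies.items():
    cleaned = impute(cleaned, column, strategy)

print(cleaned)
```

Numeric columns usually suit mean or median imputation, while categorical columns can only use the mode or a custom placeholder, which is why the strategy is selected per column here.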

Handling Outliers (WIP)

Identify and handle abnormal or extreme values
Prevent outliers from affecting model learning
Provide multiple outlier detection and processing strategies
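
Two common detection rules, the z-score rule and the interquartile range (IQR) rule, can be sketched with the standard library as follows. The function names, default thresholds, and sample data are assumptions for illustration and may differ from what PETsARD does internally.

```python
from statistics import mean, pstdev, quantiles

def zscore_outliers(values, threshold=3.0):
    """Flag values whose distance from the mean exceeds threshold standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [False] * len(values)
    return [abs(v - mu) / sigma > threshold for v in values]

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]

values = [10, 11, 12, 13, 12, 11, 10, 13, 12, 300]

# One possible processing strategy: drop the flagged values.
flags = iqr_outliers(values)
kept = [v for v, is_outlier in zip(values, flags) if not is_outlier]

print(flags)
print(kept)
```

In a sample this small, one extreme value inflates the standard deviation enough that the z-score of 300 stays just under 3, so the z-score rule needs a lower threshold (for example 2.5) to flag it. The IQR rule is based on quartiles and is less sensitive to this effect.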

Representation Transformation

Encoding Categorical Variables

Convert categorical data to numerical format
Support various encoding methods to preserve data characteristics
Ensure synthetic algorithms can effectively process all data types
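
To make the idea concrete, here is a minimal sketch of label encoding and one-hot encoding in plain Python. The function names are hypothetical and the sketch is not PETsARD's encoder implementation; it only shows what each representation looks like.

```python
def label_encode(values):
    """Map each category to an integer code, assigned in sorted order."""
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Map each category to a 0/1 vector with one position per category."""
    categories = sorted(set(values))
    vectors = [[1 if v == c else 0 for c in categories] for v in values]
    return vectors, categories

colors = ["red", "green", "blue", "green"]

codes, mapping = label_encode(colors)
vectors, categories = one_hot_encode(colors)

print(codes)    # [2, 1, 0, 1]
print(vectors)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]

# Keeping the mapping allows numeric output to be converted back to categories.
inverse = {code: category for category, code in mapping.items()}
decoded = [inverse[c] for c in codes]
```

Label encoding is compact but implies an ordering between categories, whereas one-hot encoding avoids that at the cost of one extra column per category.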

Discretizing Continuous Values (WIP)

Convert continuous values into discrete intervals
Reduce data complexity
Provide multiple grouping strategy options
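
Two typical grouping strategies are equal-width and equal-frequency binning. The sketch below implements both with the standard library; the function names and sample data are assumptions for illustration, not PETsARD's discretizer.

```python
def equal_width_bins(values, n_bins):
    """Split the value range into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bins by rank so each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins

ages = [18, 25, 33, 47, 52, 68]

print(equal_width_bins(ages, 5))      # [0, 0, 1, 2, 3, 4]
print(equal_frequency_bins(ages, 3))  # [0, 0, 1, 1, 2, 2]
```

Equal-width bins preserve the spacing of the original scale, while equal-frequency bins balance the number of records per group. Note that this rank-based version can place tied values in different bins.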

Scaling Numerical Features (WIP)

Unify value ranges across different columns
Improve model convergence performance
Support various standardization and normalization methods
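
As an illustration of what standardization and normalization do to a column, the following standard-library sketch implements standard scaling (zero mean, unit variance) and min-max scaling (range 0 to 1). The function names are hypothetical and the sketch is not PETsARD's scaler implementation.

```python
from statistics import mean, pstdev

def standard_scale(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]

def min_max_scale(values):
    """Remap values linearly so the minimum becomes 0 and the maximum becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20.0, 40.0, 60.0, 80.0, 100.0]

standardized = standard_scale(incomes)
normalized = min_max_scale(incomes)

print(standardized)
print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Both transformations are linear, so they can be inverted after synthesis as long as the original mean and standard deviation, or minimum and maximum, are stored.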

Appendix: Available Process Types

Following the CAPE team’s preprocessing taxonomy, PETsARD divides data preprocessing operations into two main types and supports both:

Information Modification enhances data quality by addressing data imperfections. This includes:
- Missing value handling: completing missing data points
- Outlier handling: smoothing data noise

Representation Transformation changes how data is represented while preserving the original information. This includes:
- Encoding: converting categorical data to a numerical representation
- Discretizing: converting continuous values to a discrete representation
- Scaling: remapping numerical ranges

The following table lists all preprocessing methods supported by PETsARD. You can learn how to use each method through the tutorial examples, or visit Processor for detailed technical implementation.

| Process type | Process method | Parameter |
| --- | --- | --- |
| Missing | `MissingMean` | 'missing_mean' |
| Missing | `MissingMedian` | 'missing_median' |
| Missing | `MissingMode` | 'missing_mode' |
| Missing | `MissingSimple` | 'missing_simple' |
| Missing | `MissingDrop` | 'missing_drop' |
| Outlier | `OutlierZScore` | 'outlier_zscore' |
| Outlier | `OutlierIQR` | 'outlier_iqr' |
| Outlier | `OutlierIsolationForest` | 'outlier_isolationforest' |
| Outlier | `OutlierLOF` | 'outlier_lof' |
| Encoding | `EncoderUniform` | 'encoder_uniform' |
| Encoding | `EncoderLabel` | 'encoder_label' |
| Encoding | `EncoderOneHot` | 'encoder_onehot' |
| Discretizing | `DiscretizingKBins` | 'discretizing_kbins' |
| Scaling | `ScalerStandard` | 'scaler_standard' |
| Scaling | `ScalerZeroCenter` | 'scaler_zerocenter' |
| Scaling | `ScalerMinMax` | 'scaler_minmax' |
| Scaling | `ScalerLog` | 'scaler_log' |
| Scaling | `ScalerTimeAnchor` | 'scaler_timeanchor' |
