Data Preprocessing

Ensuring the quality of source data before synthesis is crucial. High-quality input data not only improves the synthesis results but also reduces potential technical issues during the synthesis process. PETsARD provides comprehensive data preprocessing tools to help you enhance data quality:

Important Note: CAPE’s default preprocessing pipeline performs missing value handling and outlier processing before encoding and scaling operations. Users are advised to modify this default processing order only for experimental purposes and when fully familiar with their technical process details and objectives. PETsARD does not guarantee the effectiveness of data preprocessing when the default order is altered.
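To see why the processing order matters, here is a minimal plain-Python sketch on a toy numeric column. It is not PETsARD code: the function names, the median imputation, and the hand-picked clipping bounds are all assumptions made for illustration, and encoding is skipped because the column is numeric.

```python
from statistics import median

def fill_missing(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def clip_outliers(values, lower=0.0, upper=5.0):
    """Clip to hand-picked bounds (an assumption for this toy column)."""
    return [min(max(v, lower), upper) for v in values]

def min_max(values):
    """Rescale values linearly so the smallest is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def run(values, steps):
    """Apply each step in the order given."""
    for step in steps:
        values = step(values)
    return values

raw = [1.0, 2.0, None, 4.0, 100.0]

# Missing values, then outliers, then scaling (the order described above).
default_order = run(raw, [fill_missing, clip_outliers, min_max])

# Same steps, but scaling before outlier handling.
scaled_first = run(raw, [fill_missing, min_max, clip_outliers])

print(default_order)
print(scaled_first)
```

In the first run the ordinary values spread evenly over [0, 1]. In the second run the extreme value 100 sets the scaling range, so the four ordinary values are squeezed below 0.05 and the later clipping step no longer has any effect.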

Information Modification

Handling Missing Values

  • Handle missing and incomplete values in data
  • Ensure data completeness through deletion, statistical imputation, and custom imputation methods
  • Provide customized options for different data fields and types
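The strategies listed above can be sketched in plain Python as follows. This illustrates the general techniques only, not PETsARD's implementation: the `impute` helper, its strategy names, and the sample columns are hypothetical.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean", fill_value=None):
    """Handle None entries in one column using the chosen strategy."""
    observed = [v for v in values if v is not None]
    if strategy == "drop":
        # Deletion: keep only observed entries. In a real table the
        # whole row would be removed, not just this cell.
        return observed
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "mode":
        fill = mode(observed)
    elif strategy == "custom":
        fill = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values]

# A toy table stored as columns; the column names are made up.
table = {
    "age": [20, None, 30, 30, None, 60],
    "city": ["Taipei", None, "Tainan", "Taipei", "Taipei", None],
}

# Per-column choice: a numeric strategy for age, the mode for a category.
config = {"age": "median", "city": "mode"}
cleaned = {col: impute(vals, config[col]) for col, vals in table.items()}
print(cleaned)
```

Choosing the strategy per column matters because the mean and median are only defined for numeric fields, while the mode also works for categorical ones.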

Handling Outliers (WIP)

  • Identify and handle abnormal or extreme values
  • Prevent outliers from affecting model learning
  • Provide multiple outlier detection and processing strategies
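Two common detection rules, the z-score and the interquartile range (IQR), can be sketched in plain Python as below. The thresholds (3 standard deviations, 1.5 times the IQR) are conventional defaults chosen here as assumptions; the function names and the data are hypothetical and do not reflect PETsARD's internals.

```python
from statistics import mean, pstdev, quantiles

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [False] * len(values)
    return [abs(v - mu) / sigma > threshold for v in values]

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]

values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]

flags = iqr_outliers(values)
# One possible processing strategy: drop the flagged entries.
kept = [v for v, is_outlier in zip(values, flags) if not is_outlier]
print(kept)
```

Both rules flag only the value 95 in this sample. The z-score rule is sensitive to the outlier itself, since an extreme value inflates the standard deviation, which is one reason several detection strategies are useful.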

Representation Transformation

Encoding Categorical Variables

  • Convert categorical data to numerical format
  • Support various encoding methods to preserve data characteristics
  • Ensure synthetic algorithms can effectively process all data types
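As a rough illustration of what encoding does, the sketch below implements label encoding and one-hot encoding in plain Python. The helper names and the sample column are hypothetical, categories are ordered alphabetically as an arbitrary choice, and this is not how PETsARD implements its encoders.

```python
def label_encode(values):
    """Map each category to an integer code (alphabetical order)."""
    categories = sorted(set(values))
    mapping = {cat: idx for idx, cat in enumerate(categories)}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Map each category to a 0/1 indicator vector, one slot per category."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return rows, categories

colors = ["red", "green", "blue", "green"]

labels, mapping = label_encode(colors)
one_hot, categories = one_hot_encode(colors)
print(labels, mapping)
print(one_hot, categories)
```

Label encoding keeps one column but implies an ordering among categories, whereas one-hot encoding avoids that implied ordering at the cost of one column per category.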

Discretizing Continuous Values (WIP)

  • Convert continuous values into discrete intervals
  • Reduce data complexity
  • Provide multiple grouping strategy options
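Two grouping strategies, equal-width bins and equal-frequency (quantile) bins, can be sketched in plain Python as follows. The function names and sample data are hypothetical; the sketch shows the general idea of k-bins discretization rather than PETsARD's own code.

```python
from bisect import bisect_right
from statistics import quantiles

def kbins_uniform(values, n_bins=3):
    """Equal-width bins: split [min, max] into n_bins intervals of equal size."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum falls exactly on the top edge, so cap it at the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def kbins_quantile(values, n_bins=3):
    """Equal-frequency bins: cut points are quantiles of the data."""
    edges = quantiles(values, n=n_bins)
    return [bisect_right(edges, v) for v in values]

ages = [18, 25, 33, 47, 52, 78]

print(kbins_uniform(ages))
print(kbins_quantile(ages))
```

With this skewed sample, equal-width binning places three values in the first bin and only one in the last, while quantile binning places two values in each bin.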

Scaling Numerical Features (WIP)

  • Unify value ranges across different columns
  • Improve model convergence performance
  • Support various standardization and normalization methods
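The common scaling formulas can be written out in a few lines of plain Python. The function names are hypothetical and the sketch only illustrates the arithmetic; it makes no claim about how PETsARD's scalers are implemented. The log transform shown here assumes strictly positive inputs.

```python
import math
from statistics import mean, pstdev

def standard_scale(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def zero_center(values):
    """Subtract the mean only, leaving the spread unchanged."""
    mu = mean(values)
    return [v - mu for v in values]

def min_max_scale(values):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def log_scale(values):
    """Natural logarithm; only defined for strictly positive values."""
    return [math.log(v) for v in values]

values = [2.0, 4.0, 6.0, 8.0]

print(standard_scale(values))
print(zero_center(values))
print(min_max_scale(values))
print(log_scale(values))
```

After standard scaling the column has mean 0 and standard deviation 1, whereas min-max scaling fixes the range to [0, 1], so columns with very different units become comparable.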

Appendix: Available Process Types

Following the CAPE team’s preprocessing taxonomy, PETsARD divides data preprocessing operations into two main types and supports both:

  • Information Modification enhances data quality by addressing data imperfections. This includes:

    • Missing handling: completing missing data points
    • Outlier handling: smoothing data noise
  • Representation Transformation changes how data is represented while preserving the original information. This includes:

    • Encoding: converting categorical data to numerical representation
    • Discretizing: converting continuous values to a discrete representation
    • Scaling: remapping numerical ranges

The following table lists all preprocessing methods supported by PETsARD. You can learn how to use each method through the tutorial examples, or see the Processor documentation for implementation details.

| Process type | Process method         | Parameters                |
|--------------|------------------------|---------------------------|
| Missing      | MissingMean            | 'missing_mean'            |
| Missing      | MissingMedian          | 'missing_median'          |
| Missing      | MissingMode            | 'missing_mode'            |
| Missing      | MissingSimple          | 'missing_simple'          |
| Missing      | MissingDrop            | 'missing_drop'            |
| Outlier      | OutlierZScore          | 'outlier_zscore'          |
| Outlier      | OutlierIQR             | 'outlier_iqr'             |
| Outlier      | OutlierIsolationForest | 'outlier_isolationforest' |
| Outlier      | OutlierLOF             | 'outlier_lof'             |
| Encoding     | EncoderUniform         | 'encoder_uniform'         |
| Encoding     | EncoderLabel           | 'encoder_label'           |
| Encoding     | EncoderOneHot          | 'encoder_onehot'          |
| Discretizing | DiscretizingKBins      | 'discretizing_kbins'      |
| Scaling      | ScalerStandard         | 'scaler_standard'         |
| Scaling      | ScalerZeroCenter       | 'scaler_zerocenter'       |
| Scaling      | ScalerMinMax           | 'scaler_minmax'           |
| Scaling      | ScalerLog              | 'scaler_log'              |
| Scaling      | ScalerTimeAnchor       | 'scaler_timeanchor'       |
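When writing your own configuration, it can help to check parameter names against the table before running anything. The sketch below copies the parameter strings from the table into a lookup; the `process_type_of` and `check_config` helpers and the nested configuration layout are hypothetical, written for illustration, and are not part of PETsARD.

```python
# Parameter strings copied from the table above, grouped by process type.
PROCESS_METHODS = {
    "Missing": ["missing_mean", "missing_median", "missing_mode",
                "missing_simple", "missing_drop"],
    "Outlier": ["outlier_zscore", "outlier_iqr",
                "outlier_isolationforest", "outlier_lof"],
    "Encoding": ["encoder_uniform", "encoder_label", "encoder_onehot"],
    "Discretizing": ["discretizing_kbins"],
    "Scaling": ["scaler_standard", "scaler_zerocenter", "scaler_minmax",
                "scaler_log", "scaler_timeanchor"],
}

def process_type_of(parameter):
    """Return the process type that a parameter string belongs to."""
    for process_type, parameters in PROCESS_METHODS.items():
        if parameter in parameters:
            return process_type
    raise ValueError(f"unknown preprocessing parameter: {parameter!r}")

def check_config(config):
    """Validate a hypothetical {process type: {column: parameter}} mapping."""
    for process_type, columns in config.items():
        for column, parameter in columns.items():
            actual = process_type_of(parameter)
            if actual != process_type:
                raise ValueError(
                    f"{parameter!r} on column {column!r} is a {actual} "
                    f"method, not a {process_type} method"
                )
    return True

# Column names here are made up for the example.
config = {
    "Missing": {"age": "missing_median"},
    "Scaling": {"age": "scaler_minmax"},
}
print(check_config(config))
```

A check like this catches typos such as 'scaler_min_max' and methods filed under the wrong process type before any data is touched.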