Processor
Processor(
    metadata,
    config=None
)
Create a data processor to manage data preprocessing and postprocessing workflows.
Parameters
metadata
(Metadata): Data schema object providing column-level metadata and type information
- Required
config
(dict, optional): Custom data processing configuration
- Default: None
- Used to override default processing procedures
- Structure:
{processing type: {column name: processing method}}
Examples
from petsard import Processor
# Basic usage
proc = Processor(metadata=split.metadata)
# Using custom configuration
custom_config = {
    'missing': {'age': 'missing_mean'},
    'outlier': {'income': 'outlier_iqr'}
}
proc = Processor(metadata=split.metadata, config=custom_config)
# Data Transformation
proc.fit(data=load.data)
transformed_data = proc.transform(data=load.data)
# Restore to original type/format
inverse_transformed_data = proc.inverse_transform(data=synthetic_data)
Methods
get_config()
proc.get_config(
    col=None,
    print_config=False
)
Retrieve the current processing procedure configuration.
Parameters
col
(list, optional): Column names to retrieve configuration for
- Default: None, retrieves configuration for all columns
print_config
(bool, optional): Whether to print the configuration
- Default: False
Returns
- (dict): Dictionary containing processing procedure configurations
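Examples
A minimal usage sketch; the column name 'age' is hypothetical:
# Retrieve the full configuration as a dict
full_config = proc.get_config()
# Retrieve and print the configuration for a single column
age_config = proc.get_config(col=['age'], print_config=True)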
update_config()
proc.update_config(config)
Update the processor’s configuration.
Parameters
config
(dict): New processing procedure configuration
Returns
None
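Examples
A sketch reusing the column names from the configuration examples on this page; re-fitting afterwards is assumed to be needed for the change to take effect:
# Switch the missing-value handler for 'age', then re-fit
proc.update_config(config={'missing': {'age': 'missing_median'}})
proc.fit(data=load.data)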
get_changes()
Compare current configuration with default configuration.
Parameters
None
Returns
- (pandas.DataFrame): Table recording configuration differences
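Examples
A minimal sketch:
# Inspect which processing procedures differ from the defaults
changes = proc.get_changes()
print(changes)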
fit()
proc.fit(
    data,
    sequence=None
)
Learn data structure and prepare transformation workflow.
Parameters
data
(pandas.DataFrame): Dataset used for learning
sequence
(list, optional): Custom processing flow order
- Default: None
- Available values: 'missing', 'outlier', 'encoder', 'scaler', 'discretizing'
Returns
None
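Examples
A sketch showing a custom processing order; the chosen sequence is only an illustration of the available values listed above:
# Fit with a custom flow that skips discretizing
proc.fit(data=load.data, sequence=['missing', 'outlier', 'encoder', 'scaler'])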
transform()
proc.transform(data)
Perform data preprocessing transformation.
Parameters
data
(pandas.DataFrame): Dataset to be transformed
Returns
- (pandas.DataFrame): Transformed data
inverse_transform()
proc.inverse_transform(data)
Perform data postprocessing inverse transformation.
Parameters
data
(pandas.DataFrame): Dataset to be inverse transformed
Returns
- (pandas.DataFrame): Inverse transformed data
Appx.: Available Processing Types
Default Processing Methods
This mapping defines the default processing method for each data type. Numerical types use mean imputation, IQR-based outlier detection, standard scaling, and K-bins discretization; categorical types drop rows with missing values, use uniform encoding, and use label encoding for discretizing.
PROCESSOR_MAP: dict[str, dict[str, str]] = {
"missing": {
"numerical": MissingMean,
"categorical": MissingDrop,
"datetime": MissingDrop,
"object": MissingDrop,
},
"outlier": {
"numerical": OutlierIQR,
"categorical": lambda: None,
"datetime": OutlierIQR,
"object": lambda: None,
},
"encoder": {
"numerical": lambda: None,
"categorical": EncoderUniform,
"datetime": lambda: None,
"object": EncoderUniform,
},
"scaler": {
"numerical": ScalerStandard,
"categorical": lambda: None,
"datetime": ScalerStandard,
"object": lambda: None,
},
"discretizing": {
"numerical": DiscretizingKBins,
"categorical": EncoderLabel,
"datetime": DiscretizingKBins,
"object": EncoderLabel,
},
}
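The sketch below illustrates how these defaults resolve per column; the column names, inferred types, and method-name strings (which follow the config examples below) are assumptions for illustration only.
# Hypothetical columns and their inferred types
inferred_types = {'age': 'numerical', 'gender': 'categorical', 'signup': 'datetime'}

# String equivalents of PROCESSOR_MAP above
default_methods = {
    'missing': {'numerical': 'missing_mean', 'categorical': 'missing_drop', 'datetime': 'missing_drop'},
    'outlier': {'numerical': 'outlier_iqr', 'categorical': None, 'datetime': 'outlier_iqr'},
    'encoder': {'numerical': None, 'categorical': 'encoder_uniform', 'datetime': None},
    'scaler': {'numerical': 'scaler_standard', 'categorical': None, 'datetime': 'scaler_standard'},
    'discretizing': {'numerical': 'discretizing_kbins', 'categorical': 'encoder_label', 'datetime': 'discretizing_kbins'},
}

# Print the default method each hypothetical column would receive
for proc_type, by_type in default_methods.items():
    for col, col_type in inferred_types.items():
        print(f"{proc_type:>12} | {col:<7} -> {by_type[col_type]}")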
Config Setting
Format
config = {
processor-type: {
colname: processor-method
}
}
Examples
This configuration customizes the processing methods per column: the age column uses mean imputation for missing values, Z-score outlier detection, min-max scaling, and K-bins discretization; the salary column uses median imputation and standard scaling; the income column uses the interquartile range for outliers; the gender column is one-hot encoded; and the city column is label encoded.
config = {
'missing': {
'age': 'missing_mean',
'salary': 'missing_median'
},
'outlier': {
'income': 'outlier_iqr',
'age': 'outlier_zscore'
},
'encoder': {
'gender': 'encoder_onehot',
'city': 'encoder_label'
},
'scaler': {
'salary': 'scaler_standard',
'age': 'scaler_minmax'
},
'discretizing': {
'age': 'discretizing_kbins'
}
}
Missing
MissingMean
Missing values are filled with the mean value of the corresponding column.
MissingMedian
Missing values are filled with the median value of the corresponding column.
MissingMode
Missing values are filled with the mode value of the corresponding column. If there are multiple modes, it will randomly fill in one of them.
MissingSimple
Missing values are filled with a predefined value for the corresponding column.
Parameters
value
(float, default=0.0): The value to be imputed.
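Examples
A config sketch for passing the value parameter, assuming the same dict-with-method pattern shown for EncoderDate below; the column name and fill value are hypothetical:
config = {
    'missing': {
        # Fill missing values in a hypothetical 'balance' column with 0.0
        'balance': {
            'method': 'missing_simple',
            'value': 0.0
        }
    }
}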
MissingDrop
This method involves dropping the rows containing missing values in any column.
Outlier
OutlierZScore
This method classifies data as outliers if the absolute value of the z-score is greater than 3.
OutlierIQR
Data outside the range of 1.5 times the interquartile range (IQR) is determined as an outlier.
OutlierIsolationForest
This method uses IsolationForest from sklearn to identify outliers. It is a global transformation, meaning that if any column uses the isolation forest as an outlierist, it will overwrite the entire config and apply the isolation forest to all outlierists.
OutlierLOF
This method uses LocalOutlierFactor from sklearn to identify outliers. It is a global transformation, meaning that if any column uses LOF as an outlierist, it will overwrite the entire config and apply LOF to all outlierists.
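Examples
A config sketch illustrating the global behaviour described above; the column names are hypothetical and the isolation-forest method string is assumed to follow the naming pattern of the config examples on this page:
config = {
    'outlier': {
        # Because 'income' uses the isolation forest, that outlierist is
        # applied to every column, overriding the setting on 'age'
        'income': 'outlier_isolationforest',
        'age': 'outlier_zscore'
    }
}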
Encoding
EncoderUniform
Mapping each category to a specific range within a uniform distribution, with the range size determined by the frequency of the category in the data.
EncoderLabel
Transform categorical data into numerical data by assigning a series of integers (1, 2, 3,…) to the categories.
EncoderOneHot
Transform categorical data into one-hot numeric data.
EncoderDate
Transform non-standard date-time data into datetime format with flexible handling of various date formats, including custom calendars like Taiwan’s Minguo calendar.
Parameters
input_format
(str, optional): Format string for parsing dates
- Default: None (uses fuzzy parsing)
- Example: "%Y-%m-%d" or "%MinguoY-%m-%d"
date_type
(str, default="datetime"): Output type for transformed dates
- "date": Date only (no time component)
- "datetime": Date and time
- "datetime_tz": Date and time with timezone
tz
(str, optional): Timezone for output dates
- Default: None
- Example: "Asia/Taipei"
numeric_convert
(bool, default=False): Whether to attempt converting numeric timestamps
invalid_handling
(str, default="error"): How to handle invalid dates
- "error": Raise an error
- "erase": Replace with None
- "replace": Use replacement rules
invalid_rules
(list[dict[str, str]], optional): Rules for replacing invalid dates
- Default: None
Examples
# Basic usage with standard dates
config = {
'encoder': {
'created_at': 'encoder_date'
}
}
# Using Minguo calendar format
config = {
'encoder': {
'doc_date': {
'method': 'encoder_date',
'input_format': '%MinguoY-%m-%d'
}
}
}
# With timezone and invalid handling
config = {
'encoder': {
'event_time': {
'method': 'encoder_date',
'date_type': 'datetime_tz',
'tz': 'Asia/Taipei',
'invalid_handling': 'erase'
}
}
}
Scaling
ScalerStandard
Utilising StandardScaler from the sklearn library, this method transforms the data to have a mean of 0 and a standard deviation of 1.
ScalerZeroCenter
Utilising StandardScaler from sklearn, this method centres the transformed data around a mean of 0.
ScalerMinMax
By applying MinMaxScaler from sklearn, this method scales the data to fit within the range [0, 1].
ScalerLog
This method requires the input data to be positive. It applies log transformation to mitigate the impact of extreme values.
ScalerTimeAnchor
This method scales datetime data by calculating time differences from a reference time series. It provides two modes of scaling, depending on the unit in which the difference is expressed: days or seconds.
Parameters
reference
(str): The name of the reference column used for time difference calculation. Must be a datetime column.
unit
(str, default='D'): The unit of time difference calculation
- 'D': Days (default)
- 'S': Seconds
Examples
scaler:
  create_time:
    method: 'scaler_timeanchor'
    reference: 'event_time'
    unit: 'D'
Discretizing
DiscretizingKBins
Discretize continuous data into k bins (k intervals).
Parameters
n_bins
(int, default=5): The value k, the number of bins.
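Examples
A config sketch for overriding n_bins, assuming the same dict-with-method pattern shown for EncoderDate above; the column name is hypothetical:
config = {
    'discretizing': {
        # Discretize a hypothetical 'age' column into 10 bins
        'age': {
            'method': 'discretizing_kbins',
            'n_bins': 10
        }
    }
}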