Categorical variables

Case Background

A public university seeks to share student academic and enrollment records with campus researchers. These records contain sensitive personal data (socioeconomic status, ethnicity, disability status), previously accessible only to designated teams in controlled environments.

Recent efforts using aggregated data with differential privacy improved access but reduced analytical precision. The Academic Affairs Office is now partnering with the Information Science Department to develop synthetic data solutions that maintain individual-level granularity and analytical accuracy while protecting privacy, aiming to enhance research capabilities and enable future cross-institutional collaboration.

Data Characteristics and Challenges

High-Cardinality Categorical Variables: The diversity of student identity categories, academic departments, and admission programs means that many data fields contain a large number of unique values.

A categorical variable is a variable whose values represent classifications rather than measurements, so each value assigns a record to a category or group. These values are typically discrete, non-numeric labels such as gender (male, female), blood type (A, B, AB, O), city names, or education levels. A categorical variable can be nominal (unordered categories, such as colors) or ordinal (categories with a natural order, such as education levels).

Most synthetic data models, as well as most statistical and machine learning algorithms, accept only numeric inputs. Encoding converts nominal or ordinal categorical variables into numbers so that these models can process them.
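As a minimal illustration (not PETsARD's implementation), the sketch below applies two of the simplest encodings, label encoding and one-hot encoding, to the blood type example above. The function names are hypothetical helpers written only for this illustration.

```python
def label_encode(values):
    """Map each distinct category to an integer, using sorted order."""
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Map each value to a 0/1 vector with one position per category."""
    categories = sorted(set(values))
    vectors = [[1 if v == cat else 0 for cat in categories] for v in values]
    return vectors, categories


blood_types = ["A", "O", "B", "AB", "O"]

labels, mapping = label_encode(blood_types)
vectors, categories = one_hot_encode(blood_types)

print(labels)      # [0, 3, 2, 1, 3]
print(vectors[0])  # [1, 0, 0, 0]
```

Label encoding keeps a single column but implies an order (A < AB < B < O) that a nominal variable does not have. One-hot encoding avoids the implied order at the cost of one column per category.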

Data Table and Business Context

The data structure in this case reflects the university student recruitment and enrollment management process, primarily consisting of one core data table:

Student Basic Information Table:
- Contains diverse information including student personal details (date of birth, zodiac sign, gender), academic background (department code, department name), admission channels (admission type code, admission type), and identity characteristics (disability status, nationality, identity category)
- Each record represents an individual student
- Data fields include a mix of numerical values (such as date of birth) and categorical types (such as zodiac sign, department, admission method)

Notably, this data table contains many high-cardinality categorical variables (such as department codes and names), as well as privacy-sensitive personal information (such as birth dates and identity categories). These characteristics require special attention during data synthesis to protect personal privacy while preserving relationships between data and statistical features, ensuring that synthetic data can support practical needs for educational research and decision analysis.

Demonstration Data

To protect privacy, the example below uses simulated data to demonstrate the data structure and business logic. Although the data are simulated, they retain the key characteristics and business constraints of the original data:

birth_year	birth_month	birth_day	zodiac	university_code	university	college_code	college	department_code	department_name	admission_type_code	admission_type	disabled_type	nationality	identity_code	identity	sex
2005	4	9	牡羊座	002	國立政治大學	700	理學院	702	心理學系	01	學士班 (指考分發/聯考)	無身心障礙	中華民國	1	一般生	女
2003	1	16	摩羯座	001	國立臺灣大學	2000	理學院	2080	地理環境資源學系	46	繁星推薦	無身心障礙	中華民國	1	一般生	女
2002	11	7	天蠍座	001	國立臺灣大學	1000	文學院	1070	日本語文學系	51	學士班申請入學	無身心障礙	中華民國	1	一般生	男
2002	12	24	摩羯座	001	國立臺灣大學	9000	電機資訊學院	9020	資訊工程學系	46	繁星推薦	無身心障礙	中華民國	1	一般生	男
2000	10	7	天秤座	001	國立臺灣大學	1000	文學院	1040	哲學系	51	學士班申請入學	無身心障礙	中華民國	1	一般生	女
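A quick way to gauge cardinality is to count the distinct values in each column. The sketch below does this for three columns of the simulated rows above. `cardinality_report` is a hypothetical helper, and in practice the same check would run on the full table rather than five rows.

```python
from collections import Counter

# Three columns copied from the five simulated rows above.
rows = [
    {"department_code": "702", "admission_type_code": "01", "identity_code": "1"},
    {"department_code": "2080", "admission_type_code": "46", "identity_code": "1"},
    {"department_code": "1070", "admission_type_code": "51", "identity_code": "1"},
    {"department_code": "9020", "admission_type_code": "46", "identity_code": "1"},
    {"department_code": "1040", "admission_type_code": "51", "identity_code": "1"},
]


def cardinality_report(records):
    """Return {column: (unique_count, unique_ratio)} for a list of dict records."""
    report = {}
    for column in records[0]:
        counts = Counter(record[column] for record in records)
        report[column] = (len(counts), len(counts) / len(records))
    return report


report = cardinality_report(rows)
for column, (n_unique, ratio) in report.items():
    print(f"{column}: {n_unique} unique values ({ratio:.0%} of rows)")
```

A column whose unique ratio stays close to 100% as the table grows, such as `department_code` here, is a candidate for the high-cardinality handling discussed later.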

Encoding Categorical Variables

Different encoding methods are suitable for different categorical variable characteristics and contexts. Our team has compiled a comparison of several mainstream encoding methods as follows:

Characteristic/Evaluation Criteria	Label Encoding	Target Encoding	Uniform Encoding	One-hot Encoding	Feature Hashing	Entity Embedding
Preserves Non-ordinality	❌	✅	✅	✅	✅	✅
Dimensionality Handling	✅	✅	✅	❌	✅	✅
Rare Category Handling	✅	⚠️	❌	✅	✅	❌
New Category Handling	❌	❌	❌	❌	✅	❌
Captures Latent Semantic Relations	❌	⚠️	❌	❌	❌	✅
Data Leakage Risk	✅	❌	✅	✅	✅	✅
Computational Efficiency	✅	✅	✅	❌	✅	❌
Memory Requirements	✅	✅	✅	❌	✅	⚠️
Manual Adjustment Needs	✅	⚠️	✅	✅	⚠️	❌
Interpretability	⚠️	⚠️	⚠️	✅	❌	❌
High Cardinality Handling	❌	✅	⚠️	❌	✅	✅

Symbol Legend:
✅ Advantage/Supported
⚠️ Moderate/Requires trade-off
❌ Disadvantage/Not supported
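To make the "New Category Handling" row concrete, here is a minimal sketch of the idea behind feature hashing, under the assumption of a fixed number of buckets. `hash_bucket` and `n_buckets` are hypothetical names for this illustration, not part of PETsARD or any specific library.

```python
import hashlib


def hash_bucket(category, n_buckets=8):
    """Map a category label to one of n_buckets slots via a stable hash."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


seen = ["702", "2080", "1070"]
unseen = "9999"  # a department code that never appeared during fitting

seen_buckets = [hash_bucket(code) for code in seen]
unseen_bucket = hash_bucket(unseen)  # still maps to a valid bucket
```

Because no lookup table is fitted, a category that was never seen still receives a valid bucket. The price is that distinct categories can collide in the same bucket, which is why the table marks interpretability as a disadvantage.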

Research into more universal and robust encoding strategies is ongoing. Our team recommends starting with Uniform Encoding for all categorical variables, as recommended by SDV ¹. For high-cardinality categorical variables (fields with many unique values), every current mainstream encoding method has limitations. We recommend complementing Uniform Encoding with the approach described in the next article, High-Cardinality attributes - Constraints.

Uniform Encoding

Uniform Encoding is a categorical variable processing method proposed by DataCebo, specifically designed to enhance the performance of generative models ². Its core concept is to map discrete categorical values onto the continuous [0,1] interval, where the width of each category’s sub-interval is determined by its frequency in the original data. This transforms categorical information into continuous values while preserving the statistical characteristics of the category distribution.

In practice, if a categorical variable contains three categories ‘a’, ‘b’, and ‘c’ with occurrence ratios of 1:3:1, the encoding will map ‘a’ to the [0.0, 0.2) interval, ‘b’ to the [0.2, 0.8) interval, and ‘c’ to the [0.8, 1.0] interval, randomly selecting values within each respective interval. During restoration, the original category is determined based on which interval the numerical value falls into. This bidirectional transformation mechanism ensures data integrity throughout the modeling and restoration process.
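The mapping described above can be sketched in a few lines of Python. This is an illustrative sketch of the idea, not the implementation used by SDV/RDT or PETsARD. The function names are hypothetical, and sorting categories alphabetically is an assumption made to keep the example deterministic.

```python
import random
from collections import Counter


def fit_intervals(values):
    """Give each category a sub-interval of [0, 1] sized by its frequency."""
    counts = Counter(values)
    total = len(values)
    intervals, start, running = {}, 0.0, 0
    for category in sorted(counts):  # assumption: alphabetical category order
        running += counts[category]
        end = running / total
        intervals[category] = (start, end)
        start = end
    return intervals


def encode(values, intervals, rng):
    """Replace each category with a random draw from its interval."""
    encoded = []
    for v in values:
        low, high = intervals[v]
        encoded.append(low + (high - low) * rng.random())
    return encoded


def decode(numbers, intervals):
    """Recover the category whose interval contains each number."""
    ordered = sorted(intervals.items(), key=lambda item: item[1][0])
    decoded = []
    for x in numbers:
        x = min(max(x, 0.0), 1.0)  # clamp values that drift out of range
        chosen = ordered[-1][0]    # last category also covers x == 1.0
        for category, (low, high) in ordered:
            if low <= x < high:
                chosen = category
                break
        decoded.append(chosen)
    return decoded


values = ["a", "b", "b", "b", "c"]  # ratio 1:3:1
intervals = fit_intervals(values)
print(intervals)  # {'a': (0.0, 0.2), 'b': (0.2, 0.8), 'c': (0.8, 1.0)}

encoded = encode(values, intervals, random.Random(0))
restored = decode(encoded, intervals)
```

A generative model would be trained on the encoded column, and `decode` would then be applied to the synthetic numbers it produces. The clamp in `decode` is a design choice for this sketch so that slightly out-of-range synthetic values still map to a category.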

Uniform Encoding addresses several data processing challenges at once: it turns a discrete distribution into a continuous one that is easier to model, it provides a fixed value range that makes restoration straightforward, and it preserves the original distribution so that common categories are sampled with higher probability. The method is particularly suitable for features with relatively few categories and for imbalanced category distributions. It can be combined with other preprocessing methods and performs well in generative model applications.

Full Demonstration

PETsARD uses Uniform Encoding as the default encoding method for categorical variables. The settings below are only meant to skip outlier processing (‘outlier’) in the default workflow, which would remove too many records in this simulated dataset. In normal use, running the Preprocessor in its default mode is enough, and it will automatically apply Uniform Encoding to your categorical variables.

Preprocessor:
  encoding_uniform:
    sequence:
      - 'encoder'
    encoder:
      birth_year: 'encoding_uniform'
      birth_month: 'encoding_uniform'
      birth_day: 'encoding_uniform'
      zodiac: 'encoding_uniform'
      department_code: 'encoding_uniform'
      department_name: 'encoding_uniform'
      admission_type_code: 'encoding_uniform'
      admission_type: 'encoding_uniform'
      disabled_code: 'encoding_uniform'
      disabled_type: 'encoding_uniform'
      nationality_code: 'encoding_uniform'
      nationality: 'encoding_uniform'
      identity_code: 'encoding_uniform'
      identity: 'encoding_uniform'
      sex: 'encoding_uniform'

The results are as follows:

birth_year	birth_month	birth_day	zodiac	university_code	university	college_code	college	department_code	department_name	admission_type_code	admission_type	disabled_code	disabled_type	nationality_code	nationality	identity_code	identity	sex
2004	6	10	天蠍座	001	國立臺灣大學	9000	電機資訊學院	101	電機工程學系	51	學士班申請入學	1	有身心障礙	nan	中華民國	12	一般生	男
2001	12	7	天蠍座	001	國立政治大學	100	資訊學院	117	中國文學系	51	繁星推薦	1	有身心障礙	nan	中華民國	21	原住民 (泰雅族)	男
2004	2	23	處女座	002	國立政治大學	2000	理學院	101	大氣科學系	51	繁星推薦	1	有身心障礙	nan	中華民國	4	一般生	男
2004	7	29	摩羯座	001	國立臺灣大學	2000	理學院	474	電機工程學系	51	繁星推薦	1	有身心障礙	nan	中華民國	33	原住民 (泰雅族)	男
2003	4	19	牡羊座	002	國立政治大學	1000	電機資訊學院	101	應用數學系	51	學士班申請入學	1	有身心障礙	nan	中華民國	2	一般生	女

Should Time Be Treated as a Categorical Variable?

When time is stored as separate year, month, and day fields, treating them as categorical variables is recommended. These elements have finite ranges and specific semantics that align with categorical data characteristics. Time components like months have cyclical properties, and treating them as continuous values can lead to inappropriate interpolation during synthesis, especially with floating-point calculations.

If time is stored as datetime type, treating it as numerical would better preserve periodicity. While merging separate date elements for processing and splitting them afterward is technically possible, PETsARD doesn’t support this. Users needing this functionality should handle time format conversion during preprocessing to achieve better time series modeling.
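As a rough sketch of the user-side conversion mentioned above, the separate year, month, and day fields can be merged into a single number before synthesis and split back afterward. This uses only the Python standard library and happens outside PETsARD. `merge_date` and `split_date` are hypothetical helper names, and representing a date as a day ordinal is one possible choice among several.

```python
from datetime import date


def merge_date(year, month, day):
    """Combine separate fields into one integer (days since 0001-01-01)."""
    return date(year, month, day).toordinal()


def split_date(ordinal):
    """Split a possibly fractional synthetic ordinal into (year, month, day)."""
    restored = date.fromordinal(int(round(ordinal)))
    return restored.year, restored.month, restored.day


records = [(2005, 4, 9), (2003, 1, 16), (2002, 11, 7)]

ordinals = [merge_date(*record) for record in records]  # numeric column for synthesis
restored = [split_date(value) for value in ordinals]    # back to separate fields
```

Rounding inside `split_date` means that any numeric value a synthesizer produces within the supported date range maps to a valid calendar date, which avoids impossible combinations such as February 30.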

References

¹ Synthetic Data Vault. (n.d.). UniformEncoder. In RDT Transformers Glossary. Retrieved 2025, from https://docs.sdv.dev/rdt/transformers-glossary/categorical/uniformencoder
² Patki, N., & Palazzo, R. (2023, August 28). Improving synthetic data up to +40% (without building new ML models). DataCebo. https://datacebo.com/blog/improvement-uniform-encoder/