Benchmark Dataset Maintenance

This document explains how to maintain and extend PETsARD’s benchmark dataset functionality. It is intended primarily for developers and gives guidelines for adding or modifying benchmark datasets.

Core Concepts

The benchmark dataset system design focuses on:

- Dataset documentation maintenance
- Download and verification mechanisms
- Cache management functionality

Dataset Documentation

Basic Information Recording

Document the following basic information for each dataset:

- Name: Dataset name
- Filename: Filename used in the system
- Access: Public/private access permission
- Columns: Number of data columns
- Rows: Number of data rows
- File Size: File storage size
- License: Usage license type
- Hash: First seven characters of SHA-256 checksum
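One way to keep these fields consistent across entries is to hold each one in a small record type. The sketch below is illustrative only: `DatasetRecord`, its field names, and the example values are hypothetical and are not part of PETsARD.

```python
from dataclasses import dataclass

HEX_DIGITS = set("0123456789abcdef")


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical container for the basic information listed above."""

    name: str           # Dataset name
    filename: str       # Filename used in the system
    access: str         # "public" or "private"
    columns: int        # Number of data columns
    rows: int           # Number of data rows
    file_size: str      # File storage size, e.g. "50 KB"
    license_type: str   # Usage license type
    sha256_prefix: str  # First seven characters of the SHA-256 checksum

    def __post_init__(self) -> None:
        # Reject entries whose access value is not one of the two documented options.
        if self.access not in ("public", "private"):
            raise ValueError("access must be 'public' or 'private'")
        # Reject entries whose hash is not a seven-character hexadecimal prefix.
        prefix = self.sha256_prefix.lower()
        if len(prefix) != 7 or not set(prefix) <= HEX_DIGITS:
            raise ValueError("hash must be seven hexadecimal characters")


# Made-up entry, for illustration only.
example = DatasetRecord(
    name="Example Dataset",
    filename="example.csv",
    access="public",
    columns=10,
    rows=1000,
    file_size="50 KB",
    license_type="CC BY 4.0",
    sha256_prefix="0a1b2c3",
)
```

Validating the hash and access fields at construction time catches malformed entries when the dataset list is edited, rather than later at download time.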

Feature Information Recording

Record the feature information of datasets:

- Too Few Samples: Whether the dataset has fewer than 5,000 records
- Categorical-dominant: Whether categorical columns make up more than 75% of all columns
- Numerical-dominant: Whether numerical columns make up more than 75% of all columns
- Non-dominant: Whether categorical and numerical columns are balanced (neither type is dominant)
- Extreme Values: Number of columns with extreme values
- High Cardinality: Number of categorical columns with high cardinality
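The sample-size and dominance flags follow directly from the row count and the column types, so they can be derived rather than filled in by hand. The following is a minimal sketch under stated assumptions: `describe_features` and its inputs are hypothetical names, and "balanced" is taken to mean that neither column type passes the 75% threshold. Extreme values and high cardinality are left out because their cutoffs are not specified here.

```python
from typing import Dict, List

SAMPLE_THRESHOLD = 5000     # "Too Few Samples" cutoff from the list above
DOMINANCE_THRESHOLD = 0.75  # share of columns a type must exceed to be dominant


def describe_features(n_rows: int, column_kinds: List[str]) -> Dict[str, bool]:
    """Derive the sample-size and dominance flags for one dataset.

    column_kinds holds one entry per column: "categorical" or "numerical".
    """
    if not column_kinds:
        raise ValueError("column_kinds must not be empty")
    total = len(column_kinds)
    categorical_share = column_kinds.count("categorical") / total
    numerical_share = column_kinds.count("numerical") / total
    categorical_dominant = categorical_share > DOMINANCE_THRESHOLD
    numerical_dominant = numerical_share > DOMINANCE_THRESHOLD
    return {
        "too_few_samples": n_rows < SAMPLE_THRESHOLD,
        "categorical_dominant": categorical_dominant,
        "numerical_dominant": numerical_dominant,
        # Assumption: non-dominant means neither type exceeds the threshold.
        "non_dominant": not (categorical_dominant or numerical_dominant),
    }


# 8 of 10 columns are categorical (80%), and 3000 rows is below the cutoff.
flags = describe_features(3000, ["categorical"] * 8 + ["numerical"] * 2)
```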

Verification Mechanism

SHA-256 Verification Process

Benchmark datasets use SHA-256 for file integrity verification:

Verification Tool

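from petsard.util import digest_sha256

# filepath is the path to the local dataset file
hasher = digest_sha256(filepath)
# Full hexadecimal digest; only the first seven characters are compared
hash_value = hasher.hexdigest()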

Verification Comparison
- Compare the first seven characters
- Issue warning on verification failure
- Ensure dataset integrity
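The comparison steps above can be sketched with the standard library alone. This example uses `hashlib` directly rather than PETsARD's helper so that it runs on its own; `sha256_prefix` and `verify_dataset` are hypothetical names, and the sample file is a throwaway.

```python
import hashlib
import tempfile
import warnings
from pathlib import Path

PREFIX_LENGTH = 7  # the documentation records the first seven characters


def sha256_prefix(filepath: Path, length: int = PREFIX_LENGTH) -> str:
    """Return the first `length` hex characters of the file's SHA-256 digest."""
    hasher = hashlib.sha256()
    with open(filepath, "rb") as handle:
        # Read in chunks so large datasets are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest()[:length]


def verify_dataset(filepath: Path, expected_prefix: str) -> bool:
    """Compare the digest prefix with the documented one; warn on mismatch."""
    actual = sha256_prefix(filepath, len(expected_prefix))
    if actual != expected_prefix.lower():
        warnings.warn(
            f"SHA-256 mismatch for {filepath}: "
            f"expected {expected_prefix}, got {actual}"
        )
        return False
    return True


# Demonstration with a throwaway file; the contents are arbitrary.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(b"a,b\n1,2\n")
    good_prefix = sha256_prefix(sample)
    matches = verify_dataset(sample, good_prefix)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        mismatches = verify_dataset(sample, "0000000")
```

A seven-character prefix is enough to catch accidental corruption or a replaced file, but it is far weaker than the full digest against deliberate tampering.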

Cache Management

Benchmark datasets use a local cache mechanism:

Cache Strategy
- Exists and verified: Use directly
- Does not exist: Download new file
- Verification fails: Issue warning and stop
Cache Cleanup
- Users can manually delete cache
- Recommend redownload on verification failure
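The three cache cases can be expressed as a single lookup function. This is a sketch under assumptions, not PETsARD's implementation: `resolve_cached_file` and the `download` callback are hypothetical, a freshly downloaded file is also verified, and "stop" is modelled as raising an exception after the warning.

```python
import hashlib
import tempfile
import warnings
from pathlib import Path
from typing import Callable


def resolve_cached_file(
    cache_path: Path,
    expected_prefix: str,
    download: Callable[[Path], None],
) -> Path:
    """Return a verified local file, downloading it first if it is missing."""
    if not cache_path.exists():
        # Does not exist: download a new file into the cache.
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        download(cache_path)
    # Reading the whole file is fine for a sketch; chunked reads suit large files.
    digest = hashlib.sha256(cache_path.read_bytes()).hexdigest()
    if not digest.startswith(expected_prefix.lower()):
        # Verification fails: issue a warning and stop.
        warnings.warn(f"Checksum mismatch for {cache_path}; delete it and redownload.")
        raise ValueError(f"verification failed for {cache_path}")
    # Exists and verified: use directly.
    return cache_path


# Demonstration with a fake download function and a temporary cache directory.
with tempfile.TemporaryDirectory() as tmp:
    payload = b"id,value\n1,10\n"
    expected = hashlib.sha256(payload).hexdigest()[:7]
    calls = []

    def fake_download(target: Path) -> None:
        calls.append(target)
        target.write_bytes(payload)

    cache_file = Path(tmp) / "benchmark" / "demo.csv"
    resolve_cached_file(cache_file, expected, fake_download)  # downloads
    resolve_cached_file(cache_file, expected, fake_download)  # uses the cache
    download_count = len(calls)

    cache_file.write_bytes(b"corrupted")  # simulate a damaged cache entry
    with warnings.catch_warnings(record=True):
        warnings.simplefilter("always")
        try:
            resolve_cached_file(cache_file, expected, fake_download)
            stopped = False
        except ValueError:
            stopped = True
```

The function never deletes a file that fails verification, which matches the guidance that cache cleanup is a manual step for the user.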

Best Practices

Dataset Selection

Consider the following when selecting datasets:

- Source reliability and stability
- Clear license terms
- Appropriate data volume
- Data quality consistency

Maintenance Guidelines

Documentation Maintenance
- Update dataset list promptly
- Ensure information accuracy
- Note important changes
Data Quality
- Regularly check dataset availability
- Update broken download links
- Maintain checksum list
User Experience
- Provide clear error messages
- Improve usage instructions
- Handle compatibility issues
