Tabular Datasets
Introduction
Tabular datasets in Fed-BioMed hold structured, row-and-column data (for example CSV files) used for classification and regression tasks.
Key Features:
- Automatic data loading with delimiter detection
- Format conversion (pandas, NumPy, PyTorch tensors)
- Framework compatibility (PyTorch and scikit-learn)
Data Structure
Basic CSV:
feature1,feature2,feature3,target
1.2,3.4,5.6,0
2.1,4.3,6.5,1
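To make "delimiter detection" concrete, the sketch below guesses the delimiter of a small delimited file with Python's standard `csv.Sniffer`. It is only an illustration and does not claim to be how Fed-BioMed loads data internally; the helper name `read_table` and the candidate delimiter set are assumptions made for this example.

```python
import csv
import io


def read_table(text, candidates=",;\t|"):
    """Parse delimited text, guessing the delimiter from a leading sample.

    `read_table` and `candidates` are hypothetical names for this sketch.
    """
    # Sniff only the beginning of the text; large files need not be scanned fully.
    dialect = csv.Sniffer().sniff(text[:1024], delimiters=candidates)
    reader = csv.reader(io.StringIO(text), dialect)
    header = next(reader)                 # first row holds the column names
    rows = [row for row in reader if row]  # skip blank lines
    return dialect.delimiter, header, rows


sample_csv = (
    "feature1,feature2,feature3,target\n"
    "1.2,3.4,5.6,0\n"
    "2.1,4.3,6.5,1\n"
)
delimiter, header, rows = read_table(sample_csv)
print(delimiter, header, len(rows))
```

Sniffing can fail on very short or irregular files (`csv.Error` is raised), so passing the delimiter explicitly remains the safer choice when it is known.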
Data Preparation
- Clean data (handle missing values, outliers)
- Format as CSV for maximum compatibility
- Validate data types in columns
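The preparation steps above can be sketched without any third-party dependency. The example assumes every column is meant to be numeric, that missing cells appear as an empty string or one of a few common tokens, and that median imputation is acceptable for the data at hand; `MISSING_TOKENS`, `to_float_or_none`, and `impute_median` are hypothetical names, not Fed-BioMed functions.

```python
import statistics

# Assumed spellings of a missing value; adjust to the actual data source.
MISSING_TOKENS = {"", "na", "nan", "null"}


def to_float_or_none(cell):
    """Return the cell as a float, or None when it is marked as missing."""
    text = cell.strip()
    if text.lower() in MISSING_TOKENS:
        return None
    return float(text)  # raises ValueError on non-numeric text such as "high"


def impute_median(rows):
    """Convert string rows to floats and fill gaps with the column median."""
    columns = [[to_float_or_none(cell) for cell in col] for col in zip(*rows)]
    cleaned = []
    for col in columns:
        observed = [value for value in col if value is not None]
        if not observed:
            raise ValueError("column has no observed values to impute from")
        fill = statistics.median(observed)
        cleaned.append([fill if value is None else value for value in col])
    # Transpose back from columns to rows.
    return [list(row) for row in zip(*cleaned)]


raw_rows = [
    ["1.2", "3.4", "5.6", "0"],
    ["2.1", "", "6.5", "1"],
    ["3.0", "4.0", "NA", "1"],
]
clean_rows = impute_median(raw_rows)
```

The sketch treats all columns alike and assumes the target column is complete; rows with a missing target are usually better dropped than imputed.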
Deployment
Node-Side
Register the dataset with the Fed-BioMed node CLI (see Deploying Datasets for details):
fedbiomed node dataset add
# 1. Select "csv"
# 2. Path: /path/to/your/data.csv
# 3. Tags: #tabular
Researcher-Side
Access tabular datasets through experiment configuration:
from fedbiomed.researcher.experiment import Experiment

# Select nodes whose registered datasets carry the '#tabular' tag
experiment = Experiment(
    tags=['#tabular'],
    model=my_model,
    training_plan_class=MyTrainingPlan,
    training_args=training_args
)
Integration with Training Plans
PyTorch Training Plan for Tabular Data
...
Scikit-learn Training Plan for Tabular Data
...
Best Practices and Common Issues
- Verify all numerical columns contain only numbers
- Check for mixed data types in the same column
- Handle string representations of numbers
- Choose appropriate imputation strategies
- Consider the impact of missing data
- Validate features don't introduce data leakage
- Check for infinite or NaN values
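Several of the checks above can be automated before a file is registered on a node. The following is a minimal, dependency-free sketch that counts missing, non-numeric, and non-finite cells per column; `audit_column` and `audit_table` are hypothetical helper names, and the sketch assumes cells arrive as strings, as they do when read with Python's `csv` module.

```python
import math


def audit_column(name, cells):
    """Count common problems in one column of string cells."""
    report = {"column": name, "missing": 0, "non_numeric": 0, "non_finite": 0}
    for cell in cells:
        text = cell.strip()
        if not text:
            report["missing"] += 1
            continue
        try:
            value = float(text)
        except ValueError:
            # Mixed types or formatted numbers, e.g. "high" or "1,5".
            report["non_numeric"] += 1
            continue
        if not math.isfinite(value):
            # float() accepts "nan" and "inf", so they need a separate check.
            report["non_finite"] += 1
    return report


def audit_table(header, rows):
    """Run audit_column on every column of a row-oriented table."""
    return [audit_column(name, col) for name, col in zip(header, zip(*rows))]


header = ["feature1", "feature2", "target"]
rows = [
    ["1.2", "3.4", "0"],
    ["inf", "high", "1"],
    ["", "nan", "1"],
]
for report in audit_table(header, rows):
    print(report)
```

A column whose counts are all zero passed these particular checks; leakage and the choice of imputation strategy still require a manual review of the features.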