Datasets in Fed-BioMed
Introduction
Dataset classes in Fed-BioMed bridge raw data stored on nodes with the federated learning training process. They provide a standardized interface for data access across different data types and machine learning frameworks while maintaining data privacy.
Purpose
In federated learning, training occurs across multiple distributed nodes, each with potentially different data types and structures. Dataset classes solve the fundamental challenge of unified data access in heterogeneous environments.
Challenge:
- Data is distributed across nodes with varying formats (images, CSV, NIfTI, etc.)
- Different ML frameworks require different data formats (PyTorch tensors vs. NumPy arrays)
- Raw data must remain private and never leave nodes
- Training code needs consistent data interfaces across all nodes
Fed-BioMed Solution:
Dataset classes provide an abstraction layer that:
- Standardizes data loading regardless of underlying format
- Automatically converts data to the required framework format
- Enables preprocessing and augmentation at the data source
- Maintains a consistent interface for training code across the federation
Benefit:
Researchers can write training plans once and deploy them across nodes with diverse data sources, while node administrators maintain full control over their local data without exposing raw files.
Key Features
- Standardized access: Unified interface for images, tabular, and medical data
- Framework-agnostic: Automatic conversion to PyTorch (
torch.Tensor) or scikit-learn (numpy.ndarray) formats - Privacy preserving: Data remains local on nodes
- Transformation support: Apply preprocessing and augmentation
- Cross-node consistency: Harmonized data formats across federated networks
Core Elements
All Fed-BioMed datasets share these core concepts:
- Node registration: Nodes deploy datasets with unique tags
- Researcher selection: Researchers can list and select datasets using tags
- Automatic resolution: Fed-BioMed nodes resolve tags to local paths
- Format conversion: Data converted to appropriate framework format
- Custom transformations: Flexible transformations supported for data preprocessing and augmentation.
Using Datasets
Deploying Datasets on Nodes
Before use in federated training, nodes must deploy datasets with unique tags. This registers metadata and makes datasets discoverable by researchers.
Use the following command to add a dataset into the node located in the directory ./my-node
$ fedbiomed node --path my-node dataset add
Searching for Available Datasets
Researchers can identify available by searching with tags:
from fedbiomed.researcher.requests import Requests
from fedbiomed.researcher.config import config
req = Requests(config=config)
result = req.list()
The search result returns a dictionary mapping nodes to their datasets.
In Federated Training
Datasets are then referenced by tags in experiment configuration:
# Researcher side - select datasets by tags
experiment = Experiment(
tags=['#MNIST', '#dataset'],
model=model,
training_plan_class=MyTrainingPlan,
training_args=training_args
)
Local Testing
Datasets can be used for model testing and development when there is data available locally.
Dataset Types
Fed-BioMed supports several dataset types for different data modalities.
- Default Datasets: Pre-built datasets with automatic downloading (MNIST, MedNIST)
- Image Datasets: Image classification with folder-based organization (
ImageFolderDataset) - Tabular Datasets: Structured data in CSV format with numerical and categorical features
- Medical Datasets: Multi-modal medical imaging in NIfTI format with optional demographics (
MedicalFolderDataset) - Custom Datasets: Specialized data types (
CustomDataset,NativeDataset)