Default Datasets

Introduction

Default datasets in Fed-BioMed are pre-built, ready-to-use datasets with automatic downloading and integration. They're ideal for prototyping, education, and benchmarking federated learning approaches.

Available Datasets:

MNIST: Handwritten digit recognition
MedNIST: Medical image classification

Key Features

Automatic downloading when not present
Framework compatibility (PyTorch tensors, NumPy arrays)
Zero configuration required
Standardized benchmarks

Deployment

Node-Side

Deploy using the Fed-BioMed node CLI (see Deploying Datasets):

fedbiomed node dataset add
# 1. Select "default" for MNIST or "mednist" for MedNIST
# 2. Specify dataset location
# 3. Add unique tags and description

Path Configuration

The path you specify when adding the dataset must match the root directory in the data structures shown below (e.g., the parent directory containing MNIST/ or MedNIST/). If the path doesn't match an existing dataset location, Fed-BioMed will download the dataset again to the specified location.
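Before registering a dataset, it can help to check whether a candidate path already holds the expected layout, so that no second download is triggered. The following is a minimal sketch only: `dataset_already_present` and `EXPECTED_SUBDIRS` are hypothetical names and not part of Fed-BioMed, and the check only mirrors the directory layouts documented on this page.

```python
from pathlib import Path

# Hypothetical helper, not part of Fed-BioMed: checks whether `root` is the
# parent directory of a default dataset, using the layouts shown on this page.
EXPECTED_SUBDIRS = {
    "MNIST": Path("MNIST") / "raw",
    "MedNIST": Path("MedNIST"),
}


def dataset_already_present(root: str, dataset: str) -> bool:
    """Return True if `root` contains the expected dataset folder."""
    return (Path(root) / EXPECTED_SUBDIRS[dataset]).is_dir()
```

For example, `dataset_already_present("/data", "MNIST")` is true only when `/data/MNIST/raw/` exists. Passing `/data/MNIST` instead of `/data` returns false, which is the mismatch that leads to a repeated download.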

Researcher-Side

Access datasets through experiment configuration:

from fedbiomed.researcher.experiment import Experiment

# my_model, MyTrainingPlan and training_args are assumed to be defined earlier
experiment = Experiment(
    tags=['#MNIST', '#dataset'],
    model=my_model,
    training_plan_class=MyTrainingPlan,
    training_args=training_args
)

Tag Matching

The tags specified in the experiment configuration must match the tags assigned when registering the dataset on nodes. Only nodes with datasets that have matching tags will participate in the training. Use descriptive and consistent tags across your federated network to ensure proper dataset selection.
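The selection rule can be pictured with a small stand-alone sketch. It assumes that a dataset matches when it carries every requested tag and that tags are compared as exact, case-sensitive strings. Both points are assumptions made for illustration, and `matching_nodes` is a hypothetical function, not a Fed-BioMed API.

```python
def matching_nodes(requested_tags, node_datasets):
    """Return ids of nodes with at least one dataset carrying all requested tags.

    `node_datasets` maps a node id to a list of tag lists, one per dataset.
    Assumed rule: every requested tag must be present, compared exactly.
    """
    wanted = set(requested_tags)
    return [
        node_id
        for node_id, datasets in node_datasets.items()
        if any(wanted.issubset(tags) for tags in datasets)
    ]


nodes = {
    "node-1": [["#MNIST", "#dataset"]],
    "node-2": [["#mnist", "#dataset"]],    # different case: no match here
    "node-3": [["#MedNIST", "#dataset"]],
}
print(matching_nodes(["#MNIST", "#dataset"], nodes))  # ['node-1']
```

Under these assumptions, a single inconsistently spelled tag is enough to exclude a node, which is why agreeing on tags across the network matters.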

Default Datasets

MNIST Dataset

Classic handwritten digit classification dataset.

Dataset Characteristics:

Training samples: 60,000 images
Test samples: 10,000 images
Image size: 28×28 pixels
Classes: 10 (digits 0-9)
Color: Grayscale (single channel)
File format: IDX format (automatically handled)

Data Structure

Fed-BioMed reads the standard MNIST IDX files automatically from the following layout:

root/
└── MNIST/
    └── raw/
        ├── train-images-idx3-ubyte
        ├── train-labels-idx1-ubyte
        ├── t10k-images-idx3-ubyte
        └── t10k-labels-idx1-ubyte
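Although the IDX files are handled automatically, a short sketch of the header layout can help when checking that a download is intact. An IDX file starts with two zero bytes, one byte for the data type, one byte for the number of dimensions, and then one big-endian 32-bit size per dimension. `read_idx_header` is a hypothetical helper written for this illustration, not a Fed-BioMed function.

```python
import struct

# IDX data type codes; 0x08 (unsigned byte) is the one MNIST uses.
IDX_DTYPES = {0x08: "uint8", 0x09: "int8", 0x0B: "int16",
              0x0C: "int32", 0x0D: "float32", 0x0E: "float64"}


def read_idx_header(raw: bytes):
    """Return (dtype name, dimensions) parsed from the start of an IDX file."""
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0:
        raise ValueError("not an IDX file: first two bytes must be zero")
    dims = struct.unpack(f">{ndim}I", raw[4:4 + 4 * ndim])
    return IDX_DTYPES[dtype_code], dims


# Synthetic header shaped like train-images-idx3-ubyte (60,000 images of 28x28).
header = struct.pack(">HBBIII", 0, 0x08, 3, 60000, 28, 28)
print(read_idx_header(header))  # ('uint8', (60000, 28, 28))
```

The same function applied to the first bytes of a label file would report a single dimension, since labels are stored as a flat list.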

Sample Data Format

For PyTorch TrainingPlan - data comes as torch.Tensor:

data.shape   # torch.Size([1, 28, 28]) - single channel, 28x28
data.dtype   # torch.float32
data.min()   # 0.0 (black pixels)
data.max()   # 1.0 (white pixels)
target       # tensor(7) - digit class 0-9

For scikit-learn TrainingPlan - data comes as numpy.ndarray:

data.shape   # (28, 28) - can be flattened to (784,)
data.dtype   # uint64
data.min()   # 0 (black pixels)
data.max()   # 255 (white pixels)
target       # array(7) - digit class 0-9

MNIST Transform Examples

Basic Digit Recognition:

from torchvision import transforms

mnist_transform = transforms.Compose([
    transforms.RandomRotation(10),              # Rotate by up to 10 degrees either way
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean and std for pixels in [0, 1]
])

Scikit-learn Compatible:

def mnist_sklearn_transform(data):
    flattened = data.flatten()                  # Flatten 28x28 to 784 features
    normalized = (flattened - 33.328) / 78.565  # MNIST mean and std on the 0-255 scale
    return normalized
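The two sets of normalization constants describe the same statistics on different pixel scales: 0.1307 and 0.3081 apply to pixels in [0, 1], while 33.328 and 78.565 apply to raw pixels in [0, 255]. The short check below, written with the standard library only, illustrates that relationship. The function names are introduced here for illustration.

```python
import math

MEAN_01, STD_01 = 0.1307, 0.3081      # constants for pixels scaled to [0, 1]
MEAN_255, STD_255 = 33.328, 78.565    # constants for raw pixels in [0, 255]

# Multiplying by 255 maps one set of constants onto the other, up to rounding.
assert math.isclose(MEAN_01 * 255, MEAN_255, abs_tol=1e-3)
assert math.isclose(STD_01 * 255, STD_255, abs_tol=1e-3)


def normalize_01(pixel):
    """Normalize a pixel given in [0, 1]."""
    return (pixel - MEAN_01) / STD_01


def normalize_255(pixel):
    """Normalize a raw pixel given in [0, 255]."""
    return (pixel - MEAN_255) / STD_255


# Both routes give the same normalized value for the same underlying pixel.
for raw in (0, 128, 255):
    assert math.isclose(normalize_01(raw / 255), normalize_255(raw), abs_tol=1e-4)
```

This means a PyTorch model and a scikit-learn model see equivalently scaled inputs, provided each transform is paired with the matching data format.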

MedNIST Dataset

The MedNIST dataset contains medical images from different imaging modalities, designed specifically for medical AI applications.

Dataset Characteristics:

Total samples: 58,954 medical images
Classes: 6 medical imaging modalities (AbdomenCT, BreastMRI, ChestCT, CXR, Hand and HeadCT)
Color: RGB (converted from medical imaging formats)
File format: JPEG format (automatically handled)

Medical Data Structure

root/
└── MedNIST/
    ├── AbdomenCT/        # Abdominal CT scans
    │   ├── 000000.jpeg
    │   ├── 000001.jpeg
    │   └── ...
    ├── BreastMRI/        # Breast MRI images
    ├── ChestCT/          # Chest CT scans
    ├── CXR/              # Chest X-Ray images
    ├── Hand/             # Hand X-Ray images
    └── HeadCT/           # Head CT scans
        └── ...
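With a folder-per-class layout like this one, integer labels are commonly derived from the sorted folder names, which is what torchvision's `ImageFolder` does. Whether Fed-BioMed's MedNIST loader uses the same ordering is an assumption here, and `class_to_index` is a hypothetical helper for illustration only.

```python
import tempfile
from pathlib import Path


def class_to_index(mednist_dir):
    """Map each class folder name to an integer, in sorted-name order."""
    classes = sorted(p.name for p in Path(mednist_dir).iterdir() if p.is_dir())
    return {name: idx for idx, name in enumerate(classes)}


# Demonstration on an empty copy of the folder layout shown above.
with tempfile.TemporaryDirectory() as root:
    for name in ["AbdomenCT", "BreastMRI", "ChestCT", "CXR", "Hand", "HeadCT"]:
        (Path(root) / "MedNIST" / name).mkdir(parents=True)
    mapping = class_to_index(Path(root) / "MedNIST")

print(mapping)
# {'AbdomenCT': 0, 'BreastMRI': 1, 'CXR': 2, 'ChestCT': 3, 'Hand': 4, 'HeadCT': 5}
```

Note that Python's default string sort places uppercase letters before lowercase ones, so `CXR` comes before `ChestCT`, which differs from the order in which the folders are listed in the tree.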

Medical Sample Data Format

For PyTorch TrainingPlan - data comes as torch.Tensor:

data.shape   # torch.Size([3, 64, 64])
data.dtype   # torch.float32
target       # tensor(3) - medical modality class 0-5

For scikit-learn TrainingPlan - data comes as numpy.ndarray:

data.shape   # (64, 64, 3)
data.dtype   # uint64
target       # array(3) - medical modality class 0-5

Medical Transform Examples

Conservative Medical Augmentation:

from torchvision import transforms

medical_transform = transforms.Compose([
    transforms.Resize((224, 224)),                    # Upsample from 64x64; 224x224 is a common input size for pretrained models
    transforms.RandomRotation(5),                     # Conservative rotation
    transforms.ColorJitter(brightness=0.1),           # Slight brightness adjustment
    transforms.Normalize(mean=[0.5], std=[0.5])       # Map [0, 1] to [-1, 1]; single value broadcasts over channels
])

Medical ML Pipeline (Scikit-learn):

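import cv2
import numpy as np

def medical_sklearn_transform(data):
    # cv2.resize may reject 64-bit integer arrays, so cast to float32 first
    resized = cv2.resize(data.astype(np.float32), (32, 32))  # Smaller for ML algorithms
    flattened = resized.flatten()                    # Flatten to 32*32*3 = 3072 features
    normalized = (flattened - 127.5) / 127.5         # Normalize to [-1, 1]
    return normalized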

Best Practices

For Node Administrators

Use descriptive and consistent tags when registering default datasets
Ensure adequate storage space for automatic downloads

For Researchers

Start with default datasets for initial federated learning experiments
Use consistent preprocessing across experiments for fair comparison
Leverage default datasets for validating new federated learning algorithms

For Educational Use

Begin with MNIST for understanding basic federated learning concepts
Progress to MedNIST for understanding domain-specific challenges
Use established transforms and benchmarks for learning