Default Datasets

Introduction

Default datasets in Fed-BioMed are pre-built, ready-to-use datasets with automatic downloading and integration. They're ideal for prototyping, education, and benchmarking federated learning approaches.

Available Datasets:

  • MNIST: Handwritten digit recognition
  • MedNIST: Medical image classification

Key Features

  • Automatic downloading when not present
  • Framework compatibility (PyTorch tensors, NumPy arrays)
  • Zero configuration required
  • Standardized benchmarks

Deployment

Node-Side

Deploy using the Fed-BioMed node CLI, see Deploying Datasets:

fedbiomed node dataset add
# 1. Select "default" for MNIST or "mednist" for MedNIST
# 2. Specify dataset location
# 3. Add unique tags and description

Path Configuration

The path you specify when adding the dataset must match the root directory in the data structures shown below (e.g., the parent directory containing MNIST/ or MedNIST/). If the path doesn't match an existing dataset location, Fed-BioMed will download the dataset again to the specified location.

Researcher-Side

Access datasets through experiment configuration:

from fedbiomed.researcher.experiment import Experiment

experiment = Experiment(
    tags=['#MNIST', '#dataset'],
    model=my_model,
    training_plan_class=MyTrainingPlan,
    training_args=training_args
)

Tag Matching

The tags specified in the experiment configuration must match the tags assigned when registering the dataset on nodes. Only nodes with datasets that have matching tags will participate in the training. Use descriptive and consistent tags across your federated network to ensure proper dataset selection.

Default Datasets

MNIST Dataset

Classic handwritten digit classification dataset.

Dataset Characteristics:

  • Training samples: 60,000 images
  • Test samples: 10,000 images
  • Image size: 28×28 pixels
  • Classes: 10 (digits 0-9)
  • Color: Grayscale (single channel)
  • File format: IDX format (automatically handled)

Data Structure

The dataset automatically manages the standard MNIST IDX format:

root/
└── MNIST/
    └── raw/
        ├── train-images-idx3-ubyte
        ├── train-labels-idx1-ubyte
        ├── t10k-images-idx3-ubyte
        └── t10k-labels-idx1-ubyte

Sample Data Format

  • For PyTorch TrainingPlan - data comes as torch.Tensor:

    data.shape   # torch.Size([1, 28, 28]) - single channel, 28x28
    data.dtype   # torch.float32
    data.min()   # 0.0 (black pixels)
    data.max()   # 1.0 (white pixels)
    target       # tensor(7) - digit class 0-9
    

  • For scikit-learn TrainingPlan - data comes as numpy.ndarray:

    data.shape   # (28, 28) - can be flattened to (784,)
    data.dtype   # uint64
    data.min()   # 0 (black pixels)
    data.max()   # 255 (white pixels)
    target       # array(7) - digit class 0-9
    

MNIST Transform Examples

  • Basic Digit Recognition:

    mnist_transform = transforms.Compose([
        transforms.RandomRotation(10),              # Slight rotation for digits
        transforms.Normalize((0.1307,), (0.3081,))  # MNIST-specific normalization
    ])
    

  • Scikit-learn Compatible:

    def mnist_sklearn_transform(data):
        flattened = data.flatten()                  # Flatten to 784 features
        normalized = (flattened - 33.328) / 78.565  # MNIST normalization
        return normalized
    

MedNIST Dataset

The MedNIST dataset contains medical images from different imaging modalities, designed specifically for medical AI applications.

Dataset Characteristics:

  • Total samples: 58,954 medical images
  • Classes: 6 medical imaging modalities (AbdomenCT, BreastMRI, ChestCT, CXR, Hand and HeadCT)
  • Color: RGB (converted from medical imaging formats)
  • File format: JPEG format (automatically handled)

Medical Data Structure

root/
└── MedNIST/
    ├── AbdomenCT/        # Abdominal CT scans
    │   ├── 000000.jpeg
    │   ├── 000001.jpeg
    │   └── ...
    ├── BreastMRI/        # Breast MRI images
    ├── ChestCT/          # Chest CT scans
    ├── CXR/              # Chest X-Ray images
    ├── Hand/             # Hand X-Ray images
    └── HeadCT/           # Head CT scans
        └── ...

Medical Sample Data Format

  • For PyTorch TrainingPlan - data comes as torch.Tensor:

    data.shape   # torch.Size([3, 64, 64])
    data.dtype   # torch.float32
    target       # tensor(3) - medical modality class 0-5
    

  • For scikit-learn TrainingPlan - data comes as numpy.ndarray:

    data.shape   # (64, 64, 3)
    data.dtype   # uint64
    target       # array(3) - medical modality class 0-5
    

Medical Transform Examples

  • Conservative Medical Augmentation:

    medical_transform = transforms.Compose([
        transforms.Resize((224, 224)),                    # Standard medical image size
        transforms.RandomRotation(5),                     # Conservative rotation
        transforms.ColorJitter(brightness=0.1),           # Slight brightness adjustment
        transforms.Normalize(mean=[0.5], std=[0.5])       # Medical image normalization
    ])
    

  • Medical ML Pipeline (Scikit-learn):

    def medical_sklearn_transform(data):
        resized = cv2.resize(data, (32, 32))             # Smaller for ML algorithms
        flattened = resized.flatten()                    # Flatten for traditional ML
        normalized = (flattened - 127.5) / 127.5         # Normalize to [-1, 1]
        return normalized
    

Best Practices

For Node Administrators

  • Use descriptive and consistent tags when registering default datasets
  • Ensure adequate storage space for automatic downloads

For Researchers

  • Start with default datasets for initial federated learning experiments
  • Use consistent preprocessing across experiments for fair comparison
  • Leverage default datasets for validating new federated learning algorithms

For Educational Use

  • Begin with MNIST for understanding basic federated learning concepts
  • Progress to MedNIST for understanding domain-specific challenges
  • Use established transforms and benchmarks for learning