
FLamby in Fed-BioMed

This tutorial demonstrates how to use FLamby datasets in Fed-BioMed. You'll learn:

  • How to download FLamby datasets
  • How to deploy FLamby datasets for different centers using separate data partitioning
  • How to define datasets for FLamby examples in your federated learning experiments

Overview

FLamby is a comprehensive benchmark suite for federated learning in healthcare. The datasets are not included directly in the FLamby installation due to licensing and size constraints. Each dataset must be downloaded separately using dedicated download scripts provided by the FLamby library.

This notebook provides a comprehensive guide on how to:

  1. Discover available FLamby datasets - Find which datasets are available in your FLamby installation
  2. Download datasets programmatically - Use Python subprocess to execute download scripts
  3. Deploy downloaded datasets - Configure datasets for use with Fed-BioMed nodes
  4. Verify successful downloads - Ensure datasets are complete and properly configured

For detailed information about FLamby integration concepts and training plan implementation, please visit the FLamby dataset introduction tutorial.

This hands-on tutorial will focus specifically on deploying the Fed Heart Disease dataset that comes with FLamby, providing you with step-by-step instructions to successfully deploy this dataset for federated learning experiments.

Prerequisites

Before starting, ensure you have:

  • FLamby installed: pip install git+https://github.com/owkin/FLamby@main
  • wget dependency: pip install wget
  • Fed-BioMed installed: Make sure your Fed-BioMed environment is properly configured
  • Sufficient disk space: FLamby datasets can be several GB in size
  • Internet connection: Required for downloading datasets from external sources
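
Before moving on, you may want to confirm that the two main packages from the list above are visible in your environment. The short check below is not part of the original notebook; it only tries to import them and reports the result.

In [ ]:
# Optional environment check: confirm that FLamby and Fed-BioMed are importable.
import importlib

for pkg in ("flamby", "fedbiomed"):
    try:
        importlib.import_module(pkg)
        print(f"{pkg}: OK")
    except ImportError as exc:
        print(f"{pkg}: not available ({exc})")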

Import Required Libraries

In this section, we'll import the necessary libraries and explore the available FLamby datasets. This step helps us understand what datasets are available in your FLamby installation before proceeding with downloads.

The code below will:

  • Import essential Python libraries for file handling and dataset discovery
  • Load the FLamby datasets module
  • Display a list of all available FLamby datasets in your installation
In [ ]:
import pkgutil
from pathlib import Path
In [ ]:
from flamby import datasets

# List of available FLamby datasets
list(i.name for i in pkgutil.iter_modules(datasets.__path__))

Datasets have to be downloaded using the download.py script provided with the FLamby library. Therefore, we first need to locate the correct download script for the given dataset.

In [ ]:
import flamby.datasets.fed_heart_disease

# Locate the download script shipped with the fed_heart_disease module
dataset_root = Path(flamby.datasets.fed_heart_disease.__file__).parent
print(dataset_root)
download_script = dataset_root / "dataset_creation_scripts" / "download.py"

# Run the script, piping confirmation to its interactive prompt
!y | python {download_script} --output-folder ./data/fed_heart_disease
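
Before deploying anything, it is worth checking that the download actually produced files. The snippet below is only an illustrative sketch, not part of the FLamby API; it lists whatever ended up under the output folder used above.

In [ ]:
# Optional check: list the files produced by the download script.
from pathlib import Path

data_dir = Path("./data/fed_heart_disease")
files = sorted(str(p.relative_to(data_dir)) for p in data_dir.rglob("*") if p.is_file())
print(f"{len(files)} file(s) found under {data_dir}:")
for name in files:
    print(" -", name)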

Deploying Datasets

After the datasets are downloaded, they can be deployed on Fed-BioMed nodes. To deploy them, the Fed-BioMed CustomDataset type will be used.

Please execute the following commands to create Fed-BioMed node components:

In [ ]:
!fedbiomed component create -c node --path ./node-1 -n my-first-node
!fedbiomed component create -c node --path ./node-2 -n my-second-node

After the nodes are created, the FLamby dataset can be deployed. To do that, a JSON file has to be created that describes where the data is located and which center/partition is going to be used on that node.

Please keep in mind that this is a testing scenario: the FLamby dataset is downloaded once and partitioned by center, so the dataset definition (center and data path) is passed to each node through a JSON file in order to deploy the same download on two different nodes.

In [ ]:
import os
import json 

# Get the absolute path to the downloaded FLamby dataset
abs_path = os.path.abspath("./data/fed_heart_disease")

# Create dataset configuration for Node 1 (using center/partition 1)
dataset_description = {"center": 1, "dataset-path": abs_path}

# Create dataset configuration for Node 2 (using center/partition 2) 
dataset_description_2 = {"center": 2, "dataset-path": abs_path}

# Save dataset configuration for Node 1
node_data_path = os.path.abspath("./node-1/data/dataset_description.json")
with open(node_data_path, 'w') as f:
    json.dump(dataset_description, f)

# Save dataset configuration for Node 2
node_data_path_2 = os.path.abspath("./node-2/data/dataset_description.json")
with open(node_data_path_2, 'w') as f:
    json.dump(dataset_description_2, f)

The JSON files defined above are used in the TrainingPlan to load the correct partition for each node. The datasets still need to be deployed on the nodes. There are two options: (1) use the interactive CLI to define the dataset name, tags, and data path one by one; or (2) use a JSON file that contains the dataset metadata (tags, name, path). To keep this tutorial simple, we will use the JSON file method to add datasets.

In [ ]:
dataset_for_node_1 = {"name": "fed_heart_disease_node_1",
                      "data_type": "custom",
                      "tags": "flamby,fed_heart_disease",
                      "description": "Heart disease dataset for federated learning",
                      "path": node_data_path}

dataset_for_node_2 = {"name": "fed_heart_disease_node_2",
                      "data_type": "custom",
                      "tags": "flamby,fed_heart_disease",
                      "description": "Heart disease dataset for federated learning",
                      "path": node_data_path_2}

with open('./node_1_dataset_metadata.json', 'w') as f:
    json.dump(dataset_for_node_1, f)

with open('./node_2_dataset_metadata.json', 'w') as f:
    json.dump(dataset_for_node_2, f)
dataset_for_node_1 = {"name": "fed_heart_disease_node_1", "data_type": "custom", "tags": "flamby,fed_heart_disease", "description": "Heart disease dataset for federated learning", "path": node_data_path} dataset_for_node_2 = {"name": "fed_heart_disease_node_2", "data_type": "custom", "tags": "flamby,fed_heart_disease", "description": "Heart disease dataset for federated learning", "path": node_data_path_2} with open('./node_1_dataset_metadata.json', 'w') as f: json.dump(dataset_for_node_1, f) with open('./node_2_dataset_metadata.json', 'w') as f: json.dump(dataset_for_node_2, f)

After the dataset metadata/descriptor JSON files are saved, the datasets can be deployed on the nodes using the commands below.

In [ ]:
!fedbiomed node -p ./node-1 dataset add --file ./node_1_dataset_metadata.json
!fedbiomed node -p ./node-2 dataset add --file ./node_2_dataset_metadata.json
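
Optionally, you can check that each node now registers its dataset. Recent Fed-BioMed CLI versions expose a dataset list subcommand for this; if it is not available in your installation, consult fedbiomed node --help.

In [ ]:
!fedbiomed node -p ./node-1 dataset list
!fedbiomed node -p ./node-2 dataset list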

Writing the TrainingPlan

Now let's write the training plan that trains the FLamby baseline model on the Fed Heart Disease dataset. A nested CustomDataset class reads the descriptor JSON deployed on each node and loads the corresponding FLamby center:

In [ ]:
from fedbiomed.common.dataset import CustomDataset
from fedbiomed.common.training_plans import TorchTrainingPlan
from flamby.datasets.fed_heart_disease import (
    FedHeartDisease, 
    Baseline, 
    BaselineLoss, 
    Optimizer
)
from fedbiomed.common.data import DataManager

class FedHeartTrainingPlan(TorchTrainingPlan):
    def init_model(self, model_args):
        return Baseline()

    def init_optimizer(self, optimizer_args):
        return Optimizer(self.model().parameters(), lr=optimizer_args["lr"])

    def init_dependencies(self):
        # Dependencies re-imported on the node side; keep them consistent with the imports above
        return ["from flamby.datasets.fed_heart_disease import FedHeartDisease, Baseline, BaselineLoss, Optimizer",
                "from fedbiomed.common.data import DataManager",
                "from fedbiomed.common.dataset import CustomDataset"
                ]

    def training_step(self, data, target):
        output = self.model().forward(data)
        return BaselineLoss().forward(output, target)

    class MyFedHeartDataset(CustomDataset):

        def read(self):
            """Read FLamby data"""            
            
            # Read json file that is deployed on the node
            import json
            with open(self.path) as f:
                flamby_data = json.load(f)

            # Create data file
            self.data = FedHeartDisease(
                center=flamby_data["center"], 
                data_path=flamby_data["dataset-path"]
            )

        def get_item(self, item):
            """Get item"""
            return self.data[item]
        
        def __len__(self):
            """Dataset length"""
            return len(self.data)

    def training_data(self, batch_size=2):
        dataset = self.MyFedHeartDataset()
        train_kwargs = {'batch_size': batch_size, 'shuffle': True}
        return DataManager(dataset, **train_kwargs)

After defining the training plan, set model_args and training_args. This tutorial uses the simple FLamby baseline model, so no additional model-specific arguments are required (you can leave model_args empty).

In [ ]:
model_args = {}

training_args = {
    'loader_args': { 'batch_size': 16, },
    'optimizer_args': {
        'lr': 0.001,
    },
    'epochs': 2,
    'dry_run': False,
    'log_interval': 2,
    'test_ratio' : 0.2,
    'test_batch_size': 16,
    'test_on_global_updates': True,
    'test_on_local_updates': True,
    'batch_maxnum': 10 # Fast pass for development : only use ( batch_maxnum * batch_size ) samples
}

Define the Experiment

The Experiment ties nodes, datasets, the training plan and aggregation into a single federated run.

Key fields:

  • tags — dataset tags used to select participating nodes.
  • training_plan_class — training plan implementing model, loss and optimizer.
  • model_args — params passed to the training plan for model init.
  • training_args — data loader, optimizer, epoch and runtime options.
  • aggregator — server-side aggregation strategy (e.g. FedAverage()).
  • round_limit — number of federated rounds to execute.

Please make sure that the two nodes with the deployed datasets are up and running before starting your experiment:

fedbiomed node -p ./node-1 start
fedbiomed node -p ./node-2 start

Note: ensure deployed datasets use matching tags and correct descriptor JSONs. For fast development, lower round_limit, enable dry_run, or set batch_maxnum.

In [ ]:
from fedbiomed.researcher.federated_workflows import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['flamby', 'fed_heart_disease']
num_rounds = 2

exp = Experiment(tags=tags,
                 model_args=model_args,
                 training_plan_class=FedHeartTrainingPlan,
                 training_args=training_args,
                 round_limit=num_rounds,
                 aggregator=FedAverage(),
                )
In [ ]:
exp.run()
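
Once the rounds have completed, you will typically want to inspect or save the aggregated model. The calls below follow the pattern used in other Fed-BioMed tutorials and are an assumption about the researcher API in your installed version; consult the Experiment API reference if they differ.

In [ ]:
# Assumed researcher-side API (verify against your Fed-BioMed version):
# list the rounds with aggregated parameters, then export the trained model to disk.
print(exp.aggregated_params().keys())
exp.training_plan().export_model('./trained_fed_heart_model')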

Troubleshooting

  • Ensure the FLamby dataset is downloaded to the location referenced by the dataset descriptor JSON files (the path in "dataset-path"). An empty or missing data folder will cause data-loading errors (for example IndexError).
  • Verify each node's dataset_description.json exists and contains the required fields: at minimum "dataset-path" (absolute or relative path to the downloaded FLamby data) and "center" (the partition/center number).
  • If you get errors when adding a dataset with the fedbiomed CLI using the --file option, validate the JSON for correct syntax, field names, and valid paths. If the issue persists, add the dataset interactively instead: fedbiomed node -p <path-to-node> dataset add and follow the prompts to provide the dataset name, tags, and path.
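
To apply the first two checks above programmatically, the hypothetical helper below loads a node's dataset_description.json and verifies the fields and data folder that the training plan relies on.

In [ ]:
import json
from pathlib import Path

def check_descriptor(descriptor_path):
    """Illustrative sanity check for a FLamby dataset descriptor JSON."""
    with open(descriptor_path) as f:
        desc = json.load(f)
    # The custom dataset in the training plan expects exactly these two fields.
    for key in ("center", "dataset-path"):
        assert key in desc, f"missing '{key}' in {descriptor_path}"
    data_dir = Path(desc["dataset-path"])
    assert data_dir.is_dir() and any(data_dir.iterdir()), f"empty or missing data folder: {data_dir}"
    print(f"{descriptor_path}: center={desc['center']}, data folder {data_dir} looks OK")

check_descriptor("./node-1/data/dataset_description.json")
check_descriptor("./node-2/data/dataset_description.json")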