FLamby in Fed-BioMed¶
This tutorial demonstrates how to use FLamby datasets in Fed-BioMed. You'll learn:
- How to download FLamby datasets
- How to deploy FLamby datasets for different centers using separate data partitioning
- How to define datasets for FLamby examples in your federated learning experiments
Overview¶
FLamby is a comprehensive benchmark suite for federated learning in healthcare. The datasets are not included directly in the FLamby installation due to licensing and size constraints. Each dataset must be downloaded separately using dedicated download scripts provided by the FLamby library.
This notebook provides a comprehensive guide on how to:
- Discover available FLamby datasets - Find which datasets are available in your FLamby installation
- Download datasets programmatically - Use Python subprocess to execute download scripts
- Deploy downloaded datasets - Configure datasets for use with Fed-BioMed nodes
- Verify successful downloads - Ensure datasets are complete and properly configured
For detailed information about FLamby integration concepts and training plan implementation, please visit the FLamby dataset introduction tutorial.
This hands-on tutorial will focus specifically on deploying the Fed Heart Disease dataset that comes with FLamby, providing you with step-by-step instructions to successfully deploy this dataset for federated learning experiments.
Prerequisites¶
Before starting, ensure you have:
- FLamby installed: `pip install git+https://github.com/owkin/FLamby@main`
- wget dependency: `pip install wget`
- Fed-BioMed installed: Make sure your Fed-BioMed environment is properly configured
- Sufficient disk space: FLamby datasets can be several GB in size
- Internet connection: Required for downloading datasets from external sources
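To quickly confirm that these prerequisites are met, you can check that the required packages are importable. This is only a convenience snippet, not part of FLamby or Fed-BioMed.
import importlib.util
# Optional sanity check: confirm that the required packages are installed
for pkg in ("flamby", "fedbiomed", "wget"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'installed' if found else 'NOT installed'}")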
Import Required Libraries¶
In this section, we'll import the necessary libraries and explore the available FLamby datasets. This step helps us understand what datasets are available in your FLamby installation before proceeding with downloads.
The code below will:
- Import essential Python libraries for file handling and dataset discovery
- Load the FLamby datasets module
- Display a list of all available FLamby datasets in your installation
import pkgutil
from pathlib import Path
from flamby import datasets
# List of available FLamby datasets
list(i.name for i in pkgutil.iter_modules(datasets.__path__))
Datasets have to be downloaded using the download.py script provided in the FLamby library/module. Therefore, we have to find the correct download script for the given dataset.
import flamby.datasets.fed_heart_disease
dataset_root = Path(flamby.datasets.fed_heart_disease.__file__).parent
print(dataset_root)
download_script = dataset_root / "dataset_creation_scripts" / "download.py"
# Pipe "yes" to the script to automatically confirm the download prompt
!yes | python {download_script} --output-folder ./data/fed_heart_disease
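Once the script finishes, verify that the output folder actually contains data files. The exact file names depend on the FLamby version, so this is only a quick sanity check:
from pathlib import Path
data_dir = Path("./data/fed_heart_disease")
# List the downloaded files and their sizes to confirm the download succeeded
for p in sorted(data_dir.rglob("*")):
    if p.is_file():
        print(p.relative_to(data_dir), p.stat().st_size, "bytes")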
Deploying Datasets¶
After the datasets are downloaded, they can be deployed on Fed-BioMed nodes. To deploy them, the Fed-BioMed `CustomDataset` type will be used.
Please execute the following commands to create Fed-BioMed node components:
!fedbiomed component create -c node --path ./node-1 -n my-first-node
!fedbiomed component create -c node --path ./node-2 -n my-second-node
After the nodes are created, the FLamby dataset can be deployed. To do that, a JSON file has to be created that describes where the data is located and which center/partition is going to be used for that dataset.
Please keep in mind that this is a testing scenario: the FLamby dataset is downloaded once and is already partitioned by center, so the dataset definition (which center to use) is passed through a JSON file so that two different nodes can each use a different partition of the same download.
import os
import json
# Get the absolute path to the downloaded FLamby dataset
abs_path = os.path.abspath("./data/fed_heart_disease")
# Create dataset configuration for Node 1 (using center/partition 1)
dataset_description = {"center": 1, "dataset-path": abs_path}
# Create dataset configuration for Node 2 (using center/partition 2)
dataset_description_2 = {"center": 2, "dataset-path": abs_path}
# Save dataset configuration for Node 1
node_data_path = os.path.abspath("./node-1/data/dataset_description.json")
with open(node_data_path, 'w') as f:
    json.dump(dataset_description, f)
# Save dataset configuration for Node 2
node_data_path_2 = os.path.abspath("./node-2/data/dataset_description.json")
with open(node_data_path_2, 'w') as f:
    json.dump(dataset_description_2, f)
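Optionally, read the two descriptor files back to confirm their content before deploying them:
# Read the descriptor files back to confirm their content
for path in (node_data_path, node_data_path_2):
    with open(path) as f:
        print(path, "->", json.load(f))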
The JSON files defined above are used in the TrainingPlan to load the correct partition for each node. The datasets still need to be deployed on the nodes. There are two options: (1) use the interactive CLI to define the dataset name, tags, and data path one by one; or (2) use a JSON file that contains the dataset metadata (tags, name, path). To keep this tutorial simple, we will use the JSON file method to add datasets.
dataset_for_node_1 = {"name": "fed_heart_disease_node_1",
                      "data_type": "custom",
                      "tags": "flamby,fed_heart_disease",
                      "description": "Heart disease dataset for federated learning",
                      "path": node_data_path}

dataset_for_node_2 = {"name": "fed_heart_disease_node_2",
                      "data_type": "custom",
                      "tags": "flamby,fed_heart_disease",
                      "description": "Heart disease dataset for federated learning",
                      "path": node_data_path_2}

with open('./node_1_dataset_metadata.json', 'w') as f:
    json.dump(dataset_for_node_1, f)

with open('./node_2_dataset_metadata.json', 'w') as f:
    json.dump(dataset_for_node_2, f)
After the dataset metadata/descriptor JSON files are saved, the datasets can be deployed on the nodes using the commands below.
!fedbiomed node -p ./node-1 dataset add --file ./node_1_dataset_metadata.json
!fedbiomed node -p ./node-2 dataset add --file ./node_2_dataset_metadata.json
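If your Fed-BioMed version provides the `dataset list` sub-command, you can verify that the datasets were registered on each node:
!fedbiomed node -p ./node-1 dataset list
!fedbiomed node -p ./node-2 dataset list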
Writing the TrainingPlan¶
Let's now write the training plan that trains FLamby's baseline model on the Fed Heart Disease dataset:
from fedbiomed.common.dataset import CustomDataset
from fedbiomed.common.training_plans import TorchTrainingPlan
from flamby.datasets.fed_heart_disease import (
FedHeartDisease,
Baseline,
BaselineLoss,
Optimizer
)
from fedbiomed.common.data import DataManager
class FedHeartTrainingPlan(TorchTrainingPlan):

    def init_model(self, model_args):
        # FLamby provides a baseline model for each of its datasets
        return Baseline()

    def init_optimizer(self, optimizer_args):
        # FLamby also provides a reference optimizer for the baseline model
        return Optimizer(self.model().parameters(), lr=optimizer_args["lr"])

    def init_dependencies(self):
        # Import statements the nodes need in order to execute this training plan
        return ["from flamby.datasets.fed_heart_disease import FedHeartDisease, Baseline, BaselineLoss, Optimizer",
                "from fedbiomed.common.data import DataManager",
                "from fedbiomed.common.dataset import CustomDataset"]

    def training_step(self, data, target):
        output = self.model().forward(data)
        return BaselineLoss().forward(output, target)

    class MyFedHeartDataset(CustomDataset):
        def read(self):
            """Read FLamby data"""
            # Read the dataset descriptor JSON file that is deployed on the node
            import json
            with open(self.path) as f:
                flamby_data = json.load(f)

            # Instantiate the FLamby dataset for the node's center/partition
            self.data = FedHeartDisease(
                center=flamby_data["center"],
                data_path=flamby_data["dataset-path"]
            )

        def get_item(self, item):
            """Get item"""
            return self.data[item]

        def __len__(self):
            """Dataset length"""
            return len(self.data)

    def training_data(self, batch_size=2):
        dataset = self.MyFedHeartDataset()
        train_kwargs = {'batch_size': batch_size, 'shuffle': True}
        return DataManager(dataset, **train_kwargs)
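Before launching the federated run, you can optionally check on the researcher side that both centers load correctly with FLamby. This is only a convenience check and assumes abs_path (defined in the deployment step) still points to the downloaded data.
# Local sanity check: load each center directly with FLamby
from flamby.datasets.fed_heart_disease import FedHeartDisease

for center in (1, 2):
    ds = FedHeartDisease(center=center, train=True, data_path=abs_path)
    x, y = ds[0]
    print(f"center {center}: {len(ds)} samples, x shape {tuple(x.shape)}, y shape {tuple(y.shape)}")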
After defining the training plan, set model_args and training_args. This tutorial uses the simple FLamby baseline model, so no additional model-specific arguments are required (you can leave model_args empty).
model_args = {}
training_args = {
    'loader_args': {'batch_size': 16},
    'optimizer_args': {
        'lr': 0.001,
    },
    'epochs': 2,
    'dry_run': False,
    'log_interval': 2,
    'test_ratio': 0.2,
    'test_batch_size': 16,
    'test_on_global_updates': True,
    'test_on_local_updates': True,
    'batch_maxnum': 10  # Fast pass for development: only use (batch_maxnum * batch_size) samples
}
Define the Experiment¶
The Experiment ties nodes, datasets, the training plan and aggregation into a single federated run.
Key fields:
- `tags` — dataset tags used to select participating nodes.
- `training_plan_class` — training plan implementing model, loss and optimizer.
- `model_args` — params passed to the training plan for model init.
- `training_args` — data loader, optimizer, epoch and runtime options.
- `aggregator` — server-side aggregation strategy (e.g. `FedAverage()`).
- `round_limit` — number of federated rounds to execute.
Please make sure that the two nodes with the deployed datasets are up and running before running your experiment:
fedbiomed node -p ./node-1 start
fedbiomed node -p ./node-2 start
Note: ensure deployed datasets use matching tags and correct descriptor JSONs. For fast development, lower round_limit, enable dry_run, or set batch_maxnum.
from fedbiomed.researcher.federated_workflows import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage
tags = ['flamby', 'fed_heart_disease']
num_rounds = 2
exp = Experiment(tags=tags,
model_args=model_args,
training_plan_class=FedHeartTrainingPlan,
training_args=training_args,
round_limit=num_rounds,
aggregator=FedAverage(),
)
exp.run()
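After the rounds finish, the aggregated model parameters can usually be inspected directly on the Experiment object. The sketch below assumes the aggregated_params() accessor available in recent Fed-BioMed versions.
# Inspect which rounds produced aggregated parameters (accessor names may differ across versions)
aggregated = exp.aggregated_params()
print("Rounds with aggregated parameters:", list(aggregated.keys()))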
Troubleshooting¶
- Ensure the FLamby dataset is downloaded to the location referenced by the dataset descriptor JSON files (the path in `"dataset-path"`). An empty or missing data folder will cause data-loading errors (for example `IndexError`).
- Verify each node's `dataset_description.json` exists and contains the required fields: at minimum `"dataset-path"` (absolute or relative path to the downloaded FLamby data) and `"center"` (the partition/center number).
- If you get errors when adding a dataset with the `fedbiomed` CLI using the `--file` option, validate the JSON for correct syntax, field names, and valid paths. If the issue persists, add the dataset interactively instead: `fedbiomed node -p <path-to-node> dataset add` and follow the prompts to provide the dataset name, tags, and path.
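As a quick way to catch the first two issues, you could run a small check like the one below on the machine hosting the nodes. This helper is just an illustration, not part of Fed-BioMed.
import json
import os

# Check that each node's descriptor JSON exists, has the required fields,
# and points to an existing data folder
for descriptor in ("./node-1/data/dataset_description.json",
                   "./node-2/data/dataset_description.json"):
    with open(descriptor) as f:
        desc = json.load(f)
    missing = [k for k in ("dataset-path", "center") if k not in desc]
    if missing:
        print(f"{descriptor}: missing fields {missing}")
    elif not os.path.isdir(desc["dataset-path"]):
        print(f"{descriptor}: data folder not found at {desc['dataset-path']}")
    else:
        print(f"{descriptor}: OK -> {desc}")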