This example demonstrates using an Nvidia GPU for training a model.
The nodes for this example need to run on a machine providing an Nvidia GPU with enough GPU memory (and a recent enough model, so that it is supported by PyTorch).
If the GPU doesn't have enough memory you will get an out-of-memory error at run time.
You can check the Fed-BioMed GPU documentation for some background about using GPUs with Fed-BioMed.
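Before setting up the nodes, you may want to check that PyTorch detects a GPU on the node machine and how much memory it has. Below is a minimal, optional sketch using standard PyTorch calls (run it in a Python shell on the node machine); it is not part of the tutorial itself.
import torch

# Optional check: does PyTorch see a CUDA GPU, and how much memory does it have?
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"Default GPU: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
else:
    print("No CUDA GPU detected by PyTorch")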
Set up the nodes¶
We need at least 1 node; let's test using 3 nodes.
- For each node, add the MNIST dataset:
{FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config1.ini dataset add
{FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config2.ini dataset add
{FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config3.ini dataset add
- Select option 2 (default) to add MNIST to the node
- Confirm default tags by hitting "y" and ENTER
- Pick the folder where MNIST is already downloaded (or where to download MNIST)
- Check that your data has been added by executing:
{FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config1.ini dataset list
{FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config2.ini dataset list
{FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config3.ini dataset list
- Run the first node using
{FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config1.ini start --gpu
so that the node offers to use GPU for training, with the default GPU device.
- Run the second node using
{FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config2.ini start --gpu-only --gpunum 1
so that the node enforces use of GPU for training even if the researcher doesn't request it, and requests using the 2nd GPU (device 1) but will fall back to the default device if you don't have 2 GPUs on this machine (the sketch after this list shows how to list device indices).
- Run the third node using
{FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config3.ini start
so that the node doesn't offer to use GPU for training (default behaviour).
- Wait until you get
Starting task manager
for each node; it means the nodes are online.
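If you are unsure which index to pass to --gpunum, a quick way to see the CUDA devices and their indices is to enumerate them with PyTorch on the node machine. This is a minimal, optional sketch using standard PyTorch calls; device 0 is the default device.
import torch

# List CUDA devices with their indices; the index is the value expected by --gpunum
for i in range(torch.cuda.device_count()):
    print(f"device {i}: {torch.cuda.get_device_name(i)}")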
Define the training plan¶
All this part is the same as when running a model using CPU: the model is unchanged.
Declare a training plan class to send for training on the nodes.
import torch
import torch.nn as nn
import torch.nn.functional as F  # F is used in Net.forward below
from fedbiomed.common.training_plans import TorchTrainingPlan
from fedbiomed.common.data import DataManager
from torchvision import datasets, transforms

# Here we define the training plan to be used.
class MyTrainingPlan(TorchTrainingPlan):

    # Defines and returns the model
    def init_model(self, model_args):
        return self.Net(model_args=model_args)

    # Defines and returns the optimizer
    def init_optimizer(self, optimizer_args):
        return torch.optim.Adam(self.model().parameters(), lr=optimizer_args["lr"])

    # Declares and returns dependencies
    def init_dependencies(self):
        deps = ["from torchvision import datasets, transforms"]
        return deps

    class Net(nn.Module):
        def __init__(self, model_args):
            super().__init__()
            self.conv1 = nn.Conv2d(1, 32, 3, 1)
            self.conv2 = nn.Conv2d(32, 64, 3, 1)
            self.dropout1 = nn.Dropout(0.25)
            self.dropout2 = nn.Dropout(0.5)
            self.fc1 = nn.Linear(9216, 128)
            self.fc2 = nn.Linear(128, 10)

        def forward(self, x):
            x = self.conv1(x)
            x = F.relu(x)
            x = self.conv2(x)
            x = F.relu(x)
            x = F.max_pool2d(x, 2)
            x = self.dropout1(x)
            x = torch.flatten(x, 1)
            x = self.fc1(x)
            x = F.relu(x)
            x = self.dropout2(x)
            x = self.fc2(x)
            output = F.log_softmax(x, dim=1)
            return output

    def training_data(self):
        # Custom torch Dataloader for MNIST data
        transform = transforms.Compose([transforms.ToTensor(),
                                        transforms.Normalize((0.1307,), (0.3081,))])
        dataset1 = datasets.MNIST(self.dataset_path, train=True, download=False, transform=transform)
        loader_arguments = {'shuffle': True}
        return DataManager(dataset=dataset1, **loader_arguments)

    def training_step(self, data, target):
        output = self.model().forward(data)
        loss = torch.nn.functional.nll_loss(output, target)
        return loss
Define the experiment parameters¶
training_args is used by the researcher to request that the nodes use a GPU for training, if the node has a GPU and offers to use it.
model_args = {}

training_args = {
    'loader_args': { 'batch_size': 48, },
    'optimizer_args': {
        'lr': 1e-3
    },
    'use_gpu': True,  # Activates GPU
    'epochs': 1,
    'dry_run': False,
    'batch_maxnum': 100  # Fast pass for development: only use (batch_maxnum * batch_size) samples
}
Declare and run the experiment¶
All this part is the same as when running a model using CPU: experiment declaration and running are unchanged.
from fedbiomed.researcher.federated_workflows import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage
tags = ['#MNIST', '#dataset']
rounds = 2
exp = Experiment(tags=tags,
                 model_args=model_args,
                 training_plan_class=MyTrainingPlan,
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None)
Let's start the experiment.
By default, this function doesn't stop until all the round_limit rounds are done for all the nodes.
exp.run()
You have completed training a TorchTrainingPlan using a GPU for acceleration.