• Home
  • User Documentation
  • About
  • More
    • Funding
    • News
    • Contributors
    • Users
    • Roadmap
    • Contact Us
  • Home
  • User Documentation
  • About
  • More
    • Funding
    • News
    • Contributors
    • Users
    • Roadmap
    • Contact Us
  • Getting Started
    • What's Fed-BioMed
    • Fedbiomed Architecture
    • Fedbiomed Workflow
    • Installation
    • Basic Example
    • Configuration
  • Tutorials
    • PyTorch
      • PyTorch MNIST Basic Example
      • How to Create Your Custom PyTorch Training Plan
      • PyTorch Used Cars Dataset Example
      • PyTorch aggregation methods in Fed-BioMed
      • Transfer-learning in Fed-BioMed tutorial
    • MONAI
      • Federated 2d image classification with MONAI
      • Federated 2d XRay registration with MONAI
    • Scikit-Learn
      • MNIST classification with Scikit-Learn Classifier (Perceptron)
      • Fed-BioMed to train a federated SGD regressor model
      • Implementing other Scikit Learn models for Federated Learning
    • Optimizers
      • Advanced optimizers in Fed-BioMed
    • FLamby
      • General Concepts
      • FLamby integration in Fed-BioMed
    • Advanced
      • In Depth Experiment Configuration
      • PyTorch model training using a GPU
      • Breakpoints
    • Security
      • Using Differential Privacy with OPACUS on Fed-BioMed
      • Local and Central DP with Fed-BioMed: MONAI 2d image registration
      • Training Process with Training Plan Management
      • Training with Secure Aggregation
      • End-to-end Privacy Preserving Training and Inference on Medical Data
    • Biomedical data
      • Brain Segmentation
  • User Guide
    • Glossary
    • Deployment
      • Introduction
      • VPN Deployment
      • Network matrix
      • Security model
    • Node
      • Configuring Nodes
      • Deploying Datasets
      • Training Plan Management
      • Using GPU
      • Node GUI
    • Researcher
      • Training Plan
      • Training Data
      • Experiment
      • Aggregation
      • Listing Datasets and Selecting Nodes
      • Model Validation on the Node Side
      • Tensorboard
    • Optimization
    • Secure Aggregation
      • Introduction
      • Configuration
      • Managing Secure Aggregation in Researcher
  • Developer
    • API Reference
      • Common
        • Certificate Manager
        • CLI
        • Config
        • Constants
        • Data
        • DB
        • Exceptions
        • IPython
        • Json
        • Logger
        • Message
        • Metrics
        • Model
        • MPC controller
        • Optimizers
        • Privacy
        • Secagg
        • Secagg Manager
        • Serializer
        • Singleton
        • Synchro
        • TasksQueue
        • TrainingPlans
        • TrainingArgs
        • Utils
        • Validator
      • Node
        • CLI
        • CLI Utils
        • Config
        • DatasetManager
        • HistoryMonitor
        • Node
        • NodeStateManager
        • Requests
        • Round
        • Secagg
        • Secagg Manager
        • TrainingPlanSecurityManager
      • Researcher
        • Aggregators
        • CLI
        • Config
        • Datasets
        • Federated Workflows
        • Filetools
        • Jobs
        • Monitor
        • NodeStateAgent
        • Requests
        • Secagg
        • Strategies
      • Transport
        • Client
        • Controller
        • NodeAgent
        • Server
    • Usage and Tools
    • Continuous Integration
    • Definition of Done
    • Development Environment
    • Testing in Fed-BioMed
    • RPC Protocol and Messages
  • FAQ & Troubleshooting
Download Notebook

PyTorch model training using a GPU¶

Introduction¶

This example demonstrates using a Nvidia GPU for training a model.

The nodes for this example need to run on a machine providing a Nvidia GPU with enough GPU memory (and from a not-too-old model, so that it is supported by PyTorch).

If GPU doesn't have enough memory you will get a out of memory error at run time.

You can check Fed-BioMed GPU documentation for some background about using GPUs with Fed-BioMed.

Set up the nodes up¶

We need at least 1 node, let's test using 3 nodes.

  1. For each node, add the MNIST dataset :
fedbiomed node --path node-1 dataset add
fedbiomed node --path node-2 dataset add
fedbiomed node --path node-3 dataset add
  • Select option 2 (default) to add MNIST to the node
  • Confirm default tags by hitting "y" and ENTER
  • Pick the folder where MNIST is already downloaded (or where to download MNIST)
  1. Check that your data has been added by executing
fedbiomed node --path node-1 dataset list
fedbiomed node --path node-2 dataset list
fedbiomed node --path node-3 dataset list
  1. Run the first node using
fedbiomed node --path node-1 start --gpu

so that the node offers to use GPU for training, with the default GPU device.

  1. Run the second node using
fedbiomed node --path node-2 start --gpu-only --gpunum 1

so that the node enforces use of GPU for training even if the researcher doesn't request it, and requests using the 2nd GPU (device 1) but will fallback to default device if you don't have 2 GPUs on this machine.

  1. Run the third node using
fedbiomed node --path node-3 start

so that the node doesn't offer to use GPU for training (default behaviour).

  1. Wait until you get Starting task manager for each node, it means you are online.

Define the training plan¶

All this part is the same as when running a model using CPU : model in unchanged

Declare a training plan class to send for training on the node

In [ ]:
Copied!
import torch
import torch.nn as nn
from fedbiomed.common.training_plans import TorchTrainingPlan
from fedbiomed.common.data import DataManager
from torchvision import datasets, transforms


# Here we define the training plan to be used.
class MyTrainingPlan(TorchTrainingPlan):

    # Defines and return model
    def init_model(self, model_args):
        return self.Net(model_args = model_args)

    # Defines and return optimizer
    def init_optimizer(self, optimizer_args):
        return torch.optim.Adam(self.model().parameters(), lr = optimizer_args["lr"])

    # Declares and returns dependencies
    def init_dependencies(self):
        deps = ["from torchvision import datasets, transforms"]
        return deps

    class Net(nn.Module):
        def __init__(self, model_args):
            super().__init__()
            self.conv1 = nn.Conv2d(1, 32, 3, 1)
            self.conv2 = nn.Conv2d(32, 64, 3, 1)
            self.dropout1 = nn.Dropout(0.25)
            self.dropout2 = nn.Dropout(0.5)
            self.fc1 = nn.Linear(9216, 128)
            self.fc2 = nn.Linear(128, 10)

        def forward(self, x):
            x = self.conv1(x)
            x = F.relu(x)
            x = self.conv2(x)
            x = F.relu(x)
            x = F.max_pool2d(x, 2)
            x = self.dropout1(x)
            x = torch.flatten(x, 1)
            x = self.fc1(x)
            x = F.relu(x)
            x = self.dropout2(x)
            x = self.fc2(x)


            output = F.log_softmax(x, dim=1)
            return output

    def training_data(self):
        # Custom torch Dataloader for MNIST data
        transform = transforms.Compose([transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))])
        dataset1 = datasets.MNIST(self.dataset_path, train=True, download=False, transform=transform)
        loader_arguments = { 'shuffle': True}
        return DataManager(dataset=dataset1, **loader_arguments)

    def training_step(self, data, target):
        output = self.model().forward(data)
        loss   = torch.nn.functional.nll_loss(output, target)
        return loss
import torch import torch.nn as nn from fedbiomed.common.training_plans import TorchTrainingPlan from fedbiomed.common.data import DataManager from torchvision import datasets, transforms # Here we define the training plan to be used. class MyTrainingPlan(TorchTrainingPlan): # Defines and return model def init_model(self, model_args): return self.Net(model_args = model_args) # Defines and return optimizer def init_optimizer(self, optimizer_args): return torch.optim.Adam(self.model().parameters(), lr = optimizer_args["lr"]) # Declares and returns dependencies def init_dependencies(self): deps = ["from torchvision import datasets, transforms"] return deps class Net(nn.Module): def __init__(self, model_args): super().__init__() self.conv1 = nn.Conv2d(1, 32, 3, 1) self.conv2 = nn.Conv2d(32, 64, 3, 1) self.dropout1 = nn.Dropout(0.25) self.dropout2 = nn.Dropout(0.5) self.fc1 = nn.Linear(9216, 128) self.fc2 = nn.Linear(128, 10) def forward(self, x): x = self.conv1(x) x = F.relu(x) x = self.conv2(x) x = F.relu(x) x = F.max_pool2d(x, 2) x = self.dropout1(x) x = torch.flatten(x, 1) x = self.fc1(x) x = F.relu(x) x = self.dropout2(x) x = self.fc2(x) output = F.log_softmax(x, dim=1) return output def training_data(self): # Custom torch Dataloader for MNIST data transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]) dataset1 = datasets.MNIST(self.dataset_path, train=True, download=False, transform=transform) loader_arguments = { 'shuffle': True} return DataManager(dataset=dataset1, **loader_arguments) def training_step(self, data, target): output = self.model().forward(data) loss = torch.nn.functional.nll_loss(output, target) return loss

Define the experiment parameters¶

training_args are used by the researcher to request the nodes to use GPU for training, if the node has a GPU and offers to use it.

In [ ]:
Copied!
model_args = {}

training_args = {
    'loader_args': { 'batch_size': 48, },
    'optimizer_args': {
        'lr': 1e-3
    },
    'use_gpu': True, # Activates GPU
    'epochs': 1,
    'dry_run': False,  
    'batch_maxnum': 100 # Fast pass for development : only use ( batch_maxnum * batch_size ) samples
}
model_args = {} training_args = { 'loader_args': { 'batch_size': 48, }, 'optimizer_args': { 'lr': 1e-3 }, 'use_gpu': True, # Activates GPU 'epochs': 1, 'dry_run': False, 'batch_maxnum': 100 # Fast pass for development : only use ( batch_maxnum * batch_size ) samples }

Declare and run the experiment¶

All this part is the same as when running a model using CPU : experiment declaration and running is unchanged

In [ ]:
Copied!
from fedbiomed.researcher.federated_workflows import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['#MNIST', '#dataset']
rounds = 2

exp = Experiment(tags=tags,
                 model_args=model_args,
                 training_plan_class=MyTrainingPlan,
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None)
from fedbiomed.researcher.federated_workflows import Experiment from fedbiomed.researcher.aggregators.fedavg import FedAverage tags = ['#MNIST', '#dataset'] rounds = 2 exp = Experiment(tags=tags, model_args=model_args, training_plan_class=MyTrainingPlan, training_args=training_args, round_limit=rounds, aggregator=FedAverage(), node_selection_strategy=None)

Let's start the experiment.

By default, this function doesn't stop until all the round_limit rounds are done for all the nodes

In [ ]:
Copied!
exp.run()
exp.run()

You have completed training a TorchTrainingPlan using a GPU for acceleration.

Download Notebook
  • Introduction
  • Set up the nodes up
  • Define the training plan
  • Define the experiment parameters
  • Declare and run the experiment
Address:

2004 Rte des Lucioles, 06902 Sophia Antipolis

E-mail:

fedbiomed _at_ inria _dot_ fr

Fed-BioMed © 2022