Training Process with Training Plan Management¶
Introduction¶
Fed-BioMed offers a feature that restricts nodes to running only pre-approved training plans. When a node requires approval, the training plan files sent by a researcher with a training request must be approved on the node side in advance. In this workflow, approval is done by a real person who reviews the code contained in the training plan file/class. The reviewer makes sure the training plan doesn't contain any code that might cause privacy issues or harm the node.
In this tutorial, we will create a node with the training plan control option activated.
Setting Up a Node¶
Training plan control can be enabled either in the config file or through the Fed-BioMed CLI while starting the node. Creating and starting a node with the training plan control option is not very different from setting up a normal node. By default, if no option is specified in the CLI when the node is launched for the first time, the node disables training plan control in the security section of the config file. The section then looks like the snippet below:
[security]
hashing_algorithm = SHA256
allow_default_training_plans = True
training_plan_approval = False
It is also possible to manage training plan approval mode using the environment variables ENABLE_TRAINING_PLAN_APPROVAL=True and ALLOW_DEFAULT_TRAINING_PLANS=True. They override the config file options for the current launch only, so they can be set differently each time the node is started.
- ENABLE_TRAINING_PLAN_APPROVAL=True: This variable enables training plan control for the node. If there isn't a config file for the node while running the CLI, it creates a new config file with training plan approval mode enabled (training_plan_approval = True).
- ALLOW_DEFAULT_TRAINING_PLANS=True: This variable allows default training plans for train requests. These are the training plans that come with the Fed-BioMed tutorials, for example the training plan for the MNIST dataset that we will be using in this tutorial. If the default training plans are enabled, the node updates/registers the training plan files located in the envs/common/default_training_plans directory while it starts. This option has no effect if training plan control is not enabled.
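The override behaviour described above can be pictured with a short sketch. This is not Fed-BioMed's actual implementation: the function name `resolve_security_options` and the rule for parsing boolean strings are assumptions made for illustration, and only the option and variable names come from the text above. It covers the case where a config file already exists and the environment variables override it for one launch.

```python
import configparser

# Hypothetical mapping from environment variable to [security] option.
ENV_TO_OPTION = {
    "ENABLE_TRAINING_PLAN_APPROVAL": "training_plan_approval",
    "ALLOW_DEFAULT_TRAINING_PLANS": "allow_default_training_plans",
}


def resolve_security_options(config_text, environ):
    """Return the effective security options for one node launch.

    Values from `environ` take precedence over the config file, but the
    config text itself is left unchanged (a per-launch override).
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    options = {
        name: parser.getboolean("security", name)
        for name in ENV_TO_OPTION.values()
    }
    for env_name, option in ENV_TO_OPTION.items():
        if env_name in environ:
            # Assumption: "true", "1" and "yes" (any case) mean enabled.
            options[option] = environ[env_name].strip().lower() in ("true", "1", "yes")
    return options


CONFIG = """
[security]
hashing_algorithm = SHA256
allow_default_training_plans = True
training_plan_approval = False
"""

# The config file disables approval, the environment enables it for this launch.
effective = resolve_security_options(CONFIG, {"ENABLE_TRAINING_PLAN_APPROVAL": "True"})
print(effective)
```

Calling the function again with an empty environment returns the values stored in the config file, which mirrors the fact that the override lasts for a single launch.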
Adding MNIST Dataset to The Node.¶
In this section we will add the MNIST dataset to the node. While adding the dataset through the CLI, we'll also set ENABLE_TRAINING_PLAN_APPROVAL=True and ALLOW_DEFAULT_TRAINING_PLANS=True. This will create a new config-n1.ini file with the following configuration.
[security]
hashing_algorithm = SHA256
allow_default_training_plans = True
training_plan_approval = True
Now, let's run the following command.
$ ENABLE_TRAINING_PLAN_APPROVAL=True ALLOW_DEFAULT_TRAINING_PLANS=True ${FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config-n1.ini dataset add
The CLI will ask you to select the dataset type. Since we will be working with the MNIST dataset, please select 2 (default), continue by typing y at the next prompt, and select the folder where you want to store the MNIST dataset. Afterward, if you go to the etc directory of Fed-BioMed, you can see the config-n1.ini file.
Starting the Node¶
Now you can start your node by running the following command:
$ ${FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config-n1.ini start
Since the config file has been configured to enable training plan control mode, you do not need to specify any extra parameter while starting the node. It is also possible to start the node with ENABLE_TRAINING_PLAN_APPROVAL=True and ALLOW_DEFAULT_TRAINING_PLANS=True, or with ENABLE_TRAINING_PLAN_APPROVAL=False and ALLOW_DEFAULT_TRAINING_PLANS=False. If you start your node with ENABLE_TRAINING_PLAN_APPROVAL=False, it will disable training plan control even if it is enabled in the config file.
Creating An Experiment¶
In this section we will be using the default MNIST model, which has already been registered by the node.
The following training plan is the one that will be sent to the node for training. Since training plan files are processed by the Experiment to configure dependencies, the import section of the final file might differ from this one.
import torch
import torch.nn as nn
import torch.nn.functional as F  # needed for F.relu, F.max_pool2d and F.log_softmax below
from fedbiomed.common.training_plans import TorchTrainingPlan
from fedbiomed.common.data import DataManager
from torchvision import datasets, transforms

# Here we define the training plan to be used.
class MyTrainingPlan(TorchTrainingPlan):

    # Defines and return model
    def init_model(self, model_args):
        return self.Net(model_args = model_args)

    # Defines and return optimizer
    def init_optimizer(self, optimizer_args):
        return torch.optim.Adam(self.model().parameters(), lr = optimizer_args["lr"])

    # Declares and return dependencies
    def init_dependencies(self):
        deps = ["from torchvision import datasets, transforms"]
        return deps

    class Net(nn.Module):
        def __init__(self, model_args):
            super().__init__()
            self.conv1 = nn.Conv2d(1, 32, 3, 1)
            self.conv2 = nn.Conv2d(32, 64, 3, 1)
            self.dropout1 = nn.Dropout(0.25)
            self.dropout2 = nn.Dropout(0.5)
            self.fc1 = nn.Linear(9216, 128)
            self.fc2 = nn.Linear(128, 10)

        def forward(self, x):
            x = self.conv1(x)
            x = F.relu(x)
            x = self.conv2(x)
            x = F.relu(x)
            x = F.max_pool2d(x, 2)
            x = self.dropout1(x)
            x = torch.flatten(x, 1)
            x = self.fc1(x)
            x = F.relu(x)
            x = self.dropout2(x)
            x = self.fc2(x)
            output = F.log_softmax(x, dim=1)
            return output

    def training_data(self):
        # Custom torch Dataloader for MNIST data
        transform = transforms.Compose([transforms.ToTensor(),
                                        transforms.Normalize((0.1307,), (0.3081,))])
        dataset1 = datasets.MNIST(self.dataset_path, train=True, download=False, transform=transform)
        loader_arguments = { 'shuffle': True}
        return DataManager(dataset=dataset1, **loader_arguments)

    def training_step(self, data, target):
        output = self.model().forward(data)
        loss = torch.nn.functional.nll_loss(output, target)
        return loss
To see the final training plan file, we first need to initialize the experiment.
from fedbiomed.researcher.federated_workflows import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags = ['#MNIST', '#dataset']
rounds = 2

model_args = {}

training_args = {
    'loader_args': { 'batch_size': 48, },
    'optimizer_args': {
        "lr" : 1e-3
    },
    'epochs': 1,
    'dry_run': False,
    'batch_maxnum': 100 # Fast pass for development : only use ( batch_maxnum * batch_size ) samples
}

exp = Experiment(tags=tags,
                 model_args=model_args,
                 training_plan_class=MyTrainingPlan,
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None)
Getting Final Training Plan File From Experiment¶
The training_plan_file() method displays the training plan file that will be sent to the nodes.
exp.training_plan_file(display = True)
The exp.check_training_plan_status() method sends a request to the experiment's nodes to check whether the training plan is approved. The nodes that receive the request are those that were found when searching for datasets.
status = exp.check_training_plan_status()
status
exp.run_once()
The logs should indicate that the training plan is approved. You can also get the status objects from the result of check_training_plan_status(). It returns a list of status objects, one per node. Since we have launched only a single node, it returns only one status object.
- approval_obligation: Indicates whether training plan control is enabled on the node.
- status: Indicates the training plan approval status.
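The two fields can be read together to predict whether a node would accept a training request. The sketch below is an illustration only: it assumes each status object can be accessed like a dictionary with the keys `approval_obligation` and `status` named above, it uses a hypothetical `node_id` key to identify the node, and it assumes the status is not enforced when approval is not mandatory. The real objects returned by `check_training_plan_status()` may be structured differently.

```python
def nodes_blocking_training(statuses):
    """Return the ids of nodes that would refuse the current training plan.

    A node blocks training only when approval is mandatory on that node
    and the plan is not in the "Approved" state.
    """
    blocking = []
    for entry in statuses:
        if entry["approval_obligation"] and entry["status"] != "Approved":
            blocking.append(entry["node_id"])
    return blocking


# Hand-written example data, not real output from a node.
example_statuses = [
    {"node_id": "node-1", "approval_obligation": True, "status": "Approved"},
    {"node_id": "node-2", "approval_obligation": True, "status": "Pending"},
    {"node_id": "node-3", "approval_obligation": False, "status": "Pending"},
]

print(nodes_blocking_training(example_statuses))
```

With this example data only `node-2` is reported, because it requires approval and the plan is still pending there.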
Changing Training Plan And Testing Training Plan Approval Status¶
Let's change the training plan's network code and test whether it is still approved. We will change the network structure.
import torch
import torch.nn as nn
import torch.nn.functional as F  # needed for F.relu, F.max_pool2d and F.log_softmax below
from fedbiomed.common.training_plans import TorchTrainingPlan
from fedbiomed.common.data import DataManager
from torchvision import datasets, transforms

# Here we define the training plan to be used.
# You can use any class name for the network (here 'Net')
class MyTrainingPlan(TorchTrainingPlan):

    # Defines and return model
    def init_model(self, model_args):
        return self.Net(model_args = model_args)

    # Defines and return optimizer
    def init_optimizer(self, optimizer_args):
        return torch.optim.Adam(self.model().parameters(), lr = optimizer_args["lr"])

    # Declares and return dependencies
    def init_dependencies(self):
        deps = ["from torchvision import datasets, transforms"]
        return deps

    class Net(nn.Module):
        def __init__(self, model_args):
            super().__init__()
            self.conv1 = nn.Conv2d(1, 16, 5, 1, 2)
            self.conv2 = nn.Conv2d(16, 32, 5, 1, 2)
            self.fc1 = nn.Linear(32 * 7 * 7, 10)

        def forward(self, x):
            x = self.conv1(x)
            x = F.relu(x)
            x = F.max_pool2d(x, 2)
            x = self.conv2(x)
            x = F.relu(x)
            x = F.max_pool2d(x, 2)
            x = torch.flatten(x, 1)
            x = self.fc1(x)
            output = F.log_softmax(x, dim=1)
            return output

    def training_data(self):
        # Custom torch Dataloader for MNIST data
        transform = transforms.Compose([transforms.ToTensor(),
                                        transforms.Normalize((0.1307,), (0.3081,))])
        dataset1 = datasets.MNIST(self.dataset_path, train=True, download=False, transform=transform)
        train_kwargs = { 'shuffle': True}
        return DataManager(dataset=dataset1, **train_kwargs)

    def training_step(self, data, target):
        output = self.model().forward(data)
        loss = torch.nn.functional.nll_loss(output, target)
        return loss
In the following cell, we update the training plan class using the setter set_training_plan_class.
exp.set_training_plan_class(MyTrainingPlan, keep_weights=False)
Since we changed the network structure in the experiment (we removed the dropouts and the dense layer fc2), the output of the following method should say that the training plan is not approved by the node, and the is_approved key of the result object should be equal to False.
status = exp.check_training_plan_status()
exp.training_plan_file()
status
Since the training plan is not approved, you won't be able to train your model on the node. The following cell will return an error.
exp.run_once(increase=True)
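One way to picture why even a small change to the network invalidates an earlier approval is to compare digests of the source text. The node configuration shown earlier names `hashing_algorithm = SHA256`; the sketch below assumes that approval is tied to a digest of the training plan source, which this tutorial does not spell out. The function `plan_digest`, its whitespace normalisation, and the `approved_digests` registry are hypothetical, and Fed-BioMed's real procedure may differ.

```python
import hashlib


def plan_digest(source):
    """SHA-256 digest of training plan source, ignoring blank lines and
    trailing whitespace (an illustrative normalisation only)."""
    lines = [line.rstrip() for line in source.splitlines() if line.strip()]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


# Two lines from the original network definition.
approved_source = """
self.conv1 = nn.Conv2d(1, 32, 3, 1)
self.conv2 = nn.Conv2d(32, 64, 3, 1)
"""

# The same two lines after the change made in this section.
modified_source = """
self.conv1 = nn.Conv2d(1, 16, 5, 1, 2)
self.conv2 = nn.Conv2d(16, 32, 5, 1, 2)
"""

# Hypothetical registry of digests the node has approved.
approved_digests = {plan_digest(approved_source)}

print(plan_digest(approved_source) in approved_digests)   # True
print(plan_digest(modified_source) in approved_digests)   # False
```

Under this assumption, any edit to the code that survives normalisation produces a different digest, so the modified plan is treated as a new, unapproved plan.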
Registering and Approving the Training Plan¶
To register/approve the training plan that has been created in the previous section, we can use the Fed-BioMed CLI. In Fed-BioMed, there are two ways of approving a training plan:
- By sending an ApprovalRequest from the researcher to the Node
- By adding it directly to the Node through the training plan registration facility
1. Approving a Training Plan through an ApprovalRequest¶
Fed-BioMed's Experiment interface provides a method to submit a training plan to the Node for approval. The Node can then review the code and approve the training plan using the CLI or GUI.
The Experiment method that sends such a request is training_plan_approve:
exp.training_plan_approve(description="my new training plans")
Once the training plan has been sent, we need to approve it (or reject it) on the Node side.
Before approving, optionally list the training plans known to the node with their status (Approved, Pending, Rejected). Your new training plan should appear with Pending status and the name my new training plan.
$ ${FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config-n1.ini training-plan list
Then approve the training plan by running the following command in a new terminal:
$ ${FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config-n1.ini training-plan approve
Training plans with either Pending or Rejected status will be displayed. Select the training plan you have sent to approve it. You should see a message confirming that the training plan has been successfully approved.
Optionally, list the training plans known to the node again with their status. Your training plan should now appear with Approved status.
$ ${FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config-n1.ini training-plan list
Back on the Researcher side, let's check its status by calling the check_training_plan_status method:
exp.check_training_plan_status()
The training plan's status should have changed from Pending to Approved, which means the model can now be trained on the Node. The Researcher can now run an Experiment on the Node!
exp.run_once(increase=True)
2. Registering a Model through Node interface¶
Alternatively, the training plan can be added directly on the Node side through the CLI's registration facility. First, retrieve the final training plan file from the experiment:
exp.training_plan_file()
The output of exp.training_plan_file() is the file path where the final training plan is saved. It also prints the content of the training plan file. You can get the training plan either from the output cell or from the path where it is saved. In both cases, you need to create a new txt file and copy the training plan content into it. For example, you can create a new directory in Fed-BioMed called my_approved_training_plan, and inside it create a new my-training-plan.txt file containing the training plan class.
$ mkdir ${FEDBIOMED_DIR}/my_approved_training_plan
$ cp <training_plan_path_file> ${FEDBIOMED_DIR}/my_approved_training_plan/my-training-plan.txt
Here, <training_plan_path_file> is the path of the training plan that is returned by exp.training_plan_file(display=False).
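The same copy can also be done from Python, for example from the notebook that runs the experiment. The helper name `copy_training_plan` is hypothetical and not part of Fed-BioMed; the sketch assumes the source path is the value returned by `exp.training_plan_file(display=False)`, as stated above. To keep the example self-contained, it uses a temporary stand-in file instead of a real experiment.

```python
import shutil
import tempfile
from pathlib import Path


def copy_training_plan(source_path, target_dir, file_name="my-training-plan.txt"):
    """Copy a training plan file into `target_dir` and return the new path."""
    target_dir = Path(target_dir)
    # Create the directory if needed, like `mkdir` in the shell commands.
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / file_name
    shutil.copyfile(source_path, destination)
    return destination


# Stand-in for the file written by the experiment.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "training_plan.py"
    source.write_text("class MyTrainingPlan: ...\n")
    copied = copy_training_plan(source, Path(tmp) / "my_approved_training_plan")
    copied_name = copied.name
    copied_text = copied.read_text()

print(copied_name)
```

In a real session, the first argument would be the path returned by the experiment and the second a directory under `${FEDBIOMED_DIR}`.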
Afterward, please run the following command in another terminal to register the training plan file.
$ ${FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config-n1.ini training-plan register
You should type a unique name for your training plan, e.g. 'MyTestTP-1', and a description. The CLI will then ask you to select the training plan file you want to register. Select the file that you saved and continue.
Back on the Researcher side, you should now be able to train the model defined in the training plan.
exp.check_training_plan_status()
exp.run_once(increase=True)
Rejecting training plans¶
On the Node side, it is possible to reject a training plan using the CLI or GUI. Every type of training plan can be Rejected, even Default ones. In Fed-BioMed, Rejected means that the training plan cannot be trained/executed on the Node (but the training plan is still Registered in the database).
Using the CLI, the Node can run:
$ ${FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config-n1.ini training-plan reject
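and select the training plan to be Rejected. Back on the Researcher side, the status check should now report the training plan as Rejected, and running a round should fail.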
exp.check_training_plan_status()
exp.run_once(increase=True)
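The statuses used throughout this tutorial can be summarised as a small state model. The sketch below encodes only what the text states: a plan sent through an approval request starts as Pending, the approve command offers Pending and Rejected plans, and any plan can be rejected. The enum and function names are hypothetical and are not Fed-BioMed's API.

```python
from enum import Enum


class PlanStatus(Enum):
    PENDING = "Pending"
    APPROVED = "Approved"
    REJECTED = "Rejected"


def approve(status):
    """Approve a plan; the CLI offers Pending and Rejected plans for approval."""
    if status in (PlanStatus.PENDING, PlanStatus.REJECTED):
        return PlanStatus.APPROVED
    raise ValueError(f"cannot approve a plan with status {status.value}")


def reject(status):
    """Reject a plan; the text states that every type of plan can be rejected."""
    return PlanStatus.REJECTED


def can_train(status):
    """Only approved plans may be executed on a node with approval enabled."""
    return status is PlanStatus.APPROVED


status = PlanStatus.PENDING
status = approve(status)
print(status.value, can_train(status))
status = reject(status)
print(status.value, can_train(status))
```

A rejected plan stays known to the node in this model, so it can be approved again later without being resubmitted.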