# Loading Dataset for Training
Datasets on the nodes are stored on disk. Therefore, before training, each node must load these datasets from its file system. Since the type of dataset (image, tabular, etc.) and the way of loading it may vary from one node to another, the user (researcher) should define a method called `training_data`. This method is mandatory for each training plan (`TorchTrainingPlan` and `SkLearnSGDModel`). If it is not defined, nodes will return an error at the very beginning of the first round.
## Defining the Method `training_data`
The method `training_data` defines the logic related to loading data on the node. In particular, it defines:

- the type of data and the `Dataset` class
- any preprocessing (either data transforms, imputation, or augmentation)
- implicitly, the `DataLoader` used for iterating over the data

This method takes no inputs and returns a `DataManager`, therefore its signature is:

```python
def training_data(self) -> fedbiomed.common.data.DataManager
```
The `training_data` method is always part of the training plan, as follows:

```python
from fedbiomed.common.training_plans import TorchTrainingPlan

class MyTrainingPlan(TorchTrainingPlan):
    def __init__(self):
        pass

    # ....

    def training_data(self):
        pass
```
For details on how arguments are passed to the data loader, please refer to the section *Passing Arguments to Data Loaders* below.
## The `DataManager` Return Type

The method `training_data` should always return a Fed-BioMed `DataManager`, defined in the module `fedbiomed.common.data`. `DataManager` has been designed to manage different types of data objects for different types of training plans. It is also responsible for splitting a given dataset into training and validation sets when model validation is activated in the experiment.
### What Is a `DataManager`?

A `DataManager` is a Fed-BioMed concept that makes the link between a `Dataset` and the corresponding `DataLoader`. It has a generic interface that is framework-agnostic (PyTorch, scikit-learn, etc.).
`DataManager` takes two main input arguments: `dataset` and `target`. `dataset` should be an instance of a PyTorch `Dataset`, a NumPy `ndarray`, a `pd.DataFrame`, or a `pd.Series`. The argument `target` should be an instance of a NumPy `ndarray`, a `pd.DataFrame`, or a `pd.Series`. By default, `target` is `None`. If `target` is `None`, the data manager assumes that `dataset` is an object that includes both the input and target variables; this is the case when `dataset` is a PyTorch `Dataset`. If `dataset` is a NumPy array or a pandas `DataFrame`, it is mandatory to provide the `target` variable.
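As a minimal sketch of these two call patterns (the arrays below are hypothetical toy data, and `my_torch_dataset` is a placeholder for a PyTorch `Dataset` instance):

```python
import numpy as np
from fedbiomed.common.data import DataManager

# Array-like inputs: providing `target` is mandatory
X = np.random.rand(100, 4)  # input features (toy data)
y = np.random.rand(100)     # target variable (toy data)
data_manager = DataManager(dataset=X, target=y)

# A PyTorch Dataset bundles inputs and targets, so `target` stays None
# data_manager = DataManager(dataset=my_torch_dataset)
```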
As mentioned, `DataManager` is capable of configuring datasets and data loaders based on the training plan that will be used for training. This configuration is necessary since each training plan requires a different type of data loader/batch iterator.
## Defining Training Data in Different Training Plans
### Defining Training Data for PyTorch-Based Training Plans
In the following code snippet, the `training_data` of a PyTorch-based training plan returns a `DataManager` object instantiated with `dataset` and `target` as NumPy arrays. Since PyTorch-based training requires a PyTorch `DataLoader`, `DataManager` converts these arrays to a proper `torch.utils.data.Dataset` object and creates a PyTorch `DataLoader` to pass to the training loop on the node side.
```python
import pandas as pd
from fedbiomed.common.training_plans import TorchTrainingPlan
from fedbiomed.common.data import DataManager

class MyTrainingPlan(TorchTrainingPlan):
    def init_model(self):
        # ....
        pass

    def init_dependencies(self):
        # ....
        pass

    def init_optimizer(self):
        # ....
        pass

    def training_data(self):
        feature_cols = self.model_args()["feature_cols"]
        dataset = pd.read_csv(self.dataset_path, header=None, delimiter=',')
        X = dataset.iloc[:, 0:feature_cols].values
        y = dataset.iloc[:, feature_cols]
        return DataManager(dataset=X, target=y.values)
```
It is also possible to define a custom PyTorch `Dataset` and use it in the `DataManager` without declaring the argument `target`.
```python
import pandas as pd
import torch
from torch.utils.data import Dataset
from fedbiomed.common.training_plans import TorchTrainingPlan
from fedbiomed.common.data import DataManager

class MyTrainingPlan(TorchTrainingPlan):

    class CSVDataset(Dataset):
        """Custom PyTorch Dataset"""
        def __init__(self, dataset_path, features):
            self.input_file = pd.read_csv(dataset_path, sep=',', index_col=False)
            x_train = self.input_file.iloc[:, 0:features].values
            y_train = self.input_file.iloc[:, features].values
            self.X_train = torch.from_numpy(x_train).float()
            self.Y_train = torch.from_numpy(y_train).float()

        def __len__(self):
            return len(self.Y_train)

        def __getitem__(self, idx):
            return self.X_train[idx], self.Y_train[idx]

    def training_data(self):
        feature_cols = self.model_args()["feature_cols"]
        dataset = self.CSVDataset(self.dataset_path, feature_cols)
        # Keyword arguments forwarded to the PyTorch DataLoader
        loader_kwargs = {'shuffle': True}
        return DataManager(dataset=dataset, **loader_kwargs)
```
`loader_kwargs` contains the arguments that will be used when creating the PyTorch `DataLoader`.

### Defining Training Data for SkLearn-Based Training Plans
The operations in `training_data` for SkLearn-based training plans are not much different from those in a `TorchTrainingPlan`. Currently, SkLearn-based training plans do not require a data loader for training, which means that all samples are used to fit the model. That is why passing `**loader_args` does not make sense for SkLearn-based training plans: these arguments are ignored even if they are set.
```python
import pandas as pd
from fedbiomed.common.training_plans import FedSGDRegressor
from fedbiomed.common.data import DataManager

class SGDRegressorTrainingPlan(FedSGDRegressor):
    def training_data(self):
        num_cols = self.model_args()["number_cols"]
        dataset = pd.read_csv(self.dataset_path, header=None, delimiter=',')
        X = dataset.iloc[:, 0:num_cols].values
        y = dataset.iloc[:, num_cols]
        # No loader arguments: SkLearn-based plans ignore them
        return DataManager(dataset=X, target=y.values)
```
## Preprocessing for Data
Since the method `training_data` is defined by the user, it is possible to perform preprocessing before creating the `DataManager` object. The code snippet below shows a normalization preprocess for the MNIST dataset.
```python
from torchvision import datasets, transforms
from fedbiomed.common.data import DataManager

def training_data(self):
    # Normalize the MNIST images before wrapping them in a DataManager
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.1307,), (0.3081,))])
    dataset_mnist = datasets.MNIST(self.dataset_path, train=True, download=False, transform=transform)
    # Batch size can be set through the experiment's loader_args
    train_kwargs = {'shuffle': True}
    return DataManager(dataset=dataset_mnist, **train_kwargs)
```
Training and validation partitions are created on the node side from the returned `DataManager` object. Therefore, preprocessing in `training_data` is applied to both the validation and the training data.
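As a sketch of how validation is typically activated from the researcher side, assuming the experiment exposes a `test_ratio` training argument (as in recent Fed-BioMed versions):

```python
# Researcher-side training arguments (sketch): `test_ratio` is assumed to
# activate the train/validation split that the node performs on the
# DataManager returned by training_data
training_args = {
    'loader_args': {'batch_size': 48},
    'test_ratio': 0.25,  # hold out 25% of the node's data for validation
}
```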
## Data Loaders
A `DataLoader` is a class that handles the logic of iterating over a given dataset. While a `Dataset` is concerned with loading and preprocessing samples one by one, the `DataLoader` is responsible for (see the sketch after this list):

- calling the dataset's `__getitem__` method when needed
- collating samples into a batch
- shuffling the data at every epoch
- in general, managing the actions related to iterating over a given dataset
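As a minimal illustration of these responsibilities, using the PyTorch `torch.utils.data.DataLoader` (which Fed-BioMed's loaders mirror) on a hypothetical toy dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 10 samples with 3 features each and a scalar target
dataset = TensorDataset(torch.randn(10, 3), torch.randn(10))

# The loader batches, shuffles at every epoch, and calls the dataset's
# __getitem__ under the hood; the dataset itself only indexes samples
loader = DataLoader(dataset, batch_size=4, shuffle=True)
for inputs, targets in loader:
    print(inputs.shape, targets.shape)  # e.g. torch.Size([4, 3]) torch.Size([4])
```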
### Passing Arguments to Data Loaders
All of the key-value pairs contained in the `loader_args` sub-dictionary of `training_args` are passed as keyword arguments to the data loader. Additionally, any keyword arguments passed to the `DataManager` class inside the `training_data` function are also passed as keyword arguments to the data loader.
For example, the following setup:

```python
class MyTrainingPlan(TorchTrainingPlan):
    # ....
    def training_data(self):
        dataset = MyDataset()
        return DataManager(dataset, shuffle=True)

training_args = {
    'loader_args': {
        'batch_size': 5,
        'drop_last': True
    }
}
```

leads to the following data loader definition:

```python
loader = DataLoader(dataset, shuffle=True, batch_size=5, drop_last=True)
```
**Double-specified loader arguments:** keyword arguments passed to the `DataManager` class take precedence over arguments with the same name provided in the `loader_args` dictionary.
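For instance, in a setup like the one below (a sketch reusing the hypothetical `MyDataset` from above), the `shuffle=False` passed to `DataManager` wins over the `shuffle` entry in `loader_args`:

```python
class MyTrainingPlan(TorchTrainingPlan):
    # ....
    def training_data(self):
        dataset = MyDataset()
        # shuffle=False overrides the 'shuffle': True entry in loader_args
        return DataManager(dataset, shuffle=False)

training_args = {
    'loader_args': {
        'batch_size': 5,
        'shuffle': True,  # ignored: DataManager's shuffle=False takes precedence
    }
}

# Resulting loader: DataLoader(dataset, shuffle=False, batch_size=5)
```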
For PyTorch and scikit-learn experiments, the `DataLoader`s have been heavily inspired by the `torch.utils.data.DataLoader` class, so please refer to that documentation for the meaning of the supported keyword arguments.
## Conclusion

`training_data` should be provided in each training plan. The way it is defined is almost the same for each framework's training plan, as long as the structure of the dataset is simple. Since the method is defined by the user, it provides the flexibility to load and pre-process complex datasets distributed across the nodes. However, this method is executed on the node side; therefore, typos or missing arguments may cause errors on the nodes even if they raise no errors on the researcher side.