Classes that simplify imports from fedbiomed.common.data
Classes
DataLoadingBlock
DataLoadingBlock()
Bases: ABC
The building blocks of a DataLoadingPlan.
A DataLoadingBlock describes an intermediary layer between the researcher and the node's filesystem. It allows the node to specify a customization in the way data is "perceived" by the data loaders during training.
A DataLoadingBlock is identified by its type_id attribute. Thus, this attribute should be unique among all DataLoadingBlockTypes in the same DataLoadingPlan. Moreover, we may test equality between a DataLoadingBlock and a string by checking its type_id, as a means of easily testing whether a DataLoadingBlock is contained in a collection.
Correct usage of this class requires creating ad-hoc subclasses. The DataLoadingBlock class is not intended to be instantiated directly.
Subclasses of DataLoadingBlock must respect the following conditions:
- implement a default constructor
- the implemented constructor must call
super().__init__()
- extend the serialize(self) and the deserialize(self, load_from: dict) functions
- both serialize and deserialize must call super's serialize and deserialize respectively
- the deserialize function must always return self
- the serialize function must update the dict returned by super's serialize
- implement an apply function that takes arbitrary arguments and applies the logic of the loading_block
- update the _validation_scheme to define rules for all new fields returned by the serialize function
Attributes:
Name | Type | Description |
---|---|---|
__serialization_id | str | identifies one serialized instance of the DataLoadingBlock |
Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self):
    self.__serialization_id = 'serialized_dlb_' + str(uuid.uuid4())
    self._serialization_validator = SerializationValidation()
    self._serialization_validator.update_validation_scheme(SerializationValidation.dlb_default_scheme())
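To make the subclassing conditions listed above concrete, here is a minimal sketch of a custom loading block. The class name, the prefix field and the apply logic are illustrative assumptions, not part of the library; only the calls to the parent class and to update_validation_scheme follow the documented requirements.

from fedbiomed.common.data import DataLoadingBlock

class PrefixMappingBlock(DataLoadingBlock):
    """Hypothetical block that prepends a node-specific path prefix."""

    def __init__(self):
        super().__init__()   # required by the conditions above
        self.prefix = ""     # new field, must also be serialized
        # declare validation rules for every new serialized field
        self._serialization_validator.update_validation_scheme({
            'prefix': {'rules': [str], 'required': True}
        })

    def serialize(self) -> dict:
        # update the dict returned by super's serialize
        serialized = super().serialize()
        serialized.update({'prefix': self.prefix})
        return serialized

    def deserialize(self, load_from: dict) -> 'PrefixMappingBlock':
        # call super's deserialize, restore the new field, and return self
        super().deserialize(load_from)
        self.prefix = load_from['prefix']
        return self

    def apply(self, path: str) -> str:
        # arbitrary arguments; here the loading-block logic is a simple prefixing
        return self.prefix + path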
DataLoadingPlan
DataLoadingPlan(*args, **kwargs)
Bases: Dict[DataLoadingBlockTypes, DataLoadingBlock]
Customizations to the way the data is loaded and presented for training.
A DataLoadingPlan is a dictionary of {name: DataLoadingBlock} pairs. Each DataLoadingBlock represents a customization to the way data is loaded and presented to the researcher. These customizations are defined by the node, but they operate on a Dataset class, which is defined by the library and instantiated by the researcher.
To exploit this functionality, a Dataset must be modified to accept the customizations provided by the DataLoadingPlan. To simplify this process, we provide the DataLoadingPlanMixin class below.
The DataLoadingPlan class should be instantiated directly; no subclassing is needed. The DataLoadingPlan is a dict and exposes the same interface as a dict.
Attributes:
Name | Type | Description |
---|---|---|
dlp_id | str | a unique plan id (auto-generated) |
desc | str | an optional user-friendly short description |
target_dataset_type | DatasetTypes | the type of dataset targeted by this DataLoadingPlan |
Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self, *args, **kwargs):
    super(DataLoadingPlan, self).__init__(*args, **kwargs)
    self.dlp_id = 'dlp_' + str(uuid.uuid4())
    self.desc = ""
    self.target_dataset_type = DatasetTypes.NONE
    self._serialization_validation = SerializationValidation()
    self._serialization_validation.update_validation_scheme(SerializationValidation.dlp_default_scheme())
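A minimal usage sketch: the key enum and the mapping values below are illustrative assumptions (the import path of DataLoadingBlockTypes may differ in your installation); MapperBlock is documented further down this page.

from enum import Enum
from fedbiomed.common.constants import DataLoadingBlockTypes  # assumed import path
from fedbiomed.common.data import DataLoadingPlan, MapperBlock

class MyLoadingBlockTypes(DataLoadingBlockTypes, Enum):
    """Hypothetical keys identifying the blocks of this plan."""
    MODALITY_FOLDERS = 'modality_folders'

# Map "logical" modality names to the folder names used on this node
mapper = MapperBlock()
mapper.map = {'T1': 'T1_weighted_images', 'label': 'segmentations'}

dlp = DataLoadingPlan({MyLoadingBlockTypes.MODALITY_FOLDERS: mapper})
dlp.desc = "Rename modality folders for this node"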
DataLoadingPlanMixin
DataLoadingPlanMixin()
Utility class to enable DLP functionality in a dataset.
Any Dataset class that inherits from [DataLoadingPlanMixin] will have the basic tools necessary to support a DataLoadingPlan. Typically, the logic of each specific DataLoadingBlock in the DataLoadingPlan will be implemented in the form of hooks that are called within the Dataset's implementation using the helper function apply_dlb defined below.
Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self):
    self._dlp = None
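A sketch of how a dataset typically wires a hook through the mixin. The apply_dlb(default_value, key, *args) signature and the MyLoadingBlockTypes key (from the DataLoadingPlan sketch above) are assumptions for illustration.

from fedbiomed.common.data import DataLoadingPlanMixin

class MyDataset(DataLoadingPlanMixin):
    """Hypothetical dataset whose folder lookup can be customized by a DLP."""

    def __init__(self, default_folders: dict):
        super().__init__()
        self._default_folders = default_folders

    def folder_for(self, modality: str) -> str:
        # If a DataLoadingPlan containing the key is set, the corresponding
        # block's apply() is called with the extra arguments; otherwise the
        # default value is returned unchanged.
        return self.apply_dlb(self._default_folders.get(modality),
                              MyLoadingBlockTypes.MODALITY_FOLDERS,
                              modality)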
DataManager
DataManager(dataset, target=None, **kwargs)
Bases: object
Factory class that builds different data loaders/datasets based on the type of dataset. The argument `dataset` should be provided as a `torch.utils.data.Dataset` object to be used in PyTorch training.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | Union[ndarray, DataFrame, Series, Dataset] | Dataset object. It can be an instance of PyTorch Dataset or Tuple. | required |
target | Union[ndarray, DataFrame, Series] | Target variable or variables. | None |
**kwargs | dict | Additional parameters that are going to be used for data loader | {} |
Source code in fedbiomed/common/data/_data_manager.py
def __init__(self,
             dataset: Union[np.ndarray, pd.DataFrame, pd.Series, Dataset],
             target: Union[np.ndarray, pd.DataFrame, pd.Series] = None,
             **kwargs: dict) -> None:
    """Constructor of DataManager.

    Args:
        dataset: Dataset object. It can be an instance of PyTorch Dataset or Tuple.
        target: Target variable or variables.
        **kwargs: Additional parameters that are going to be used for data loader.
    """
    # TODO: Improve datamanager for auto loading by given dataset_path and other information
    # such as inputs variable indexes and target variables indexes
    self._dataset = dataset
    self._target = target
    self._loader_arguments: Dict = kwargs
    self._data_manager_instance = None
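A minimal construction sketch; the array shapes and the batch_size loader argument are illustrative.

import numpy as np
from fedbiomed.common.data import DataManager

# Tabular case: features and target as numpy arrays; extra keyword arguments
# are stored as loader arguments for the data loader built later on.
features = np.random.rand(100, 5)
labels = np.random.randint(0, 2, size=100)
data_manager = DataManager(dataset=features, target=labels, batch_size=16)

# PyTorch case: pass a torch.utils.data.Dataset instance directly.
# data_manager = DataManager(dataset=my_torch_dataset, batch_size=16)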
FlambyDataset
FlambyDataset()
Bases: DataLoadingPlanMixin, Dataset
A federated Flamby dataset.
A FlambyDataset is a wrapper around a flamby FedClass instance, adding functionalities and interfaces that are specific to Fed-BioMed.
A FlambyDataset is always created in an empty state, and it requires a DataLoadingPlan to be finalized to a correct state. The DataLoadingPlan must contain at least the following DataLoadingBlock key-value pair:
- FlambyLoadingBlockTypes.FLAMBY_DATASET_METADATA : FlambyDatasetMetadataBlock
The lifecycle of the DataLoadingPlan and the wrapped FedClass are tightly interlinked: when the DataLoadingPlan is set, the wrapped FedClass is initialized and instantiated. When the DataLoadingPlan is cleared, the wrapped FedClass is also cleared. Hence, an invariant of this class is that the self._dlp and self.__flamby_fed_class should always be either both None, or both set to some value.
Attributes:
Name | Type | Description |
---|---|---|
_transform | MonaiTransform or TorchTransform | a transform function that will be applied to every sample when data is loaded. |
__flamby_fed_class | FedClass | a private instance of the wrapped Flamby FedClass |
Source code in fedbiomed/common/data/_flamby_dataset.py
def __init__(self):
    super().__init__()
    self.__flamby_fed_class = None
    self._transform = None
FlambyDatasetMetadataBlock
FlambyDatasetMetadataBlock()
Bases: DataLoadingBlock
Metadata about a Flamby Dataset.
Includes information on:
- identity of the type of flamby dataset (e.g. fed_ixi, fed_heart, etc.)
- the ID of the center of the flamby dataset
Source code in fedbiomed/common/data/_flamby_dataset.py
def __init__(self):
    super().__init__()
    self.metadata = {
        "flamby_dataset_name": None,
        "flamby_center_id": None
    }
    self._serialization_validator.update_validation_scheme(
        FlambyDatasetMetadataBlock._extra_validation_scheme())
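A sketch of how a node might declare the required metadata and attach it to a FlambyDataset. The set_dlp call and the metadata values (fed_ixi, center 0) are assumptions for illustration; the required key-value pair is the one documented for FlambyDataset above.

from fedbiomed.common.data import (DataLoadingPlan, FlambyDataset,
                                   FlambyDatasetMetadataBlock, FlambyLoadingBlockTypes)

# Declare which flamby dataset and which center this node serves
metadata_block = FlambyDatasetMetadataBlock()
metadata_block.metadata["flamby_dataset_name"] = "fed_ixi"  # illustrative value
metadata_block.metadata["flamby_center_id"] = 0             # illustrative value

dlp = DataLoadingPlan({FlambyLoadingBlockTypes.FLAMBY_DATASET_METADATA: metadata_block})

dataset = FlambyDataset()
dataset.set_dlp(dlp)  # assumed mixin setter; initializes the wrapped FedClass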
FlambyLoadingBlockTypes
Bases: DataLoadingBlockTypes, Enum
Additional DataLoadingBlockTypes specific to Flamby data
MapperBlock
MapperBlock()
Bases: DataLoadingBlock
A DataLoadingBlock for mapping values.
This DataLoadingBlock can be used whenever an "indirect mapping" is needed. For example, it can be used to implement a correspondence between a set of "logical" abstract names and a set of folder names on the filesystem.
The apply function of this DataLoadingBlock takes a "key" as input (a str) and returns the mapped value corresponding to map[key]. Note that while the constructor of this class sets a value for type_id, developers are recommended to set a more meaningful value that better speaks to their application.
Multiple instances of this loading_block may be used in the same DataLoadingPlan, provided that they are given different type_id via the constructor.
Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self):
    super(MapperBlock, self).__init__()
    self.map = {}
    self._serialization_validator.update_validation_scheme(MapperBlock._extra_validation_scheme())
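For instance (a minimal sketch; the mapping values are arbitrary):

from fedbiomed.common.data import MapperBlock

block = MapperBlock()
block.map = {'T1': 'T1_weighted_images', 'T2': 'T2_weighted_images'}

block.apply('T1')               # returns 'T1_weighted_images'
serialized = block.serialize()  # includes the map, validated on deserialization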
MedicalFolderBase
MedicalFolderBase(root=None)
Bases: DataLoadingPlanMixin
Controller class for Medical Folder dataset.
Contains methods to validate the MedicalFolder folder hierarchy and extract folder-based metadata information such as modalities, number of subjects, etc.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root | Union[str, Path, None] | path to Medical Folder root folder. | None |
Source code in fedbiomed/common/data/_medical_datasets.py
def __init__(self, root: Union[str, Path, None] = None):
    """Constructs MedicalFolderBase.

    Args:
        root: path to Medical Folder root folder.
    """
    super(MedicalFolderBase, self).__init__()

    if root is not None:
        root = self.validate_MedicalFolder_root_folder(root)

    self._root = root
MedicalFolderController
MedicalFolderController(root=None)
Bases: MedicalFolderBase
Utility class to construct and verify Medical Folder datasets without knowledge of the experiment.
The purpose of this class is to enable key functionalities related to the MedicalFolderDataset at the time of dataset deployment, i.e. when the data is being added to the node's database.
Specifically, the MedicalFolderController class can be used to:
- construct a MedicalFolderDataset with all available data modalities, without knowing which ones will be used as targets or features during an experiment
- validate that the proper folder structure has been respected by the data managers preparing the data
- identify which subjects have which modalities
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root | str | Folder path to dataset. Defaults to None. | None |
Source code in fedbiomed/common/data/_medical_datasets.py
def __init__(self, root: str = None):
    """Constructs MedicalFolderController.

    Args:
        root: Folder path to dataset. Defaults to None.
    """
    super(MedicalFolderController, self).__init__(root=root)
MedicalFolderDataset
MedicalFolderDataset(root, data_modalities='T1', transform=None, target_modalities='label', target_transform=None, demographics_transform=None, tabular_file=None, index_col=None)
Bases: Dataset, MedicalFolderBase
Torch dataset following the Medical Folder Structure.
The Medical Folder structure is loosely inspired by the BIDS standard [1]. It should respect the following pattern:
└─ MedicalFolder_root/
    ├─ demographics.csv
    └─ sub-01/
        ├─ T1/
        │  └─ sub-01_xxx.nii.gz
        └─ T2/
           └─ sub-01_xxx.nii.gz
[1] https://bids.neuroimaging.io/
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root | Union[str, PathLike, Path] | Root folder containing all the subject directories. | required |
data_modalities | Optional[Union[str, Iterable[str]]] | Modality or modalities to be used as data sources. | 'T1' |
transform | Union[Callable, Dict[str, Callable]] | A function or dict of function transform(s) that preprocess each data source. | None |
target_modalities | Optional[Union[str, Iterable[str]]] | Modality or modalities to be used as target sources. | 'label' |
target_transform | Union[Callable, Dict[str, Callable]] | A function or dict of function transform(s) that preprocess each target source. | None |
demographics_transform | Optional[Callable] | TODO | None |
tabular_file | Union[str, PathLike, Path, None] | Path to a CSV or Excel file containing the demographic information from the patients. | None |
index_col | Union[int, str, None] | Column name in the tabular file containing the subject ids which must match the folder names. | None |
Source code in fedbiomed/common/data/_medical_datasets.py
def __init__(self,
             root: Union[str, PathLike, Path],
             data_modalities: Optional[Union[str, Iterable[str]]] = 'T1',
             transform: Union[Callable, Dict[str, Callable]] = None,
             target_modalities: Optional[Union[str, Iterable[str]]] = 'label',
             target_transform: Union[Callable, Dict[str, Callable]] = None,
             demographics_transform: Optional[Callable] = None,
             tabular_file: Union[str, PathLike, Path, None] = None,
             index_col: Union[int, str, None] = None,
             ):
    """Constructor for class `MedicalFolderDataset`.

    Args:
        root: Root folder containing all the subject directories.
        data_modalities (str, Iterable): Modality or modalities to be used as data sources.
        transform: A function or dict of function transform(s) that preprocess each data source.
        target_modalities (str, Iterable): Modality or modalities to be used as target sources.
        target_transform: A function or dict of function transform(s) that preprocess each target source.
        demographics_transform: TODO
        tabular_file: Path to a CSV or Excel file containing the demographic information from the patients.
        index_col: Column name in the tabular file containing the subject ids which must match the folder names.
    """
    super(MedicalFolderDataset, self).__init__(root=root)

    self._tabular_file = tabular_file
    self._index_col = index_col

    self._data_modalities = [data_modalities] if isinstance(data_modalities, str) else data_modalities
    self._target_modalities = [target_modalities] if isinstance(target_modalities, str) else target_modalities

    self._transform = self._check_and_reformat_transforms(transform, data_modalities)
    self._target_transform = self._check_and_reformat_transforms(target_transform, target_modalities)
    self._demographics_transform = demographics_transform if demographics_transform is not None else lambda x: {}

    # Image loader
    self._reader = Compose([
        LoadImage(ITKReader(), image_only=True),
        ToTensor()
    ])
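A construction sketch; the paths, modality names and index column below are placeholders to adapt to your own folder layout.

from fedbiomed.common.data import MedicalFolderDataset

dataset = MedicalFolderDataset(
    root='/path/to/MedicalFolder_root',                           # placeholder
    data_modalities=['T1', 'T2'],
    target_modalities='label',
    tabular_file='/path/to/MedicalFolder_root/demographics.csv',  # placeholder
    index_col='subject_id',                                       # placeholder column name
)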
MedicalFolderLoadingBlockTypes
Bases: DataLoadingBlockTypes
, Enum
NIFTIFolderDataset
NIFTIFolderDataset(root, transform=None, target_transform=None)
Bases: Dataset
A generic class for loading NIFTI images using the folder structure as the target classes' labels.
Supported formats: NIFTI and compressed NIFTI files (`.nii`, `.nii.gz`).
This is a Dataset useful in classification tasks. Its usage is simple and quite similar to `torchvision.datasets.ImageFolder`. Images must be contained in first-level sub-folders (level 2+ sub-folders are ignored) that describe the target class they belong to (the target class label is the name of the folder).
nifti_dataset_root_folder
├── control_group
│   ├── subject_1.nii
│   ├── subject_2.nii
│   └── ...
└── disease_group
    ├── subject_3.nii
    ├── subject_4.nii
    └── ...
In this example, there are 4 samples (one from each *.nii file) and 2 target classes, with labels `control_group` and `disease_group`. `subject_1.nii` has class label `control_group`, `subject_3.nii` has class label `disease_group`, etc.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root | Union[str, PathLike, Path] | folder where the data is located. | required |
transform | Union[Callable, None] | transforms to be applied on data. | None |
target_transform | Union[Callable, None] | transforms to be applied on target indexes. | None |
Raises:
Type | Description |
---|---|
FedbiomedDatasetError | bad argument type |
FedbiomedDatasetError | bad root path |
Source code in fedbiomed/common/data/_medical_datasets.py
def __init__(self, root: Union[str, PathLike, Path],
             transform: Union[Callable, None] = None,
             target_transform: Union[Callable, None] = None
             ):
    """Constructor of the class

    Args:
        root: folder where the data is located.
        transform: transforms to be applied on data.
        target_transform: transforms to be applied on target indexes.

    Raises:
        FedbiomedDatasetError: bad argument type
        FedbiomedDatasetError: bad root path
    """
    # check parameters type
    for tr, trname in ((transform, 'transform'), (target_transform, 'target_transform')):
        if not callable(tr) and tr is not None:
            raise FedbiomedDatasetError(f"{ErrorNumbers.FB612.value}: Parameter {trname} has incorrect "
                                        f"type {type(tr)}, cannot create dataset.")
    if not isinstance(root, str) and not isinstance(root, PathLike) and not isinstance(root, Path):
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB612.value}: Parameter `root` has incorrect type "
                                    f"{type(root)}, cannot create dataset.")

    # initialize object variables
    self._files = []
    self._class_labels = []
    self._targets = []

    try:
        self._root_dir = Path(root).expanduser()
    except RuntimeError as e:
        raise FedbiomedDatasetError(
            f"{ErrorNumbers.FB612.value}: Cannot expand path {root}, error message is: {e}")

    self._transform = transform
    self._target_transform = target_transform
    self._reader = Compose([
        LoadImage(ITKReader(), image_only=True),
        ToTensor()
    ])

    self._explore_root_folder()
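A minimal usage sketch for the folder layout shown above; the root path is a placeholder and the item access shown assumes the usual (sample, target) convention of a torch classification Dataset.

from fedbiomed.common.data import NIFTIFolderDataset

dataset = NIFTIFolderDataset(root='/path/to/nifti_dataset_root_folder')
image, label = dataset[0]   # assumed (sample, target-index) item access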
NPDataLoader
NPDataLoader(dataset, target, batch_size=1, shuffle=False, random_seed=None, drop_last=False)
DataLoader for a Numpy dataset.
This data loader encapsulates a dataset composed of numpy arrays and presents an Iterable interface. One design principle was to try to make the interface as similar as possible to a torch.DataLoader.
Attributes:
Name | Type | Description |
---|---|---|
_dataset | np.ndarray | a 2d array of features |
_target | np.ndarray | an optional array of target values |
_batch_size | int | the number of elements in one batch |
_shuffle | bool | if True, shuffle the data at the beginning of every epoch |
_drop_last | bool | if True, drop the last batch if it does not contain batch_size elements |
_rng | np.random.Generator | the random number generator for shuffling |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | ndarray | 2D Numpy array | required |
target | ndarray | Numpy array of target values | required |
batch_size | int | batch size for each iteration | 1 |
shuffle | bool | shuffle before iteration | False |
random_seed | Optional[int] | an optional integer to set the numpy random seed for shuffling. If it equals None, then no attempt will be made to set the random seed. | None |
drop_last | bool | whether to drop the last batch in case it does not fill the whole batch size | False |
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def __init__(self,
             dataset: np.ndarray,
             target: np.ndarray,
             batch_size: int = 1,
             shuffle: bool = False,
             random_seed: Optional[int] = None,
             drop_last: bool = False):
    """Construct numpy data loader

    Args:
        dataset: 2D Numpy array
        target: Numpy array of target values
        batch_size: batch size for each iteration
        shuffle: shuffle before iteration
        random_seed: an optional integer to set the numpy random seed for shuffling. If it equals
            None, then no attempt will be made to set the random seed.
        drop_last: whether to drop the last batch in case it does not fill the whole batch size
    """
    if not isinstance(dataset, np.ndarray) or not isinstance(target, np.ndarray):
        msg = f"{ErrorNumbers.FB609.value}. Wrong input type for `dataset` or `target` in NPDataLoader. " \
              f"Expected type np.ndarray for both, instead got {type(dataset)} and " \
              f"{type(target)} respectively."
        logger.error(msg)
        raise FedbiomedTypeError(msg)

    # If the researcher gave a 1-dimensional dataset, we expand it to 2 dimensions
    if dataset.ndim == 1:
        logger.info("NPDataLoader expanding 1-dimensional dataset to become 2-dimensional.")
        dataset = dataset[:, np.newaxis]

    # If the researcher gave a 1-dimensional target, we expand it to 2 dimensions
    if target.ndim == 1:
        logger.info("NPDataLoader expanding 1-dimensional target to become 2-dimensional.")
        target = target[:, np.newaxis]

    if dataset.ndim != 2 or target.ndim != 2:
        msg = f"{ErrorNumbers.FB609.value}. Wrong shape for `dataset` or `target` in NPDataLoader. " \
              f"Expected 2-dimensional arrays, instead got {dataset.ndim}-dimensional " \
              f"and {target.ndim}-dimensional arrays respectively."
        logger.error(msg)
        raise FedbiomedValueError(msg)

    if len(dataset) != len(target):
        msg = f"{ErrorNumbers.FB609.value}. Inconsistent length for `dataset` and `target` in NPDataLoader. " \
              f"Expected same length, instead got len(dataset)={len(dataset)}, len(target)={len(target)}"
        logger.error(msg)
        raise FedbiomedValueError(msg)

    if not isinstance(batch_size, int):
        msg = f"{ErrorNumbers.FB609.value}. Wrong type for `batch_size` parameter of NPDataLoader. Expected a " \
              f"non-zero positive integer, instead got type {type(batch_size)}."
        logger.error(msg)
        raise FedbiomedTypeError(msg)

    if batch_size <= 0:
        msg = f"{ErrorNumbers.FB609.value}. Wrong value for `batch_size` parameter of NPDataLoader. Expected a " \
              f"non-zero positive integer, instead got value {batch_size}."
        logger.error(msg)
        raise FedbiomedValueError(msg)

    if not isinstance(shuffle, bool):
        msg = f"{ErrorNumbers.FB609.value}. Wrong type for `shuffle` parameter of NPDataLoader. Expected `bool`, " \
              f"instead got {type(shuffle)}."
        logger.error(msg)
        raise FedbiomedTypeError(msg)

    if not isinstance(drop_last, bool):
        msg = f"{ErrorNumbers.FB609.value}. Wrong type for `drop_last` parameter of NPDataLoader. " \
              f"Expected `bool`, instead got {type(drop_last)}."
        logger.error(msg)
        raise FedbiomedTypeError(msg)

    if random_seed is not None and not isinstance(random_seed, int):
        msg = f"{ErrorNumbers.FB609.value}. Wrong type for `random_seed` parameter of NPDataLoader. " \
              f"Expected int or None, instead got {type(random_seed)}."
        logger.error(msg)
        raise FedbiomedTypeError(msg)

    self._dataset = dataset
    self._target = target
    self._batch_size = batch_size
    self._shuffle = shuffle
    self._drop_last = drop_last
    self._rng = np.random.default_rng(random_seed)
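A usage sketch; the data is random and the iteration below assumes batches are yielded as (data, target) pairs, as with a torch DataLoader.

import numpy as np
from fedbiomed.common.data import NPDataLoader

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)   # 1d target, expanded to 2d internally

loader = NPDataLoader(dataset=X, target=y, batch_size=16, shuffle=True,
                      random_seed=42, drop_last=True)

for batch_data, batch_target in loader:  # assumed (data, target) batches
    pass  # e.g. one training step per batch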
SerializationValidation
SerializationValidation()
Provides validation capabilities for serializing/deserializing a [DataLoadingBlock] or [DataLoadingPlan].
When a developer inherits from [DataLoadingBlock] to define a custom loading block, they are required to call the `_serialization_validator.update_validation_scheme` function with a dictionary argument containing the rules to validate all the additional fields that will be used in the serialization of their loading block.
These rules must follow the syntax explained in the SchemeValidator class.
For example:
class MyLoadingBlock(DataLoadingBlock):
    def __init__(self):
        super().__init__()
        self.my_custom_data = {}
        self._serialization_validator.update_validation_scheme({
            'custom_data': {
                'rules': [dict, ...any other rules],
                'required': True
            }
        })

    def serialize(self):
        serialized = super().serialize()
        serialized.update({'custom_data': self.my_custom_data})
        return serialized
Attributes:
Name | Type | Description |
---|---|---|
_validation_scheme | dict | an extensible set of rules to validate the DataLoadingBlock metadata. |
Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self):
    self._validation_scheme = {}
SkLearnDataManager
SkLearnDataManager(inputs, target, **kwargs)
Bases: object
Wrapper for `pd.DataFrame`, `pd.Series` and `np.ndarray` datasets.
Manages datasets for scikit-learn based model training. Responsible for managing inputs and target variables that have been provided in `training_data` of scikit-learn based training plans.
The loader arguments will be passed to the [fedbiomed.common.data.NPDataLoader] classes instantiated when split is called. They may include batch_size, shuffle, drop_last, and others. Please see the [fedbiomed.common.data.NPDataLoader] class for more details.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | Union[ndarray, DataFrame, Series] | Independent variables (inputs, features) for model training | required |
target | Union[ndarray, DataFrame, Series] | Dependent variable/s (target) for model training and validation | required |
**kwargs | dict | Loader arguments | {} |
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def __init__(self,
             inputs: Union[np.ndarray, pd.DataFrame, pd.Series],
             target: Union[np.ndarray, pd.DataFrame, pd.Series],
             **kwargs: dict):
    """Construct a SkLearnDataManager from an array of inputs and an array of targets.

    The loader arguments will be passed to the [fedbiomed.common.data.NPDataLoader] classes instantiated
    when split is called. They may include batch_size, shuffle, drop_last, and others. Please see the
    [fedbiomed.common.data.NPDataLoader] class for more details.

    Args:
        inputs: Independent variables (inputs, features) for model training
        target: Dependent variable/s (target) for model training and validation
        **kwargs: Loader arguments
    """
    if not isinstance(inputs, (np.ndarray, pd.DataFrame, pd.Series)) or \
            not isinstance(target, (np.ndarray, pd.DataFrame, pd.Series)):
        msg = f"{ErrorNumbers.FB609.value}. Parameters `inputs` and `target` for " \
              f"initialization of {self.__class__.__name__} should be one of np.ndarray, pd.DataFrame, pd.Series"
        logger.error(msg)
        raise FedbiomedTypeError(msg)

    # Convert pd.DataFrame or pd.Series to np.ndarray for `inputs`
    if isinstance(inputs, (pd.DataFrame, pd.Series)):
        self._inputs = inputs.to_numpy()
    else:
        self._inputs = inputs

    # Convert pd.DataFrame or pd.Series to np.ndarray for `target`
    if isinstance(target, (pd.DataFrame, pd.Series)):
        self._target = target.to_numpy()
    else:
        self._target = target

    # Additional loader arguments
    self._loader_arguments = kwargs

    # Subset None means that train/validation split has not been performed
    self._subset_test: Union[Tuple[np.ndarray, np.ndarray], None] = None
    self._subset_train: Union[Tuple[np.ndarray, np.ndarray], None] = None
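A construction sketch; the arrays are random and the loader arguments are only stored here, to be forwarded to the NPDataLoader instances created when the train/validation split is performed.

import numpy as np
from fedbiomed.common.data import SkLearnDataManager

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

data_manager = SkLearnDataManager(inputs=X, target=y, batch_size=16, shuffle=True)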
TabularDataset
TabularDataset(inputs, target)
Bases: Dataset
Torch-based Dataset object to create a torch Dataset from given numpy or dataframe type input and target variables.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | Union[ndarray, DataFrame, Series] | Input variables that will be passed to network | required |
target | Union[ndarray, DataFrame, Series] | Target variable for output layer | required |
Raises:
Type | Description |
---|---|
FedbiomedTorchDatasetError | If input variables and target variable do not have equal length/size |
Source code in fedbiomed/common/data/_tabular_dataset.py
def __init__(self,
             inputs: Union[np.ndarray, pd.DataFrame, pd.Series],
             target: Union[np.ndarray, pd.DataFrame, pd.Series]):
    """Constructs PyTorch dataset object

    Args:
        inputs: Input variables that will be passed to network
        target: Target variable for output layer

    Raises:
        FedbiomedTorchDatasetError: If input variables and target variable do not have
            equal length/size
    """
    # Inputs and target variable should be converted to the torch tensors
    # PyTorch provides `from_numpy` function to convert numpy arrays to
    # torch tensor. Therefore, if the arguments `inputs` and `target` are
    # instance one of `pd.DataFrame` or `pd.Series`, they should be converted to
    # numpy arrays
    if isinstance(inputs, (pd.DataFrame, pd.Series)):
        self.inputs = inputs.to_numpy()
    elif isinstance(inputs, np.ndarray):
        self.inputs = inputs
    else:
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB610.value}: The argument `inputs` should be "
                                    f"an instance one of np.ndarray, pd.DataFrame or pd.Series")

    # Configuring self.target attribute
    if isinstance(target, (pd.DataFrame, pd.Series)):
        self.target = target.to_numpy()
    elif isinstance(target, np.ndarray):
        self.target = target
    else:
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB610.value}: The argument `target` should be "
                                    f"an instance one of np.ndarray, pd.DataFrame or pd.Series")

    # The lengths should be equal
    if len(self.inputs) != len(self.target):
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB610.value}: Length of input variables and target "
                                    f"variable does not match. Please make sure that they have "
                                    f"equal size while creating the method `training_data` of "
                                    f"TrainingPlan")

    # Convert `inputs` and `target` to Torch floats
    self.inputs = from_numpy(self.inputs).float()
    self.target = from_numpy(self.target).float()
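A construction sketch; the toy data is illustrative and the indexing shown assumes the usual (input, target) item convention of a torch Dataset.

import numpy as np
import pandas as pd
from fedbiomed.common.data import TabularDataset

features = pd.DataFrame({'age': [34, 51, 27], 'bmi': [22.5, 30.1, 25.3]})
labels = np.array([0, 1, 0])

dataset = TabularDataset(inputs=features, target=labels)
x, y = dataset[0]   # assumed (input, target) float tensors for one sample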
TorchDataManager
TorchDataManager(dataset, **kwargs)
Bases: object
Wrapper for PyTorch Dataset to manage loading operations for validation and training.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | Dataset | Dataset object for torch.utils.data.DataLoader | required |
**kwargs | dict | Arguments for PyTorch DataLoader | {} |
Raises:
Type | Description |
---|---|
FedbiomedTorchDataManagerError | If the argument `dataset` is not an instance of `torch.utils.data.Dataset` |
Source code in fedbiomed/common/data/_torch_data_manager.py
def __init__(self, dataset: Dataset, **kwargs: dict):
    """Constructor of class

    Args:
        dataset: Dataset object for torch.utils.data.DataLoader
        **kwargs: Arguments for PyTorch `DataLoader`

    Raises:
        FedbiomedTorchDataManagerError: If the argument `dataset` is not an instance of `torch.utils.data.Dataset`
    """
    # TorchDataManager should get `dataset` argument as an instance of torch.utils.data.Dataset
    if not isinstance(dataset, Dataset):
        raise FedbiomedTorchDataManagerError(
            f"{ErrorNumbers.FB608.value}: The attribute `dataset` should be an instance "
            f"of `torch.utils.data.Dataset`, please use `Dataset` as parent class for "
            f"your custom torch dataset object")

    self._dataset = dataset
    self._loader_arguments = kwargs
    self._subset_test: Union[Subset, None] = None
    self._subset_train: Union[Subset, None] = None
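A construction sketch using a toy TensorDataset; the keyword arguments are illustrative and are simply stored as loader arguments for the torch DataLoader instances built later.

import torch
from torch.utils.data import TensorDataset
from fedbiomed.common.data import TorchDataManager

dataset = TensorDataset(torch.randn(100, 5), torch.randint(0, 2, (100,)))
data_manager = TorchDataManager(dataset, batch_size=16, shuffle=True)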
Functions
discover_flamby_datasets
discover_flamby_datasets()
Automatically discover the available Flamby datasets based on the contents of the flamby.datasets module.
Returns:
Type | Description |
---|---|
Dict[int, str] | a dictionary {index: dataset_name} where index is an int and dataset_name is the name of a flamby module corresponding to a dataset, represented as str. To import said module one must prepend with the correct path: `import flamby.datasets.dataset_name`. |
Source code in fedbiomed/common/data/_flamby_dataset.py
def discover_flamby_datasets() -> Dict[int, str]:
    """Automatically discover the available Flamby datasets based on the contents of the flamby.datasets module.

    Returns:
        a dictionary {index: dataset_name} where index is an int and dataset_name is the name of a flamby module
        corresponding to a dataset, represented as str. To import said module one must prepend with the correct
        path: `import flamby.datasets.dataset_name`.
    """
    dataset_list = [name for _, name, ispkg in pkgutil.iter_modules(flamby_datasets_module.__path__) if ispkg]
    return {i: name for i, name in enumerate(dataset_list)}
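A usage sketch (requires the flamby package to be installed; the listed names depend on your flamby version):

from fedbiomed.common.data import discover_flamby_datasets

available = discover_flamby_datasets()
for index, name in available.items():
    print(index, name)   # e.g. 0 fed_ixi
    # full module path to import: flamby.datasets.<name>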