Data

Classes that simplify imports from fedbiomed.common.data

Classes

DataLoadingBlock

DataLoadingBlock()

Bases: ABC

The building blocks of a DataLoadingPlan.

A DataLoadingBlock describes an intermediary layer between the researcher and the node's filesystem. It allows the node to specify a customization in the way data is "perceived" by the data loaders during training.

A DataLoadingBlock is identified by its type_id attribute. Thus, this attribute should be unique among all DataLoadingBlockTypes in the same DataLoadingPlan. Moreover, we may test equality between a DataLoadingBlock and a string by checking its type_id, as a means of easily testing whether a DataLoadingBlock is contained in a collection.

Correct usage of this class requires creating ad-hoc subclasses. The DataLoadingBlock class is not intended to be instantiated directly.

Subclasses of DataLoadingBlock must respect the following conditions:

  1. implement a default constructor
  2. the implemented constructor must call super().__init__()
  3. extend the serialize(self) and the deserialize(self, load_from: dict) functions
  4. both serialize and deserialize must call super's serialize and deserialize respectively
  5. the deserialize function must always return self
  6. the serialize function must update the dict returned by super's serialize
  7. implement an apply function that takes arbitrary arguments and applies the logic of the loading_block
  8. update the _validation_scheme to define rules for all new fields returned by the serialize function
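
A minimal sketch of a hypothetical subclass that respects these conditions (the field name my_values and its validation rule are illustrative only):

    from fedbiomed.common.data import DataLoadingBlock

    class MyValuesBlock(DataLoadingBlock):
        def __init__(self):
            super().__init__()                                    # condition 2
            self.my_values = {}
            self._serialization_validator.update_validation_scheme({
                'my_values': {'rules': [dict], 'required': True}  # condition 8
            })

        def serialize(self) -> dict:
            ret = super().serialize()                             # condition 4
            ret.update({'my_values': self.my_values})             # condition 6
            return ret

        def deserialize(self, load_from: dict) -> 'MyValuesBlock':
            super().deserialize(load_from)                        # condition 4
            self.my_values = load_from['my_values']
            return self                                           # condition 5

        def apply(self, key):
            return self.my_values.get(key)                        # condition 7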

Attributes:

Name Type Description
__serialization_id

(str) identifies one serialized instance of the DataLoadingBlock

Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self):
    self.__serialization_id = 'serialized_dlb_' + str(uuid.uuid4())
    self._serialization_validator = SerializationValidation()
    self._serialization_validator.update_validation_scheme(SerializationValidation.dlb_default_scheme())

Functions

DataLoadingPlan

DataLoadingPlan(*args, **kwargs)

Bases: Dict[DataLoadingBlockTypes, DataLoadingBlock]

Customizations to the way the data is loaded and presented for training.

A DataLoadingPlan is a dictionary of {name: DataLoadingBlock} pairs. Each DataLoadingBlock represents a customization to the way data is loaded and presented to the researcher. These customizations are defined by the node, but they operate on a Dataset class, which is defined by the library and instantiated by the researcher.

To exploit this functionality, a Dataset must be modified to accept the customizations provided by the DataLoadingPlan. To simplify this process, we provide the DataLoadingPlanMixin class below.

The DataLoadingPlan class should be instantiated directly; no subclassing is needed. The DataLoadingPlan is a dict, and exposes the same interface as a dict.
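
A brief, hypothetical usage sketch. The custom block type enum and mapping values are illustrative, and the import path of DataLoadingBlockTypes is assumed:

    from enum import Enum
    from fedbiomed.common.constants import DataLoadingBlockTypes  # assumed import path
    from fedbiomed.common.data import DataLoadingPlan, MapperBlock

    class MyLoadingBlockTypes(DataLoadingBlockTypes, Enum):
        MODALITIES_TO_FOLDERS = 'modalities_to_folders'

    block = MapperBlock()
    block.map = {'T1': 'T1_weighted'}  # e.g. map a logical modality name to a folder name

    dlp = DataLoadingPlan({MyLoadingBlockTypes.MODALITIES_TO_FOLDERS: block})
    dlp.desc = 'Map logical modality names to on-disk folder names'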

Attributes:

Name Type Description
dlp_id

str representing a unique plan id (auto-generated)

desc

str representing an optional user-friendly short description

target_dataset_type

a DatasetTypes enum representing the type of dataset targeted by this DataLoadingPlan

Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self, *args, **kwargs):
    super(DataLoadingPlan, self).__init__(*args, **kwargs)
    self.dlp_id = 'dlp_' + str(uuid.uuid4())
    self.desc = ""
    self.target_dataset_type = DatasetTypes.NONE
    self._serialization_validation = SerializationValidation()
    self._serialization_validation.update_validation_scheme(SerializationValidation.dlp_default_scheme())

Attributes

Functions

DataLoadingPlanMixin

DataLoadingPlanMixin()

Utility class to enable DLP functionality in a dataset.

Any Dataset class that inherits from [DataLoadingPlanMixin] will have the basic tools necessary to support a DataLoadingPlan. Typically, the logic of each specific DataLoadingBlock in the DataLoadingPlan will be implemented in the form of hooks that are called within the Dataset's implementation using the helper function apply_dlb defined below.
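
A hedged sketch of what such a hook might look like, reusing the illustrative MyLoadingBlockTypes enum from the DataLoadingPlan example above. The apply_dlb signature (a default return value, then the block type key, then the block's arguments) is an assumption and may differ from the actual API:

    from torch.utils.data import Dataset
    from fedbiomed.common.data import DataLoadingPlanMixin

    class MyDataset(DataLoadingPlanMixin, Dataset):
        def __init__(self, root):
            super().__init__()
            self.root = root

        def _modality_folder(self, modality: str) -> str:
            # If a DataLoadingPlan was set on this dataset, let the mapped block
            # translate the modality name; otherwise fall back to the name itself.
            return self.apply_dlb(modality, MyLoadingBlockTypes.MODALITIES_TO_FOLDERS, modality)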

Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self):
    self._dlp = None

Functions

DataManager

DataManager(dataset, target=None, **kwargs)

Bases: object

Factory class that builds different data loaders/datasets based on the type of dataset. The dataset argument should be provided as a torch.utils.data.Dataset object to be used in PyTorch training.
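
For instance (a hedged sketch; values are illustrative), a training plan might build a DataManager from numpy arrays:

    import numpy as np
    from fedbiomed.common.data import DataManager

    X = np.random.rand(100, 5)         # features
    y = np.random.randint(0, 2, 100)   # targets
    # Extra keyword arguments are stored and later passed to the data loader.
    data_manager = DataManager(dataset=X, target=y, batch_size=16, shuffle=True)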

Parameters:

Name Type Description Default
dataset Union[ndarray, DataFrame, Series, Dataset]

Dataset object. It can be a numpy/pandas instance, a PyTorch Dataset, or a tuple.

required
target Union[ndarray, DataFrame, Series]

Target variable or variables.

None
**kwargs dict

Additional parameters that are going to be used for data loader

{}
Source code in fedbiomed/common/data/_data_manager.py
def __init__(self,
             dataset: Union[np.ndarray, pd.DataFrame, pd.Series, Dataset],
             target: Union[np.ndarray, pd.DataFrame, pd.Series] = None,
             **kwargs: dict) -> None:

    """Constructor of DataManager,

    Args:
        dataset: Dataset object. It can be an instance, PyTorch Dataset or Tuple.
        target: Target variable or variables.
        **kwargs: Additional parameters that are going to be used for data loader
    """

    # TODO: Improve datamanager for auto loading by given dataset_path and other information
    # such as inputs variable indexes and target variables indexes

    self._dataset = dataset
    self._target = target
    self._loader_arguments: Dict = kwargs
    self._data_manager_instance = None

Functions

FlambyDataset

FlambyDataset()

Bases: DataLoadingPlanMixin, Dataset

A federated Flamby dataset.

A FlambyDataset is a wrapper around a flamby FedClass instance, adding functionalities and interfaces that are specific to Fed-BioMed.

A FlambyDataset is always created in an empty state, and it requires a DataLoadingPlan to be finalized to a correct state. The DataLoadingPlan must contain at least the following DataLoadingBlock key-value pair:

- FlambyLoadingBlockTypes.FLAMBY_DATASET_METADATA : FlambyDatasetMetadataBlock

The lifecycle of the DataLoadingPlan and the wrapped FedClass are tightly interlinked: when the DataLoadingPlan is set, the wrapped FedClass is initialized and instantiated. When the DataLoadingPlan is cleared, the wrapped FedClass is also cleared. Hence, an invariant of this class is that self._dlp and self.__flamby_fed_class should always be either both None or both set to some value.
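
A hedged sketch of finalizing a FlambyDataset. The set_dlp method name is assumed from the DataLoadingPlanMixin interface, and the dataset name and center id are illustrative values:

    from fedbiomed.common.data import (DataLoadingPlan, FlambyDataset,
                                       FlambyDatasetMetadataBlock, FlambyLoadingBlockTypes)

    metadata = FlambyDatasetMetadataBlock()
    metadata.metadata = {"flamby_dataset_name": "fed_heart_disease",  # illustrative
                         "flamby_center_id": 0}

    dlp = DataLoadingPlan({FlambyLoadingBlockTypes.FLAMBY_DATASET_METADATA: metadata})

    dataset = FlambyDataset()
    dataset.set_dlp(dlp)  # assumed mixin method: initializes the wrapped FedClass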

Attributes:

Name Type Description
_transform

a transform function of type MonaiTransform or TorchTransform that will be applied to every sample when data is loaded.

__flamby_fed_class

a private instance of the wrapped Flamby FedClass

Source code in fedbiomed/common/data/_flamby_dataset.py
def __init__(self):
    super().__init__()
    self.__flamby_fed_class = None
    self._transform = None

Functions

FlambyDatasetMetadataBlock

FlambyDatasetMetadataBlock()

Bases: DataLoadingBlock

Metadata about a Flamby Dataset.

Includes information on:

- the identity of the type of flamby dataset (e.g. fed_ixi, fed_heart, etc.)
- the ID of the center of the flamby dataset

Source code in fedbiomed/common/data/_flamby_dataset.py
def __init__(self):
    super().__init__()
    self.metadata = {
        "flamby_dataset_name": None,
        "flamby_center_id": None
    }
    self._serialization_validator.update_validation_scheme(
        FlambyDatasetMetadataBlock._extra_validation_scheme())

Attributes

Functions

FlambyLoadingBlockTypes

Bases: DataLoadingBlockTypes, Enum

Additional DataLoadingBlockTypes specific to Flamby data

Attributes

MapperBlock

MapperBlock()

Bases: DataLoadingBlock

A DataLoadingBlock for mapping values.

This DataLoadingBlock can be used whenever an "indirect mapping" is needed. For example, it can be used to implement a correspondence between a set of "logical" abstract names and a set of folder names on the filesystem.

The apply function of this DataLoadingBlock takes a "key" as input (a str) and returns the mapped value corresponding to map[key]. Note that while the constructor of this class sets a value for type_id, developers are recommended to set a more meaningful value that better speaks to their application.

Multiple instances of this loading_block may be used in the same DataLoadingPlan, provided that they are given different type_id via the constructor.
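
A short illustrative example (the mapping values are hypothetical):

    from fedbiomed.common.data import MapperBlock

    block = MapperBlock()
    block.map = {'T1': 'T1_weighted_images', 'label': 'segmentations'}
    block.apply('T1')  # returns 'T1_weighted_images', i.e. map['T1']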

Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self):
    super(MapperBlock, self).__init__()
    self.map = {}
    self._serialization_validator.update_validation_scheme(MapperBlock._extra_validation_scheme())

Attributes

Functions

MedicalFolderBase

MedicalFolderBase(root=None)

Bases: DataLoadingPlanMixin

Controller class for Medical Folder dataset.

Contains methods to validate the MedicalFolder folder hierarchy and extract folder-based metadata such as modalities, number of subjects, etc.

Parameters:

Name Type Description Default
root Union[str, Path, None]

path to Medical Folder root folder.

None
Source code in fedbiomed/common/data/_medical_datasets.py
def __init__(self, root: Union[str, Path, None] = None):
    """Constructs MedicalFolderBase

    Args:
        root: path to Medical Folder root folder.
    """
    super(MedicalFolderBase, self).__init__()

    if root is not None:
        root = self.validate_MedicalFolder_root_folder(root)

    self._root = root

Attributes

Functions

MedicalFolderController

MedicalFolderController(root=None)

Bases: MedicalFolderBase

Utility class to construct and verify Medical Folder datasets without knowledge of the experiment.

The purpose of this class is to enable key functionalities related to the MedicalFolderDataset at the time of dataset deployment, i.e. when the data is being added to the node's database.

Specifically, the MedicalFolderController class can be used to:

- construct a MedicalFolderDataset with all available data modalities, without knowing which ones will be used as targets or features during an experiment
- validate that the proper folder structure has been respected by the data managers preparing the data
- identify which subjects have which modalities

Parameters:

Name Type Description Default
root str

Folder path to dataset. Defaults to None.

None
Source code in fedbiomed/common/data/_medical_datasets.py
def __init__(self, root: str = None):
    """Constructs MedicalFolderController

    Args:
        root: Folder path to dataset. Defaults to None.
    """
    super(MedicalFolderController, self).__init__(root=root)

Functions

MedicalFolderDataset

MedicalFolderDataset(root, data_modalities='T1', transform=None, target_modalities='label', target_transform=None, demographics_transform=None, tabular_file=None, index_col=None)

Bases: Dataset, MedicalFolderBase

Torch dataset following the Medical Folder Structure.

The Medical Folder structure is loosely inspired by the BIDS standard [1]. It should respect the following pattern:

└─ MedicalFolder_root/
    └─ demographics.csv
    └─ sub-01/
        ├─ T1/
        │  └─ sub-01_xxx.nii.gz
        └─ T2/
            ├─ sub-01_xxx.nii.gz
where the first-level subfolders of the root correspond to the subjects, and each subject's folder contains subfolders for each imaging modality. Images should be in Nifti format, with either the .nii or .nii.gz extensions. Finally, within the root folder there should also be a demographics file containing at least one index column with the names of the subject folders. This column will be used to explore the data and load the images. The demographics file may contain additional information about each subject and will be loaded alongside the images by our framework.

[1] https://bids.neuroimaging.io/
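
A hedged construction sketch assuming a layout like the one above; paths, modality names, and the index column name are illustrative:

    from fedbiomed.common.data import MedicalFolderDataset

    dataset = MedicalFolderDataset(
        root='/data/MedicalFolder_root',
        data_modalities=['T1', 'T2'],
        target_modalities='label',  # assumes a 'label' modality folder exists for each subject
        tabular_file='/data/MedicalFolder_root/demographics.csv',
        index_col='subject_folder_name',
    )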

Parameters:

Name Type Description Default
root Union[str, PathLike, Path]

Root folder containing all the subject directories.

required
data_modalities (str, Iterable)

Modality or modalities to be used as data sources.

'T1'
transform Union[Callable, Dict[str, Callable]]

A function or dict of function transform(s) that preprocess each data source.

None
target_modalities Optional[Union[str, Iterable[str]]]

(str, Iterable): Modality or modalities to be used as target sources.

'label'
target_transform Union[Callable, Dict[str, Callable]]

A function or dict of function transform(s) that preprocess each target source.

None
demographics_transform Optional[Callable]

TODO

None
tabular_file Union[str, PathLike, Path, None]

Path to a CSV or Excel file containing the demographic information from the patients.

None
index_col Union[int, str, None]

Column name in the tabular file containing the subject ids which must match the folder names.

None
Source code in fedbiomed/common/data/_medical_datasets.py
def __init__(self,
             root: Union[str, PathLike, Path],
             data_modalities: Optional[Union[str, Iterable[str]]] = 'T1',
             transform: Union[Callable, Dict[str, Callable]] = None,
             target_modalities: Optional[Union[str, Iterable[str]]] = 'label',
             target_transform: Union[Callable, Dict[str, Callable]] = None,
             demographics_transform: Optional[Callable] = None,
             tabular_file: Union[str, PathLike, Path, None] = None,
             index_col: Union[int, str, None] = None,
             ):
    """Constructor for class `MedicalFolderDataset`.

    Args:
        root: Root folder containing all the subject directories.
        data_modalities (str, Iterable): Modality or modalities to be used as data sources.
        transform: A function or dict of function transform(s) that preprocess each data source.
        target_modalities: (str, Iterable): Modality or modalities to be used as target sources.
        target_transform: A function or dict of function transform(s) that preprocess each target source.
        demographics_transform: TODO
        tabular_file: Path to a CSV or Excel file containing the demographic information from the patients.
        index_col: Column name in the tabular file containing the subject ids which mush match the folder names.
    """
    super(MedicalFolderDataset, self).__init__(root=root)

    self._tabular_file = tabular_file
    self._index_col = index_col

    self._data_modalities = [data_modalities] if isinstance(data_modalities, str) else data_modalities
    self._target_modalities = [target_modalities] if isinstance(target_modalities, str) else target_modalities

    self._transform = self._check_and_reformat_transforms(transform, data_modalities)
    self._target_transform = self._check_and_reformat_transforms(target_transform, target_modalities)
    self._demographics_transform = demographics_transform if demographics_transform is not None else lambda x: {}

    # Image loader
    self._reader = Compose([
        LoadImage(ITKReader(), image_only=True),
        ToTensor()
    ])

Attributes

Functions

MedicalFolderLoadingBlockTypes

Bases: DataLoadingBlockTypes, Enum

Attributes

NIFTIFolderDataset

NIFTIFolderDataset(root, transform=None, target_transform=None)

Bases: Dataset

A generic class for loading NIFTI images using the folder structure as the target classes' labels.

Supported formats:

- NIFTI and compressed NIFTI files: .nii, .nii.gz

This is a Dataset useful in classification tasks. Its usage is simple and quite similar to torchvision.datasets.ImageFolder. Images must be contained in first-level sub-folders (level 2+ sub-folders are ignored) that describe the target class they belong to (the target class label is the name of the folder).

nifti_dataset_root_folder
├── control_group
│   ├── subject_1.nii
│   └── subject_2.nii
│   └── ...
└── disease_group
    ├── subject_3.nii
    └── subject_4.nii
    └── ...

In this example, there are 4 samples (one from each *.nii file) and 2 target classes, with labels control_group and disease_group. subject_1.nii has class label control_group, subject_3.nii has class label disease_group, etc.
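
A brief usage sketch; the path is illustrative, and the (image, target index) item format is assumed from the description of transform and target_transform:

    from torch.utils.data import DataLoader
    from fedbiomed.common.data import NIFTIFolderDataset

    dataset = NIFTIFolderDataset('/data/nifti_dataset_root_folder')
    image, target = dataset[0]  # assumed: (image tensor, integer class index)
    loader = DataLoader(dataset, batch_size=2, shuffle=True)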

Parameters:

Name Type Description Default
root Union[str, PathLike, Path]

folder where the data is located.

required
transform Union[Callable, None]

transforms to be applied on data.

None
target_transform Union[Callable, None]

transforms to be applied on target indexes.

None

Raises:

Type Description
FedbiomedDatasetError

bad argument type

FedbiomedDatasetError

bad root path

Source code in fedbiomed/common/data/_medical_datasets.py
def __init__(self, root: Union[str, PathLike, Path],
             transform: Union[Callable, None] = None,
             target_transform: Union[Callable, None] = None
             ):
    """Constructor of the class

    Args:
        root: folder where the data is located.
        transform: transforms to be applied on data.
        target_transform: transforms to be applied on target indexes.

    Raises:
        FedbiomedDatasetError: bad argument type
        FedbiomedDatasetError: bad root path
    """
    # check parameters type
    for tr, trname in ((transform, 'transform'), (target_transform, 'target_transform')):
        if not callable(tr) and tr is not None:
            raise FedbiomedDatasetError(f"{ErrorNumbers.FB612.value}: Parameter {trname} has incorrect "
                                        f"type {type(tr)}, cannot create dataset.")

    if not isinstance(root, str) and not isinstance(root, PathLike) and not isinstance(root, Path):
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB612.value}: Parameter `root` has incorrect type "
                                    f"{type(root)}, cannot create dataset.")

    # initialize object variables
    self._files = []
    self._class_labels = []
    self._targets = []

    try:
        self._root_dir = Path(root).expanduser()
    except RuntimeError as e:
        raise FedbiomedDatasetError(
            f"{ErrorNumbers.FB612.value}: Cannot expand path {root}, error message is: {e}")

    self._transform = transform
    self._target_transform = target_transform
    self._reader = Compose([
        LoadImage(ITKReader(), image_only=True),
        ToTensor()
    ])

    self._explore_root_folder()

Functions

NPDataLoader

NPDataLoader(dataset, target, batch_size=1, shuffle=False, random_seed=None, drop_last=False)

DataLoader for a Numpy dataset.

This data loader encapsulates a dataset composed of numpy arrays and presents an Iterable interface. One design principle was to try to make the interface as similar as possible to a torch.DataLoader.
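
A short hedged example, assuming iteration yields (features, target) batches like a torch DataLoader:

    import numpy as np
    from fedbiomed.common.data import NPDataLoader

    loader = NPDataLoader(dataset=np.random.rand(10, 3), target=np.arange(10),
                          batch_size=4, shuffle=True, drop_last=True)
    for features, target in loader:  # assumed batch format
        pass                         # two batches of 4 rows; the last 2 rows are dropped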

Attributes:

Name Type Description
_dataset

(np.ndarray) a 2d array of features

_target

(np.ndarray) an optional array of target values

_batch_size

(int) the number of elements in one batch

_shuffle

(bool) if True, shuffle the data at the beginning of every epoch

_drop_last

(bool) if True, drop the last batch if it does not contain batch_size elements

_rng

(np.random.Generator) the random number generator for shuffling

Parameters:

Name Type Description Default
dataset ndarray

2D Numpy array

required
target ndarray

Numpy array of target values

required
batch_size int

batch size for each iteration

1
shuffle bool

shuffle before iteration

False
random_seed Optional[int]

an optional integer to set the numpy random seed for shuffling. If it equals None, then no attempt will be made to set the random seed.

None
drop_last bool

whether to drop the last batch in case it does not fill the whole batch size

False
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def __init__(self,
             dataset: np.ndarray,
             target: np.ndarray,
             batch_size: int = 1,
             shuffle: bool = False,
             random_seed: Optional[int] = None,
             drop_last: bool = False):
    """Construct numpy data loader

    Args:
        dataset: 2D Numpy array
        target: Numpy array of target values
        batch_size: batch size for each iteration
        shuffle: shuffle before iteration
        random_seed: an optional integer to set the numpy random seed for shuffling. If it equals
            None, then no attempt will be made to set the random seed.
        drop_last: whether to drop the last batch in case it does not fill the whole batch size
    """

    if not isinstance(dataset, np.ndarray) or not isinstance(target, np.ndarray):
        msg = f"{ErrorNumbers.FB609.value}. Wrong input type for `dataset` or `target` in NPDataLoader. " \
              f"Expected type np.ndarray for both, instead got {type(dataset)} and" \
              f"{type(target)} respectively."
        logger.error(msg)
        raise FedbiomedTypeError(msg)

    # If the researcher gave a 1-dimensional dataset, we expand it to 2 dimensions
    if dataset.ndim == 1:
        dataset = dataset[:, np.newaxis]

    # If the researcher gave a 1-dimensional target, we expand it to 2 dimensions
    if target.ndim == 1:
        target = target[:, np.newaxis]

    if dataset.ndim != 2 or target.ndim != 2:
        msg = f"{ErrorNumbers.FB609.value}. Wrong shape for `dataset` or `target` in NPDataLoader. " \
              f"Expected 2-dimensional arrays, instead got {dataset.ndim}-dimensional " \
              f"and {target.ndim}-dimensional arrays respectively."
        logger.error(msg)
        raise FedbiomedValueError(msg)

    if len(dataset) != len(target):
        msg = f"{ErrorNumbers.FB609.value}. Inconsistent length for `dataset` and `target` in NPDataLoader. " \
              f"Expected same length, instead got len(dataset)={len(dataset)}, len(target)={len(target)}"
        logger.error(msg)
        raise FedbiomedValueError(msg)

    if not isinstance(batch_size, int):
        msg = f"{ErrorNumbers.FB609.value}. Wrong type for `batch_size` parameter of NPDataLoader. Expected a " \
              f"non-zero positive integer, instead got type {type(batch_size)}."
        logger.error(msg)
        raise FedbiomedTypeError(msg)

    if batch_size <= 0:
        msg = f"{ErrorNumbers.FB609.value}. Wrong value for `batch_size` parameter of NPDataLoader. Expected a " \
              f"non-zero positive integer, instead got value {batch_size}."
        logger.error(msg)
        raise FedbiomedValueError(msg)

    if not isinstance(shuffle, bool):
        msg = f"{ErrorNumbers.FB609.value}. Wrong type for `shuffle` parameter of NPDataLoader. Expected `bool`, " \
              f"instead got {type(shuffle)}."
        logger.error(msg)
        raise FedbiomedTypeError(msg)

    if not isinstance(drop_last, bool):
        msg = f"{ErrorNumbers.FB609.value}. Wrong type for `drop_last` parameter of NPDataLoader. " \
              f"Expected `bool`, instead got {type(drop_last)}."
        logger.error(msg)
        raise FedbiomedTypeError(msg)

    if random_seed is not None and not isinstance(random_seed, int):
        msg = f"{ErrorNumbers.FB609.value}. Wrong type for `random_seed` parameter of NPDataLoader. " \
              f"Expected int or None, instead got {type(random_seed)}."
        logger.error(msg)
        raise FedbiomedTypeError(msg)

    self._dataset = dataset
    self._target = target
    self._batch_size = batch_size
    self._shuffle = shuffle
    self._drop_last = drop_last
    self._rng = np.random.default_rng(random_seed)

Attributes

Functions

SerializationValidation

SerializationValidation()

Provide Validation capabilities for serializing/deserializing a [DataLoadingBlock] or [DataLoadingPlan].

When a developer inherits from [DataLoadingBlock] to define a custom loading block, they are required to call the _serialization_validator.update_validation_scheme function with a dictionary argument containing the rules to validate all the additional fields that will be used in the serialization of their loading block.

These rules must follow the syntax explained in the SchemeValidator class.

For example

    class MyLoadingBlock(DataLoadingBlock):
        def __init__(self):
            super().__init__()  # required so that self._serialization_validator exists
            self.my_custom_data = {}
            self._serialization_validator.update_validation_scheme({
                'custom_data': {
                    'rules': [dict, ...any other rules],
                    'required': True
                }
            })
        def serialize(self):
            serialized = super().serialize()
            serialized.update({'custom_data': self.my_custom_data})
            return serialized

Attributes:

Name Type Description
_validation_scheme

(dict) an extensible set of rules to validate the DataLoadingBlock metadata.

Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self):
    self._validation_scheme = {}

Functions

SkLearnDataManager

SkLearnDataManager(inputs, target, **kwargs)

Bases: object

Wrapper for pd.DataFrame, pd.Series and np.ndarray datasets.

Manages datasets for scikit-learn based model training. Responsible for managing inputs and target variables that have been provided in training_data of scikit-learn based training plans.

The loader arguments will be passed to the [fedbiomed.common.data.NPDataLoader] classes instantiated when split is called. They may include batch_size, shuffle, drop_last, and others. Please see the [fedbiomed.common.data.NPDataLoader] class for more details.
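
A hedged construction sketch (only construction is shown, since split is not documented here):

    import numpy as np
    from fedbiomed.common.data import SkLearnDataManager

    dm = SkLearnDataManager(inputs=np.random.rand(50, 4),
                            target=np.zeros(50),
                            batch_size=8, shuffle=True)  # loader arguments, forwarded to NPDataLoader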

Parameters:

Name Type Description Default
inputs Union[ndarray, DataFrame, Series]

Independent variables (inputs, features) for model training

required
target Union[ndarray, DataFrame, Series]

Dependent variable/s (target) for model training and validation

required
**kwargs dict

Loader arguments

{}
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def __init__(self,
             inputs: Union[np.ndarray, pd.DataFrame, pd.Series],
             target: Union[np.ndarray, pd.DataFrame, pd.Series],
             **kwargs: dict):

    """ Construct a SkLearnDataManager from an array of inputs and an array of targets.

    The loader arguments will be passed to the [fedbiomed.common.data.NPDataLoader] classes instantiated
    when split is called. They may include batch_size, shuffle, drop_last, and others. Please see the
    [fedbiomed.common.data.NPDataLoader] class for more details.

    Args:
        inputs: Independent variables (inputs, features) for model training
        target: Dependent variable/s (target) for model training and validation
        **kwargs: Loader arguments
    """

    if not isinstance(inputs, (np.ndarray, pd.DataFrame, pd.Series)) or \
            not isinstance(target, (np.ndarray, pd.DataFrame, pd.Series)):
        msg = f"{ErrorNumbers.FB609.value}. Parameters `inputs` and `target` for " \
              f"initialization of {self.__class__.__name__} should be one of np.ndarray, pd.DataFrame, pd.Series"
        logger.error(msg)
        raise FedbiomedTypeError(msg)

    # Convert pd.DataFrame or pd.Series to np.ndarray for `inputs`
    if isinstance(inputs, (pd.DataFrame, pd.Series)):
        self._inputs = inputs.to_numpy()
    else:
        self._inputs = inputs

    # Convert pd.DataFrame or pd.Series to np.ndarray for `target`
    if isinstance(target, (pd.DataFrame, pd.Series)):
        self._target = target.to_numpy()
    else:
        self._target = target

    # Additional loader arguments
    self._loader_arguments = kwargs

    # Subset None means that train/validation split has not been performed
    self._subset_test: Union[Tuple[np.ndarray, np.ndarray], None] = None
    self._subset_train: Union[Tuple[np.ndarray, np.ndarray], None] = None

Functions

TabularDataset

TabularDataset(inputs, target)

Bases: Dataset

Torch-based Dataset object to create a torch Dataset from given numpy or dataframe-type input and target variables.
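
A brief illustrative example:

    import numpy as np
    from fedbiomed.common.data import TabularDataset

    ds = TabularDataset(inputs=np.random.rand(20, 3),
                        target=np.random.rand(20, 1))  # both converted internally to float tensors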

Parameters:

Name Type Description Default
inputs Union[ndarray, DataFrame, Series]

Input variables that will be passed to network

required
target Union[ndarray, DataFrame, Series]

Target variable for output layer

required

Raises:

Type Description
FedbiomedTorchDatasetError

If input variables and target variable do not have equal length/size

Source code in fedbiomed/common/data/_tabular_dataset.py
def __init__(self,
             inputs: Union[np.ndarray, pd.DataFrame, pd.Series],
             target: Union[np.ndarray, pd.DataFrame, pd.Series]):
    """Constructs PyTorch dataset object

    Args:
        inputs: Input variables that will be passed to network
        target: Target variable for output layer

    Raises:
        FedbiomedTorchDatasetError: If input variables and target variable does not have
            equal length/size
    """

    # Inputs and target variable should be converted to the torch tensors
    # PyTorch provides `from_numpy` function to convert numpy arrays to
    # torch tensor. Therefore, if the arguments `inputs` and `target` are
    # instance one of `pd.DataFrame` or `pd.Series`, they should be converted to
    # numpy arrays
    if isinstance(inputs, (pd.DataFrame, pd.Series)):
        self.inputs = inputs.to_numpy()
    elif isinstance(inputs, np.ndarray):
        self.inputs = inputs
    else:
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB610.value}: The argument `inputs` should be "
                                                f"an instance one of np.ndarray, pd.DataFrame or pd.Series")
    # Configuring self.target attribute
    if isinstance(target, (pd.DataFrame, pd.Series)):
        self.target = target.to_numpy()
    elif isinstance(target, np.ndarray):
        self.target = target
    else:
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB610.value}: The argument `target` should be "
                                                f"an instance one of np.ndarray, pd.DataFrame or pd.Series")

    # The lengths should be equal
    if len(self.inputs) != len(self.target):
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB610.value}: Length of input variables and target "
                                                f"variable does not match. Please make sure that they have "
                                                f"equal size while creating the method `training_data` of "
                                                f"TrainingPlan")

    # Convert `inputs` and `target` to Torch floats
    self.inputs = from_numpy(self.inputs).float()
    self.target = from_numpy(self.target).float()

Attributes

Functions

TorchDataManager

TorchDataManager(dataset, **kwargs)

Bases: object

Wrapper for PyTorch Dataset to manage loading operations for validation and train.
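
A hedged construction sketch; the DataLoader keyword arguments are illustrative:

    import torch
    from torch.utils.data import TensorDataset
    from fedbiomed.common.data import TorchDataManager

    ds = TensorDataset(torch.rand(32, 4), torch.rand(32, 1))
    dm = TorchDataManager(ds, batch_size=8, shuffle=True)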

Parameters:

Name Type Description Default
dataset Dataset

Dataset object for torch.utils.data.DataLoader

required
**kwargs dict

Arguments for PyTorch DataLoader

{}

Raises:

Type Description
FedbiomedTorchDataManagerError

If the argument dataset is not an instance of torch.utils.data.Dataset

Source code in fedbiomed/common/data/_torch_data_manager.py
def __init__(self, dataset: Dataset, **kwargs: dict):
    """Construct  of class

    Args:
        dataset: Dataset object for torch.utils.data.DataLoader
        **kwargs: Arguments for PyTorch `DataLoader`

    Raises:
        FedbiomedTorchDataManagerError: If the argument `dataset` is not an instance of `torch.utils.data.Dataset`
    """

    # TorchDataManager should get `dataset` argument as an instance of torch.utils.data.Dataset
    if not isinstance(dataset, Dataset):
        raise FedbiomedTorchDataManagerError(
            f"{ErrorNumbers.FB608.value}: The attribute `dataset` should an instance "
            f"of `torch.utils.data.Dataset`, please use `Dataset` as parent class for"
            f"your custom torch dataset object")

    self._dataset = dataset
    self._loader_arguments = kwargs
    self._subset_test: Union[Subset, None] = None
    self._subset_train: Union[Subset, None] = None

Attributes

Functions

Functions

discover_flamby_datasets

discover_flamby_datasets()

Automatically discover the available Flamby datasets based on the contents of the flamby.datasets module.

Returns:

Name Type Description
Dict[int, str]

a dictionary {index: dataset_name} where index is an int and dataset_name is the name of a flamby module corresponding to a dataset, represented as str. To import said module one must prepend with the correct path: import flamby.datasets.dataset_name.

Source code in fedbiomed/common/data/_flamby_dataset.py
def discover_flamby_datasets() -> Dict[int, str]:
    """Automatically discover the available Flamby datasets based on the contents of the flamby.datasets module.

    Returns:
        a dictionary {index: dataset_name} where index is an int and dataset_name is the name of a flamby module
        corresponding to a dataset, represented as str. To import said module one must prepend with the correct
        path: `import flamby.datasets.dataset_name`.

    """
    dataset_list = [name for _, name, ispkg in pkgutil.iter_modules(flamby_datasets_module.__path__) if ispkg]
    return {i: name for i, name in enumerate(dataset_list)}