Data

Classes that simplify imports from fedbiomed.common.data

Classes

DataLoadingBlock

DataLoadingBlock()

Bases: ABC

The building blocks of a DataLoadingPlan.

A DataLoadingBlock describes an intermediary layer between the researcher and the node's filesystem. It allows the node to specify a customization in the way data is "perceived" by the data loaders during training.

A DataLoadingBlock is identified by its type_id attribute. Thus, this attribute should be unique among all DataLoadingBlockTypes in the same DataLoadingPlan. Moreover, we may test equality between a DataLoadingBlock and a string by checking its type_id, as a means of easily testing whether a DataLoadingBlock is contained in a collection.

Correct usage of this class requires creating ad-hoc subclasses. The DataLoadingBlock class is not intended to be instantiated directly.

Subclasses of DataLoadingBlock must respect the following conditions (a minimal sketch follows the list):

  1. implement a default constructor
  2. the implemented constructor must call super().__init__()
  3. extend the serialize(self) and the deserialize(self, load_from: dict) functions
  4. both serialize and deserialize must call super's serialize and deserialize respectively
  5. the deserialize function must always return self
  6. the serialize function must update the dict returned by super's serialize
  7. implement an apply function that takes arbitrary arguments and applies the logic of the loading_block
  8. update the _validation_scheme to define rules for all new fields returned by the serialize function
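
For illustration, here is a minimal sketch of such a subclass. The class name, the prefix field and the apply logic are hypothetical, and the validation-scheme update is left as a placeholder because its exact rule format is defined by SerializationValidation:

from fedbiomed.common.data import DataLoadingBlock

class PrefixBlock(DataLoadingBlock):
    """Hypothetical loading block that prepends a prefix to a value."""

    def __init__(self):
        super().__init__()                       # conditions 1 and 2
        self.prefix = ""                         # new serializable field
        # Condition 8: extend the validation scheme with a rule for 'prefix'.
        # self._serialization_validator.update_validation_scheme({...})

    def serialize(self) -> dict:
        ret = super().serialize()                # condition 4
        ret.update({'prefix': self.prefix})      # condition 6
        return ret

    def deserialize(self, load_from: dict):
        super().deserialize(load_from)           # condition 4
        self.prefix = load_from['prefix']
        return self                              # condition 5

    def apply(self, value: str) -> str:          # condition 7
        return self.prefix + value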

Attributes:

Name Type Description
__serialization_id

(str) identifies one serialized instance of the DataLoadingBlock

Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self):
    self.__serialization_id = 'serialized_dlb_' + str(uuid.uuid4())
    self._serialization_validator = SerializationValidation()
    self._serialization_validator.update_validation_scheme(SerializationValidation.dlb_default_scheme())

Functions

apply abstractmethod
apply(*args, **kwargs)

Abstract method representing an application of the DataLoadingBlock

Source code in fedbiomed/common/data/_data_loading_plan.py
@abstractmethod
def apply(self, *args, **kwargs):
    """Abstract method representing an application of the DataLoadingBlock
    """
    pass
deserialize
deserialize(load_from)

Reconstruct the DataLoadingBlock from a serialized version.

Parameters:

Name Type Description Default
load_from dict

a dictionary as obtained by the serialize function.

required

Returns: the self instance

Source code in fedbiomed/common/data/_data_loading_plan.py
def deserialize(self, load_from: dict) -> TDataLoadingBlock:
    """Reconstruct the DataLoadingBlock from a serialized version.

    Args:
        load_from (dict): a dictionary as obtained by the serialize function.
    Returns:
        the self instance
    """
    self._serialization_validator.validate(load_from, FedbiomedLoadingBlockValueError)
    self.__serialization_id = load_from['dlb_id']
    return self
get_serialization_id
get_serialization_id()

Expose serialization id as read-only

Source code in fedbiomed/common/data/_data_loading_plan.py
def get_serialization_id(self):
    """Expose serialization id as read-only"""
    return self.__serialization_id
instantiate_class staticmethod
instantiate_class(loading_block)

Instantiate one DataLoadingBlock object of the type defined in the arguments.

Uses the loading_block_module and loading_block_class fields of the loading_block argument to identify the type of DataLoadingBlock to be instantiated, then calls its default constructor. Note that this function does not call deserialize.

Parameters:

Name Type Description Default
loading_block dict

DataLoadingBlock metadata in the format returned by the serialize function.

required

Returns: A default-constructed instance of a DataLoadingBlock of the type defined in the metadata.

Raises: FedbiomedLoadingBlockError: if the instantiation process raised any exception.

Source code in fedbiomed/common/data/_data_loading_plan.py
@staticmethod
def instantiate_class(loading_block: dict) -> TDataLoadingBlock:
    """Instantiate one [DataLoadingBlock][fedbiomed.common.data._data_loading_plan.DataLoadingBlock]
    object of the type defined in the arguments.

    Uses the `loading_block_module` and `loading_block_class` fields of the loading_block argument to
    identify the type of [DataLoadingBlock][fedbiomed.common.data._data_loading_plan.DataLoadingBlock]
    to be instantiated, then calls its default constructor.
    Note that this function **does not call deserialize**.

    Args:
        loading_block (dict): [DataLoadingBlock][fedbiomed.common.data._data_loading_plan.DataLoadingBlock]
            metadata in the format returned by the serialize function.
    Returns:
        A default-constructed instance of a
            [DataLoadingBlock][fedbiomed.common.data._data_loading_plan.DataLoadingBlock]
            of the type defined in the metadata.
    Raises:
       FedbiomedLoadingBlockError: if the instantiation process raised any exception.
    """
    try:
        dlb_module = import_module(loading_block['loading_block_module'])
        dlb = eval(f"dlb_module.{loading_block['loading_block_class']}()")
    except Exception as e:
        msg = f"{ErrorNumbers.FB614.value}: could not instantiate DataLoadingBlock from the following metadata: " +\
              f"{loading_block} because of {type(e).__name__}: {e}"
        logger.debug(msg)
        raise FedbiomedLoadingBlockError(msg)
    return dlb
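
A hedged sketch of how this is typically used. The metadata values below are illustrative placeholders, and MapperBlock is documented later in this page:

from fedbiomed.common.data import DataLoadingBlock

# Metadata in the format produced by DataLoadingBlock.serialize()
metadata = {
    'loading_block_module': 'fedbiomed.common.data._data_loading_plan',
    'loading_block_class': 'MapperBlock',
    'dlb_id': 'serialized_dlb_00000000-0000-0000-0000-000000000000',
    'map': {'T1': 'T1_folder'},
}

dlb = DataLoadingBlock.instantiate_class(metadata)  # default-constructed MapperBlock
dlb = dlb.deserialize(metadata)                     # restoring the state is a separate step
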
instantiate_key staticmethod
instantiate_key(key_module, key_classname, loading_block_key_str)

Imports and loads a DataLoadingBlockTypes key based on the passed arguments

Parameters:

Name Type Description Default
key_module str

module path where the loading block key class is defined

required
key_classname str

class name of the loading block key (a DataLoadingBlockTypes subclass)

required
loading_block_key_str str

string value of the loading block key to instantiate

required

Raises:

Type Description
FedbiomedDataLoadingPlanError

if the loading block key could not be imported and instantiated from the given arguments

Returns:

Name Type Description
DataLoadingBlockTypes DataLoadingBlockTypes

the instantiated loading block key

Source code in fedbiomed/common/data/_data_loading_plan.py
@staticmethod
def instantiate_key(key_module: str, key_classname: str, loading_block_key_str: str) -> DataLoadingBlockTypes:
    """Imports and loads [DataLoadingBlockTypes][fedbiomed.common.constants.DataLoadingBlockTypes]
    regarding the passed arguments

    Args:
        key_module (str): _description_
        key_classname (str): _description_
        loading_block_key_str (str): _description_

    Raises:
        FedbiomedDataLoadingPlanError: _description_

    Returns:
        DataLoadingBlockTypes: _description_
    """
    try:
        keys = import_module(key_module)
        loading_block_key = eval(f"keys.{key_classname}('{loading_block_key_str}')")
    except Exception as e:
        msg = f"{ErrorNumbers.FB615.value} Error deserializing loading block key " + \
              f"{loading_block_key_str} with path {key_module}.{key_classname} " + \
              f"because of {type(e).__name__}: {e}"
        logger.debug(msg)
        raise FedbiomedDataLoadingPlanError(msg)
    return loading_block_key
serialize
serialize()

Serializes the class in a format similar to json.

Returns:

Type Description
dict

a dictionary of key-value pairs sufficient for reconstructing the DataLoadingBlock.

Source code in fedbiomed/common/data/_data_loading_plan.py
def serialize(self) -> dict:
    """Serializes the class in a format similar to json.

    Returns:
        a dictionary of key-value pairs sufficient for reconstructing
        the DataLoadingBlock.
    """
    return dict(
        loading_block_class=self.__class__.__qualname__,
        loading_block_module=self.__module__,
        dlb_id=self.__serialization_id
    )

DataLoadingPlan

DataLoadingPlan(*args, **kwargs)

Bases: Dict[DataLoadingBlockTypes, DataLoadingBlock]

Customizations to the way the data is loaded and presented for training.

A DataLoadingPlan is a dictionary of {name: DataLoadingBlock} pairs. Each DataLoadingBlock represents a customization to the way data is loaded and presented to the researcher. These customizations are defined by the node, but they operate on a Dataset class, which is defined by the library and instantiated by the researcher.

To exploit this functionality, a Dataset must be modified to accept the customizations provided by the DataLoadingPlan. To simplify this process, we provide the DataLoadingPlanMixin class below.

The DataLoadingPlan class should be instantiated directly; no subclassing is needed. A DataLoadingPlan is a dict and exposes the same interface as a dict.

Attributes:

Name Type Description
dlp_id

str representing a unique plan id (auto-generated)

desc

str representing an optional user-friendly short description

target_dataset_type

a DatasetTypes enum representing the type of dataset targeted by this DataLoadingPlan

Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self, *args, **kwargs):
    super(DataLoadingPlan, self).__init__(*args, **kwargs)
    self.dlp_id = 'dlp_' + str(uuid.uuid4())
    self.desc = ""
    self.target_dataset_type = DatasetTypes.NONE
    self._serialization_validation = SerializationValidation()
    self._serialization_validation.update_validation_scheme(SerializationValidation.dlp_default_scheme())
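
A hedged construction sketch. The key enum MyLoadingBlockTypes and the mapping are hypothetical; real applications define their own DataLoadingBlockTypes subclass (compare FlambyLoadingBlockTypes below), and the import paths are assumptions:

from enum import Enum
from fedbiomed.common.constants import DataLoadingBlockTypes
from fedbiomed.common.data import DataLoadingPlan, MapperBlock

class MyLoadingBlockTypes(DataLoadingBlockTypes, Enum):
    MODALITY_NAMES = 'modality_names'

mapper = MapperBlock()
mapper.map = {'T1': 'T1_images', 'label': 'segmentations'}

dlp = DataLoadingPlan({MyLoadingBlockTypes.MODALITY_NAMES: mapper})
dlp.desc = "Maps logical modality names to folder names"
dlp[MyLoadingBlockTypes.MODALITY_NAMES].apply('T1')   # 'T1_images'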

Attributes

desc instance-attribute
desc = ''
dlp_id instance-attribute
dlp_id = 'dlp_' + str(uuid4())
target_dataset_type instance-attribute
target_dataset_type = NONE

Functions

deserialize
deserialize(serialized_dlp, serialized_loading_blocks)

Reconstruct the DataLoadingPlan from a serialized version.

Warning: calling this function will clear the contained DataLoadingBlockTypes.

This function may not be used to "update" or "append to" a DataLoadingPlan.

Parameters:

Name Type Description Default
serialized_dlp dict

a dictionary of data loading plan metadata, as obtained from the first output of the serialize function

required
serialized_loading_blocks List[dict]

a list of dictionaries of loading_block metadata, as obtained from the second output of the serialize function

required

Returns: the self instance

Source code in fedbiomed/common/data/_data_loading_plan.py
def deserialize(self, serialized_dlp: dict, serialized_loading_blocks: List[dict]) -> TDataLoadingPlan:
    """Reconstruct the DataLoadingPlan][fedbiomed.common.data._data_loading_plan.DataLoadingPlan] from a serialized version.

    !!! warning "Calling this function will *clear* the contained [DataLoadingBlockTypes]."
        This function may not be used to "update" nor to "append to"
        a [DataLoadingPlan][fedbiomed.common.data._data_loading_plan.DataLoadingPlan].

    Args:
        serialized_dlp: a dictionary of data loading plan metadata, as obtained from the first output of the
            serialize function
        serialized_loading_blocks: a list of dictionaries of loading_block metadata, as obtained from the
            second output of the serialize function
    Returns:
        the self instance
    """
    self._serialization_validation.validate(serialized_dlp, FedbiomedDataLoadingPlanValueError)

    self.clear()
    self.dlp_id = serialized_dlp['dlp_id']
    self.desc = serialized_dlp['dlp_name']
    self.target_dataset_type = DatasetTypes(serialized_dlp['target_dataset_type'])
    for loading_block_key_str, dlb_id in serialized_dlp['loading_blocks'].items():
        key_module, key_classname = serialized_dlp['key_paths'][loading_block_key_str]
        loading_block_key = DataLoadingBlock.instantiate_key(key_module, key_classname, loading_block_key_str)
        loading_block = next(filter(lambda x: x['dlb_id'] == dlb_id,
                                    serialized_loading_blocks))
        dlb = DataLoadingBlock.instantiate_class(loading_block)
        self[loading_block_key] = dlb.deserialize(loading_block)
    return self
infer_dataset_type staticmethod
infer_dataset_type(dataset)

Infer the type of a given dataset.

This function provides the mapping between a dataset's class and the DatasetTypes enum. If the dataset exposes the correct interface (i.e. the get_dataset_type method) then it directly calls that, otherwise it tries to apply some heuristics to guess the type of dataset.

Parameters:

Name Type Description Default
dataset Any

the dataset whose type we want to infer.

required

Returns: a DatasetTypes enum element which identifies the type of the dataset.

Raises: FedbiomedDataLoadingPlanValueError: if the dataset does not have a get_dataset_type method and the type could not be guessed.

Source code in fedbiomed/common/data/_data_loading_plan.py
@staticmethod
def infer_dataset_type(dataset: Any) -> DatasetTypes:
    """Infer the type of a given dataset.

    This function provides the mapping between a dataset's class and the DatasetTypes enum. If the dataset exposes
    the correct interface (i.e. the get_dataset_type method) then it directly calls that, otherwise it tries to
    apply some heuristics to guess the type of dataset.

    Args:
        dataset: the dataset whose type we want to infer.
    Returns:
        a DatasetTypes enum element which identifies the type of the dataset.
    Raises:
        FedbiomedDataLoadingPlanValueError: if the dataset does not have a `get_dataset_type` method and moreover
            the type could not be guessed.
    """
    if hasattr(dataset, 'get_dataset_type'):
        return dataset.get_dataset_type()
    elif dataset.__class__.__name__ == 'ImageFolder':
        # ImageFolder could be both an images type or mednist. Try to identify mednist with some heuristic.
        if hasattr(dataset, 'classes') and \
                all([x in dataset.classes for x in ['AbdomenCT', 'BreastMRI', 'CXR', 'ChestCT', 'Hand', 'HeadCT']]):
            return DatasetTypes.MEDNIST
        else:
            return DatasetTypes.IMAGES
    elif dataset.__class__.__name__ == 'MNIST':
        return DatasetTypes.DEFAULT
    msg = f"{ErrorNumbers.FB615.value} Trying to infer dataset type of {dataset} is not supported " + \
        f"for datasets of type {dataset.__class__.__qualname__}"
    logger.debug(msg)
    raise FedbiomedDataLoadingPlanValueError(msg)
serialize
serialize()

Serializes the class in a format similar to json.

Returns:

Type Description
Tuple[dict, List]

a tuple sufficient for reconstructing the DataLoadingPlan. It includes:

  - a dictionary of key-value pairs with the DataLoadingPlan parameters
  - a list of dicts containing the data for reconstructing all the DataLoadingBlocks of the DataLoadingPlan

Source code in fedbiomed/common/data/_data_loading_plan.py
def serialize(self) -> Tuple[dict, List]:
    """Serializes the class in a format similar to json.

    Returns:
        a tuple sufficient for reconstructing the DataLoading plan. It includes:
            - a dictionary of key-value pairs with the
            [DataLoadingPlan][fedbiomed.common.data._data_loading_plan.DataLoadingPlan] parameters.
            - a list of dict containing the data for reconstruction all the DataLoadingBlock
                of the [DataLoadingPlan][fedbiomed.common.data._data_loading_plan.DataLoadingPlan] 
    """
    return dict(
        dlp_id=self.dlp_id,
        dlp_name=self.desc,
        target_dataset_type=self.target_dataset_type.value,
        loading_blocks={key.value: dlb.get_serialization_id() for key, dlb in self.items()},
        key_paths={key.value: (f"{key.__module__}", f"{key.__class__.__qualname__}") for key in self.keys()}
    ), [dlb.serialize() for dlb in self.values()]
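
A hedged round-trip sketch reusing the dlp and MyLoadingBlockTypes from the construction example above; how the two serialized objects are persisted (for example in the node database) is outside the scope of this sketch:

serialized_dlp, serialized_blocks = dlp.serialize()

# ... store both objects, then later ...

restored = DataLoadingPlan().deserialize(serialized_dlp, serialized_blocks)
assert restored.dlp_id == dlp.dlp_id
assert restored[MyLoadingBlockTypes.MODALITY_NAMES].apply('T1') == 'T1_images'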

DataLoadingPlanMixin

DataLoadingPlanMixin()

Utility class to enable DLP functionality in a dataset.

Any Dataset class that inherits from DataLoadingPlanMixin will have the basic tools necessary to support a DataLoadingPlan. Typically, the logic of each specific DataLoadingBlock in the DataLoadingPlan will be implemented in the form of hooks that are called within the Dataset's implementation using the helper function apply_dlb defined below.

Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self):
    self._dlp = None

Functions

apply_dlb
apply_dlb(default_ret_value, dlb_key, *args, **kwargs)

Apply one DataLoadingBlock identified by its key.

Note that we want to easily support the case where the DataLoadingPlan is not activated, or the requested loading block is not contained in the DataLoadingPlan. This is achieved by providing a default return value to be returned when the above conditions are met. Hence, most of the calls to apply_dlb will look like this:

value = self.apply_dlb(value, 'my-loading-block', my_apply_args)
This will ensure that value is not changed if the DataLoadingPlan is not active.

Parameters:

Name Type Description Default
default_ret_value Any

the value to be returned in case that the dlp functionality is not required

required
dlb_key DataLoadingBlockTypes

the key of the DataLoadingBlock to be applied

required
*args Optional[Any]

forwarded to the DataLoadingBlock's apply function

()
**kwargs Optional[Any]

forwarded to the DataLoadingBlock's apply function

{}

Returns: the output of the DataLoadingBlock's apply function, or the default_ret_value when dlp is None or it does not contain the requested loading block

Source code in fedbiomed/common/data/_data_loading_plan.py
def apply_dlb(self, default_ret_value: Any, dlb_key: DataLoadingBlockTypes,
              *args: Optional[Any], **kwargs: Optional[Any]) -> Any:
    """Apply one DataLoadingBlock identified by its key.

    Note that we want to easily support the case where the DataLoadingPlan
    is not activated, or the requested loading block is not contained in the
    DataLoadingPlan. This is achieved by providing a default return value
    to be returned when the above conditions are met. Hence, most of the
    calls to apply_dlb will look like this:
    ```
    value = self.apply_dlb(value, 'my-loading-block', my_apply_args)
    ```
    This will ensure that value is not changed if the DataLoadingPlan is
    not active.

    Args:
        default_ret_value: the value to be returned in case that the dlp
            functionality is not required
        dlb_key: the key of the DataLoadingBlock to be applied
        *args: forwarded to the DataLoadingBlock's apply function
        **kwargs: forwarded to the DataLoadingBlock's apply function
    Returns:
        the output of the DataLoadingBlock's apply function, or
            the default_ret_value when dlp is None or it does not contain
            the requested loading block
    """
    if not isinstance(dlb_key, DataLoadingBlockTypes):
        raise FedbiomedDataLoadingPlanValueError(f"Key {dlb_key} is not of enum type DataLoadingBlockTypes"
                                                 f" in DataLoadingPlanMixin.apply_dlb")
    if self._dlp is not None and dlb_key in self._dlp:
        return self._dlp[dlb_key].apply(*args, **kwargs)
    else:
        return default_ret_value
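
A hedged sketch of a Dataset exploiting the mixin. The key enum, the dataset content, the choice of DatasetTypes.DEFAULT and the import paths are illustrative assumptions:

from enum import Enum
from torch.utils.data import Dataset
from fedbiomed.common.constants import DataLoadingBlockTypes, DatasetTypes
from fedbiomed.common.data import DataLoadingPlanMixin

class MyKeys(DataLoadingBlockTypes, Enum):
    NAME_MAP = 'name_map'

class MyDataset(DataLoadingPlanMixin, Dataset):
    def __init__(self, names):
        super().__init__()      # initializes the mixin (self._dlp = None)
        self._names = names

    @staticmethod
    def get_dataset_type():
        # Needed so that set_dlp can infer the dataset type (DEFAULT is illustrative).
        return DatasetTypes.DEFAULT

    def __len__(self):
        return len(self._names)

    def __getitem__(self, idx):
        name = self._names[idx]
        # Remapped by a MapperBlock stored under MyKeys.NAME_MAP once a plan is set;
        # returned unchanged otherwise.
        return self.apply_dlb(name, MyKeys.NAME_MAP, name)
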
clear_dlp
clear_dlp()
Source code in fedbiomed/common/data/_data_loading_plan.py
def clear_dlp(self):
    self._dlp = None
set_dlp
set_dlp(dlp)

Sets the dlp if the target dataset type is appropriate

Source code in fedbiomed/common/data/_data_loading_plan.py
def set_dlp(self, dlp: DataLoadingPlan):
    """Sets the dlp if the target dataset type is appropriate"""
    if not isinstance(dlp, DataLoadingPlan):
        msg = f"{ErrorNumbers.FB615.value} Trying to set a DataLoadingPlan but the argument is of type " + \
              f"{type(dlp).__name__}"
        logger.debug(msg)
        raise FedbiomedDataLoadingPlanValueError(msg)

    dataset_type = DataLoadingPlan.infer_dataset_type(self)  # `self` here will refer to the Dataset instance
    if dlp.target_dataset_type != DatasetTypes.NONE and dataset_type != dlp.target_dataset_type:
        raise FedbiomedDataLoadingPlanValueError(f"Trying to set {dlp} on dataset of type {dataset_type.value} but "
                                                 f"the target type is {dlp.target_dataset_type}")
    elif dlp.target_dataset_type == DatasetTypes.NONE:
        dlp.target_dataset_type = dataset_type
    self._dlp = dlp

DataManager

DataManager(dataset, target=None, **kwargs)

Bases: object

Factory class that builds different data loaders/datasets based on the type of dataset. To be used in PyTorch training, the dataset argument should be provided as a torch.utils.data.Dataset object.

Parameters:

Name Type Description Default
dataset Union[ndarray, DataFrame, Series, Dataset]

Dataset object. It can be a PyTorch Dataset instance or an array-like object such as np.ndarray, pd.DataFrame or pd.Series.

required
target Union[ndarray, DataFrame, Series]

Target variable or variables.

None
**kwargs dict

Additional parameters that are going to be used by the data loader

{}
Source code in fedbiomed/common/data/_data_manager.py
def __init__(self,
             dataset: Union[np.ndarray, pd.DataFrame, pd.Series, Dataset],
             target: Union[np.ndarray, pd.DataFrame, pd.Series] = None,
             **kwargs: dict) -> None:

    """Constructor of DataManager,

    Args:
        dataset: Dataset object. It can be an instance, PyTorch Dataset or Tuple.
        target: Target variable or variables.
        **kwargs: Additional parameters that are going to be used for data loader
    """

    # TODO: Improve datamanager for auto loading by given dataset_path and other information
    # such as inputs variable indexes and target variables indexes

    self._dataset = dataset
    self._target = target
    self._loader_arguments: Dict = kwargs
    self._data_manager_instance = None
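
A hedged usage sketch with array-like inputs. The import path of TrainingPlans and the loader arguments (batch_size, shuffle) are assumptions; the exact accepted arguments depend on the underlying data manager:

import numpy as np
from fedbiomed.common.constants import TrainingPlans
from fedbiomed.common.data import DataManager

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

data_manager = DataManager(dataset=X, target=y, batch_size=16, shuffle=True)
data_manager.load(tp_type=TrainingPlans.TorchTrainingPlan)
# Array-like inputs are wrapped in a TabularDataset and handled by a TorchDataManager.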

Functions

extend_loader_args
extend_loader_args(extension)

Extends the class' loader arguments

Extends the class's _loader_arguments attribute with additional key-value pairs from the extension argument. If a key already exists in _loader_arguments, it is not replaced.

Parameters:

Name Type Description Default
extension Dict

the mapping used to extend the loader arguments

required
Source code in fedbiomed/common/data/_data_manager.py
def extend_loader_args(self, extension: Dict):
    """Extends the class' loader arguments

    Extends the class's `_loader_arguments` attribute with additional key-values from
    the `extension` argument. If a key already exists in the `_loader_arguments`, then
    it is not replaced.

    Args:
        extension: the mapping used to extend the loader arguments
    """
    self._loader_arguments.update(
        {key: value for key, value in extension.items() if key not in self._loader_arguments}
    )
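
A short hedged sketch of the non-overwriting behaviour, reusing the X and y arrays from the example above:

dm = DataManager(dataset=X, target=y, batch_size=16)
dm.extend_loader_args({'batch_size': 64, 'shuffle': True})
# batch_size stays 16 because it was already set; shuffle=True is added.
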
load
load(tp_type)

Loads the proper DataManager based on the given training plan type and the dataset and target attributes.

Parameters:

Name Type Description Default
tp_type TrainingPlans

Enumeration instance of TrainingPlans that stands for type of training plan.

required

Raises:

Type Description
FedbiomedDataManagerError

If the requested DataManager does not match the given arguments.

Source code in fedbiomed/common/data/_data_manager.py
def load(self, tp_type: TrainingPlans):
    """Loads proper DataManager based on given TrainingPlan and
    `dataset`, `target` attributes.

    Args:
        tp_type: Enumeration instance of TrainingPlans that stands for type of training plan.

    Raises:
        FedbiomedDataManagerError: If requested DataManager does not match with given arguments.

    """

    # Training plan is of type TorchTrainingPlan
    if tp_type == TrainingPlans.TorchTrainingPlan:
        if self._target is None and isinstance(self._dataset, Dataset):
            # Create Dataset for pytorch
            self._data_manager_instance = TorchDataManager(dataset=self._dataset, **self._loader_arguments)
        elif isinstance(self._dataset, (pd.DataFrame, pd.Series, np.ndarray)) and \
                isinstance(self._target, (pd.DataFrame, pd.Series, np.ndarray)):
            # If `dataset` and `target` attributes are array-like object
            # create TabularDataset object to instantiate a TorchDataManager
            torch_dataset = TabularDataset(inputs=self._dataset, target=self._target)
            self._data_manager_instance = TorchDataManager(dataset=torch_dataset, **self._loader_arguments)
        else:
            raise FedbiomedDataManagerError(f"{ErrorNumbers.FB607.value}: Invalid arguments for torch based "
                                            f"training plan, either provide the argument  `dataset` as PyTorch "
                                            f"Dataset instance, or provide `dataset` and `target` arguments as "
                                            f"an instance one of pd.DataFrame, pd.Series or np.ndarray ")

    elif tp_type == TrainingPlans.SkLearnTrainingPlan:
        # Try to convert `torch.utils.Data.Dataset` to SkLearnBased dataset/datamanager
        if self._target is None and isinstance(self._dataset, Dataset):
            torch_data_manager = TorchDataManager(dataset=self._dataset)
            try:
                self._data_manager_instance = torch_data_manager.to_sklearn()
            except Exception as e:
                raise FedbiomedDataManagerError(f"{ErrorNumbers.FB607.value}: PyTorch based `Dataset` object "
                                                "has been instantiated with DataManager. An error occurred while"
                                                "trying to convert torch.utils.data.Dataset to numpy based "
                                                f"dataset: {str(e)}")

        # For scikit-learn based training plans, the arguments `dataset` and `target` should be an instance
        # one of `pd.DataFrame`, `pd.Series`, `np.ndarray`
        elif isinstance(self._dataset, (pd.DataFrame, pd.Series, np.ndarray)) and \
                isinstance(self._target, (pd.DataFrame, pd.Series, np.ndarray)):
            # Create Dataset for SkLearn training plans
            self._data_manager_instance = SkLearnDataManager(inputs=self._dataset, target=self._target,
                                                             **self._loader_arguments)
        else:
            raise FedbiomedDataManagerError(f"{ErrorNumbers.FB607.value}: The argument `dataset` and `target` "
                                            f"should be instance of pd.DataFrame, pd.Series or np.ndarray ")
    else:
        raise FedbiomedDataManagerError(f"{ErrorNumbers.FB607.value}: Undefined training plan")

FlambyDataset

FlambyDataset()

Bases: DataLoadingPlanMixin, Dataset

A federated Flamby dataset.

A FlambyDataset is a wrapper around a flamby FedClass instance, adding functionalities and interfaces that are specific to Fed-BioMed.

A FlambyDataset is always created in an empty state, and it requires a DataLoadingPlan to be finalized to a correct state. The DataLoadingPlan must contain at least the following DataLoadingBlock key-value pair:

  - FlambyLoadingBlockTypes.FLAMBY_DATASET_METADATA : FlambyDatasetMetadataBlock

The lifecycle of the DataLoadingPlan and the wrapped FedClass are tightly interlinked: when the DataLoadingPlan is set, the wrapped FedClass is initialized and instantiated. When the DataLoadingPlan is cleared, the wrapped FedClass is also cleared. Hence, an invariant of this class is that self._dlp and self.__flamby_fed_class should always be either both None, or both set to some value.

Attributes:

Name Type Description
_transform

a transform function of type MonaiTransform or TorchTransform that will be applied to every sample when data is loaded.

__flamby_fed_class

a private instance of the wrapped Flamby FedClass

Source code in fedbiomed/common/data/_flamby_dataset.py
def __init__(self):
    super().__init__()
    self.__flamby_fed_class = None
    self._transform = None
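
A hedged setup sketch. The import paths, the fed_ixi dataset name and the center id are assumptions, and FLamby (with the corresponding dataset) must be installed for the FedClass to initialize:

from fedbiomed.common.data import FlambyDataset, FlambyDatasetMetadataBlock, DataLoadingPlan
from fedbiomed.common.constants import FlambyLoadingBlockTypes  # assumed import path

metadata_block = FlambyDatasetMetadataBlock()
metadata_block.metadata.update({'flamby_dataset_name': 'fed_ixi',
                                'flamby_center_id': 0})

dlp = DataLoadingPlan({FlambyLoadingBlockTypes.FLAMBY_DATASET_METADATA: metadata_block})

dataset = FlambyDataset()          # created in an empty state
dataset.set_dlp(dlp)               # initializes the wrapped FedClass
dataset.get_center_id()            # 0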

Functions

clear_dlp
clear_dlp()

Clears dlp and automatically clears the FedClass

Tries to guarantee some semblance of integrity by also clearing the FedClass, since setting the dlp initializes it.

Source code in fedbiomed/common/data/_flamby_dataset.py
def clear_dlp(self):
    """Clears dlp and automatically clears the FedClass

    Tries to guarantee some semblance of integrity by also clearing the FedClass, since setting the dlp
    initializes it.
    """
    super().clear_dlp()
    self._clear()
get_center_id
get_center_id()

Returns the center id. Requires that the DataLoadingPlan has already been set.

Returns:

Type Description
int

the center id (int).

Raises: FedbiomedDatasetError: in one of the two scenarios below:

  - if the data loading plan is not set or is malformed
  - if the wrapped FedClass is not initialized but the dlp exists

Source code in fedbiomed/common/data/_flamby_dataset.py
@_check_fed_class_initialization_status(require_initialized=True,
                                        require_uninitialized=False,
                                        message="Flamby dataset is in an inconsistent state: a Data Loading Plan "
                                                "is set but the wrapped FedClass was not initialized.")
@_requires_dlp
def get_center_id(self) -> int:
    """Returns the center id. Requires that the DataLoadingPlan has already been set.

    Returns:
        the center id (int).
    Raises:
        FedbiomedDatasetError: in one of the two scenarios below
            - if the data loading plan is not set or is malformed.
            - if the wrapped FedClass is not initialized but the dlp exists
    """
    return self.apply_dlb(None, FlambyLoadingBlockTypes.FLAMBY_DATASET_METADATA)['flamby_center_id']
get_dataset_type staticmethod
get_dataset_type()

Returns the Flamby DatasetType

Source code in fedbiomed/common/data/_flamby_dataset.py
@staticmethod
def get_dataset_type() -> DatasetTypes:
    """Returns the Flamby DatasetType"""
    return DatasetTypes.FLAMBY
get_flamby_fed_class
get_flamby_fed_class()

Returns the instance of the wrapped Flamby FedClass

Source code in fedbiomed/common/data/_flamby_dataset.py
def get_flamby_fed_class(self):
    """Returns the instance of the wrapped Flamby FedClass"""
    return self.__flamby_fed_class
get_transform
get_transform()

Gets the transform attribute

Source code in fedbiomed/common/data/_flamby_dataset.py
def get_transform(self):
    """Gets the transform attribute"""
    return self._transform
init_transform
init_transform(transform)

Initializes the transform attribute. Must be called before initialization of the wrapped FedClass.

Parameters:

Name Type Description Default
transform Union[Compose, Compose]

a composed transform of type torchvision.transforms.Compose or monai.transforms.Compose

required

Raises:

Type Description
FedbiomedDatasetError

if the wrapped FedClass was already initialized.

FedbiomedDatasetValueError

if the input is not of the correct type.

Source code in fedbiomed/common/data/_flamby_dataset.py
@_check_fed_class_initialization_status(require_initialized=False,
                                        require_uninitialized=True,
                                        message="Calling init_transform is not allowed if the wrapped FedClass "
                                                "has already been initialized. At your own risk, you may call "
                                                "clear_dlp to reset the full FlambyDataset")
def init_transform(self, transform: Union[MonaiCompose, TorchCompose]) -> Union[MonaiCompose, TorchCompose]:
    """Initializes the transform attribute. Must be called before initialization of the wrapped FedClass.

    Arguments:
        transform: a composed transform of type torchvision.transforms.Compose or monai.transforms.Compose

    Raises:
        FedbiomedDatasetError: if the wrapped FedClass was already initialized.
        FedbiomedDatasetValueError: if the input is not of the correct type.
    """
    if not isinstance(transform, (MonaiCompose, TorchCompose)):
        msg = f"{ErrorNumbers.FB618.value}. FlambyDataset transform must be of type " \
              f"torchvision.transforms.Compose or monai.transforms.Compose"
        logger.critical(msg)
        raise FedbiomedDatasetValueError(msg)

    self._transform = transform
    return self._transform
set_dlp
set_dlp(dlp)

Sets the Data Loading Plan and ensures that the flamby_fed_class is initialized

Overrides the set_dlp function from the DataLoadingPlanMixin to make sure that self._init_flamby_fed_class is also called immediately after.

Source code in fedbiomed/common/data/_flamby_dataset.py
def set_dlp(self, dlp):
    """Sets the Data Loading Plan and ensures that the flamby_fed_class is initialized

    Overrides the set_dlp function from the DataLoadingPlanMixin to make sure that self._init_flamby_fed_class
    is also called immediately after.
    """
    super().set_dlp(dlp)
    try:
        self._init_flamby_fed_class()
    except FedbiomedDatasetError as e:
        # clean up
        super().clear_dlp()
        raise FedbiomedDatasetError from e
shape
shape()

Returns the shape of the flamby_fed_class

Source code in fedbiomed/common/data/_flamby_dataset.py
@_check_fed_class_initialization_status(require_initialized=True,
                                        require_uninitialized=False,
                                        message="Cannot compute shape because FedClass was not initialized.")
def shape(self) -> List[int]:
    """Returns the shape of the flamby_fed_class"""
    return [len(self)] + list(self.__getitem__(0)[0].shape)

FlambyDatasetMetadataBlock

FlambyDatasetMetadataBlock()

Bases: DataLoadingBlock

Metadata about a Flamby Dataset.

Includes information on:

  - the identity of the type of flamby dataset (e.g. fed_ixi, fed_heart, etc.)
  - the ID of the center of the flamby dataset

Source code in fedbiomed/common/data/_flamby_dataset.py
def __init__(self):
    super().__init__()
    self.metadata = {
        "flamby_dataset_name": None,
        "flamby_center_id": None
    }
    self._serialization_validator.update_validation_scheme(
        FlambyDatasetMetadataBlock._extra_validation_scheme())

Attributes

metadata instance-attribute
metadata = {'flamby_dataset_name': None, 'flamby_center_id': None}

Functions

apply
apply()

Returns a dictionary of dataset metadata.

The metadata dictionary contains:

  - flamby_dataset_name: (str) the name of the selected flamby dataset
  - flamby_center_id: (int) the center id selected at dataset add time

Note that the flamby_dataset_name will be the same as the module name required to instantiate the FedClass. However, it will not contain the full module path, hence to properly import this module it must be prepended with flamby.datasets, for example import flamby.datasets.flamby_dataset_name

Returns:

Type Description
dict

this data loading block's metadata

Source code in fedbiomed/common/data/_flamby_dataset.py
def apply(self) -> dict:
    """Returns a dictionary of dataset metadata.

    The metadata dictionary contains:
    - flamby_dataset_name: (str) the name of the selected flamby dataset.
    - flamby_center_id: (int) the center id selected at dataset add time.

    Note that the flamby_dataset_name will be the same as the module name required to instantiate the FedClass.
    However, it will not contain the full module path, hence to properly import this module it must be
    prepended with `flamby.datasets`, for example `import flamby.datasets.flamby_dataset_name`

    Returns:
        this data loading block's metadata
    """
    if any([v is None for v in self.metadata.values()]):
        msg = f"{ErrorNumbers.FB316}. Attempting to read Flamby dataset metadata, but " \
              f"the {[k for k,v in self.metadata.items() if v is None]} keys were not previously set."
        logger.critical(msg)
        raise FedbiomedLoadingBlockError(msg)
    return self.metadata
deserialize
deserialize(load_from)

Reconstruct the DataLoadingBlock from a serialized version.

Parameters:

Name Type Description Default
load_from dict

a dictionary as obtained by the serialize function.

required

Returns: the self instance

Source code in fedbiomed/common/data/_flamby_dataset.py
def deserialize(self, load_from: dict) -> DataLoadingBlock:
    """Reconstruct the DataLoadingBlock from a serialized version.

    Args:
        load_from: a dictionary as obtained by the serialize function.
    Returns:
        the self instance
    """
    super().deserialize(load_from)
    self.metadata['flamby_dataset_name'] = load_from['flamby_dataset_name']
    self.metadata['flamby_center_id'] = load_from['flamby_center_id']
    return self
serialize
serialize()

Serializes the class in a format similar to json.

Returns:

Type Description
dict

a dictionary of key-value pairs sufficient for reconstructing the DataLoadingBlock.

Source code in fedbiomed/common/data/_flamby_dataset.py
def serialize(self) -> dict:
    """Serializes the class in a format similar to json.

    Returns:
         a dictionary of key-value pairs sufficient for reconstructing
         the DataLoadingBlock.
    """
    ret = super().serialize()
    ret.update({'flamby_dataset_name': self.metadata['flamby_dataset_name'],
                'flamby_center_id': self.metadata['flamby_center_id']
                })
    return ret

FlambyLoadingBlockTypes

FlambyLoadingBlockTypes(*args)

Bases: DataLoadingBlockTypes, Enum

Additional DataLoadingBlockTypes specific to Flamby data

Source code in fedbiomed/common/constants.py
def __init__(self, *args):
    cls = self.__class__
    if not isinstance(self.value, str):
        raise ValueError("all fields of DataLoadingBlockTypes subclasses"
                         " must be of str type")
    if any(self.value == e.value for e in cls):
        a = self.name
        e = cls(self.value).name
        raise ValueError(
            f"duplicate values not allowed in DataLoadingBlockTypes and "
            f"its subclasses: {a} --> {e}")

Attributes

FLAMBY_DATASET_METADATA class-attribute instance-attribute
FLAMBY_DATASET_METADATA = 'flamby_dataset_metadata'

MapperBlock

MapperBlock()

Bases: DataLoadingBlock

A DataLoadingBlock for mapping values.

This DataLoadingBlock can be used whenever an "indirect mapping" is needed. For example, it can be used to implement a correspondence between a set of "logical" abstract names and a set of folder names on the filesystem.

The apply function of this DataLoadingBlock takes a "key" as input (a str) and returns the mapped value corresponding to map[key]. Note that while the constructor of this class sets a value for type_id, developers are encouraged to set a more meaningful value that better suits their application.

Multiple instances of this loading_block may be used in the same DataLoadingPlan, provided that they are given different type_id via the constructor.

Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self):
    super(MapperBlock, self).__init__()
    self.map = {}
    self._serialization_validator.update_validation_scheme(MapperBlock._extra_validation_scheme())
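
A brief sketch of the apply behaviour (the mapping content is illustrative):

mapper = MapperBlock()
mapper.map = {'T1': 'T1w_images', 'T2': 'T2w_images'}
mapper.apply('T1')      # 'T1w_images'
mapper.apply('FLAIR')   # raises FedbiomedLoadingBlockError: key not in mapping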

Attributes

map instance-attribute
map = {}

Functions

apply
apply(key)

Returns the value mapped to the key, if it exists.

Raises:

Type Description
FedbiomedLoadingBlockError

if map is not a dict or the key does not exist.

Source code in fedbiomed/common/data/_data_loading_plan.py
def apply(self, key):
    """Returns the value mapped to the key, if it exists.

    Raises:
        FedbiomedLoadingBlockError: if map is not a dict or the key does not exist.
    """
    if not isinstance(self.map, dict) or key not in self.map:
        msg = f"{ErrorNumbers.FB614.value} Mapper block error: no key '{key}' in mapping dictionary"
        logger.debug(msg)
        raise FedbiomedLoadingBlockError(msg)
    return self.map[key]
deserialize
deserialize(load_from)

Reconstruct the DataLoadingBlock from a serialized version.

Parameters:

Name Type Description Default
load_from dict

a dictionary as obtained by the serialize function.

required

Returns: the self instance

Source code in fedbiomed/common/data/_data_loading_plan.py
def deserialize(self, load_from: dict) -> DataLoadingBlock:
    """Reconstruct the [DataLoadingBlock][fedbiomed.common.data._data_loading_plan.DataLoadingBlock]
    from a serialized version.

    Args:
        load_from (dict): a dictionary as obtained by the serialize function.
    Returns:
        the self instance
    """
    super(MapperBlock, self).deserialize(load_from)
    self.map = load_from['map']
    return self
serialize
serialize()

Serializes the class in a format similar to json.

Returns:

Type Description
dict

a dictionary of key-value pairs sufficient for reconstructing the DataLoadingBlock.
Source code in fedbiomed/common/data/_data_loading_plan.py
def serialize(self) -> dict:
    """Serializes the class in a format similar to json.

    Returns:
        a dictionary of key-value pairs sufficient for reconstructing
        the [DataLoadingBlock][fedbiomed.common.data._data_loading_plan.DataLoadingBlock].
    """
    ret = super(MapperBlock, self).serialize()
    ret.update({'map': self.map})
    return ret

MedicalFolderBase

MedicalFolderBase(root=None)

Bases: DataLoadingPlanMixin

Controller class for Medical Folder dataset.

Contains methods to validate the MedicalFolder folder hierarchy and extract folder-based metadata such as modalities, number of subjects, etc.

Parameters:

Name Type Description Default
root Union[str, Path, None]

path to Medical Folder root folder.

None
Source code in fedbiomed/common/data/_medical_datasets.py
def __init__(self, root: Union[str, Path, None] = None):
    """Constructs MedicalFolderBase

    Args:
        root: path to Medical Folder root folder.
    """
    super(MedicalFolderBase, self).__init__()

    if root is not None:
        root = self.validate_MedicalFolder_root_folder(root)

    self._root = root

Attributes

default_modality_names class-attribute instance-attribute
default_modality_names = ['T1', 'T2', 'label']
root property writable
root

Root property of MedicalFolderController

Functions

available_subjects
available_subjects(subjects_from_index, subjects_from_folder=None)

Checks missing subject folders and missing entries in demographics

Parameters:

Name Type Description Default
subjects_from_index Union[list, Series]

Given subject folder names in demographics

required
subjects_from_folder list

List of subject folder names to get intersection of given subject_from_index

None

Returns:

Name Type Description
available_subjects list[str]

subjects that have an imaging data folder and are also present in the demographics file

missing_subject_folders list[str]

subjects that are in the demographics file but do not have an imaging data folder

missing_entries list[str]

subjects that have an imaging data folder but are not present in the demographics file

Source code in fedbiomed/common/data/_medical_datasets.py
def available_subjects(self,
                       subjects_from_index: Union[list, pd.Series],
                       subjects_from_folder: list = None) -> tuple[list[str], list[str], list[str]]:
    """Checks missing subject folders and missing entries in demographics

    Args:
        subjects_from_index: Given subject folder names in demographics
        subjects_from_folder: List of subject folder names to get intersection of given subject_from_index

    Returns:
        available_subjects: subjects that have an imaging data folder and are also present in the demographics file
        missing_subject_folders: subjects that are in the demographics file but do not have an imaging data folder
        missing_entries: subjects that have an imaging data folder but are not present in the demographics file
    """

    # Select all subject folders if it is not given
    if subjects_from_folder is None:
        subjects_from_folder = self.subjects_with_imaging_data_folders()

    # Missing subject that will cause warnings
    missing_subject_folders = list(set(subjects_from_index) - set(subjects_from_folder))

    # Missing entries that will cause errors
    missing_entries = list(set(subjects_from_folder) - set(subjects_from_index))

    # Intersection
    available_subjects = list(set(subjects_from_index).intersection(set(subjects_from_folder)))

    return available_subjects, missing_subject_folders, missing_entries
complete_subjects
complete_subjects(subjects, modalities)

Retrieves subjects that have given all the modalities.

Parameters:

Name Type Description Default
subjects List[str]

List of subject folder names

required
modalities List[str]

List of required modalities

required

Returns:

Type Description
List[str]

List of subject folder names that have required modalities

Source code in fedbiomed/common/data/_medical_datasets.py
def complete_subjects(self, subjects: List[str], modalities: List[str]) -> List[str]:
    """Retrieves subjects that have given all the modalities.

    Args:
        subjects: List of subject folder names
        modalities: List of required modalities

    Returns:
        List of subject folder names that have required modalities
    """
    return [subject for subject in subjects if all(self.is_modalities_existing(subject, modalities))]
demographics_column_names staticmethod
demographics_column_names(path)
Source code in fedbiomed/common/data/_medical_datasets.py
@staticmethod
def demographics_column_names(path: Union[str, Path]):
    return MedicalFolderBase.read_demographics(path).columns.values
get_dataset_type staticmethod
get_dataset_type()
Source code in fedbiomed/common/data/_medical_datasets.py
@staticmethod
def get_dataset_type() -> DatasetTypes:
    return DatasetTypes.MEDICAL_FOLDER
is_modalities_existing
is_modalities_existing(subject, modalities)

Checks whether given modalities exists in the subject directory

Parameters:

Name Type Description Default
subject str

Subject ID or subject folder name

required
modalities List[str]

List of modalities to check

required

Returns:

Type Description
List[bool]

List of bool indicating, for each given modality, whether it exists.

Raises:

Type Description
FedbiomedDatasetError

bad argument type

Source code in fedbiomed/common/data/_medical_datasets.py
def is_modalities_existing(self, subject: str, modalities: List[str]) -> List[bool]:
    """Checks whether given modalities exists in the subject directory

    Args:
        subject: Subject ID or subject folder name
        modalities: List of modalities to check

    Returns:
        List of `bool` that represents whether modality is existing respectively for each of modality.

    Raises:
        FedbiomedDatasetError: bad argument type
    """
    if not isinstance(subject, str):
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Expected string for subject folder/ID, "
                                    f"but got {type(subject)}")
    if not isinstance(modalities, list):
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Expected a list for modalities, "
                                    f"but got {type(modalities)}")
    if not all([type(m) is str for m in modalities]):
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Expected a list of string for modalities, "
                                    f"but some modalities are "
                                    f"{' '.join([ str(type(m) for m in modalities if type(m) != str)])}")
    are_modalities_existing = list()
    for modality in modalities:
        modality_folder = self._subject_modality_folder(subject, modality)
        are_modalities_existing.append(bool(modality_folder) and
                                       self._root.joinpath(subject, modality_folder).is_dir())
    return are_modalities_existing
modalities
modalities()

Gets all modalities based either on all possible candidates or those provided by the DataLoadingPlan.

Returns:

Type Description
list

List of unique available modalities

list

List of all encountered modality folders in each subject folder, appearing once per folder

Source code in fedbiomed/common/data/_medical_datasets.py
def modalities(self) -> Tuple[list, list]:
    """Gets all modalities based either on all possible candidates or those provided by the DataLoadingPlan.

    Returns:
         List of unique available modalities
         List of all encountered modality folders in each subject folder, appearing once per folder
    """
    modality_candidates, modality_folders_list = self.modalities_candidates_from_subfolders()
    if self._dlp is not None and MedicalFolderLoadingBlockTypes.MODALITIES_TO_FOLDERS in self._dlp:
        modalities = list(self._dlp[MedicalFolderLoadingBlockTypes.MODALITIES_TO_FOLDERS].map.keys())
        return modalities, modality_folders_list
    else:
        return modality_candidates, modality_folders_list
modalities_candidates_from_subfolders
modalities_candidates_from_subfolders()

Gets all possible modality folders under root directory

Returns:

Type Description
list

List of unique available modality folders appearing at least once

list

List of all encountered modality folders in each subject folder, appearing once per folder

Source code in fedbiomed/common/data/_medical_datasets.py
def modalities_candidates_from_subfolders(self) -> Tuple[list, list]:
    """ Gets all possible modality folders under root directory

    Returns:
         List of unique available modality folders appearing at least once
         List of all encountered modality folders in each subject folder, appearing once per folder
    """

    # Accept only folders that don't start with "." and "_"
    modalities = [f.name for f in self._root.glob("*/*") if f.is_dir() and not f.name.startswith((".", "_"))]
    return sorted(list(set(modalities))), modalities
read_demographics staticmethod
read_demographics(path, index_col=None)

Read demographics tabular file for Medical Folder dataset

Raises:

Type Description
FedbiomedDatasetError

bad file format

Source code in fedbiomed/common/data/_medical_datasets.py
@staticmethod
def read_demographics(path: Union[str, Path], index_col: Optional[int] = None):
    """ Read demographics tabular file for Medical Folder dataset

    Raises:
        FedbiomedDatasetError: bad file format
    """
    path = Path(path)
    if not path.is_file() or path.suffix.lower() not in [".csv", ".tsv"]:
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Demographics should be CSV or TSV files")

    return pd.read_csv(path, index_col=index_col, engine='python')
subjects_with_imaging_data_folders
subjects_with_imaging_data_folders()

Retrieves subject folder names under Medical Folder root directory.

Returns:

Type Description
List[str]

subject folder names under Medical Folder root directory.

Source code in fedbiomed/common/data/_medical_datasets.py
def subjects_with_imaging_data_folders(self) -> List[str]:
    """Retrieves subject folder names under Medical Folder root directory.

    Returns:
        subject folder names under Medical Folder root directory.
    """
    return [f.name for f in self._root.iterdir() if f.is_dir() and not f.name.startswith(".")]
validate_MedicalFolder_root_folder staticmethod
validate_MedicalFolder_root_folder(path)

Validates Medical Folder root directory by checking folder structure

Parameters:

Name Type Description Default
path Union[str, Path]

path to root directory

required

Returns:

Type Description
Path

Path to root folder of Medical Folder dataset

Raises:

Type Description
FedbiomedDatasetError
  - If path is not an instance of str or pathlib.Path
  - If path is not a directory
Source code in fedbiomed/common/data/_medical_datasets.py
@staticmethod
def validate_MedicalFolder_root_folder(path: Union[str, Path]) -> Path:
    """ Validates Medical Folder root directory by checking folder structure

    Args:
        path: path to root directory

    Returns:
        Path to root folder of Medical Folder dataset

    Raises:
        FedbiomedDatasetError: - If path is not an instance of `str` or `pathlib.Path`
                               - If path is not a directory
    """
    if not isinstance(path, (Path, str)):
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: The argument root should an instance of "
                                    f"`Path` or `str`, but got {type(path)}")

    if not isinstance(path, Path):
        path = Path(path)

    path = Path(path).expanduser().resolve()

    if not path.exists():
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Folder or file {path} not found on system")
    if not path.is_dir():
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Root for Medical Folder dataset "
                                    f"should be a directory.")

    directories = [f for f in path.iterdir() if f.is_dir()]
    if len(directories) == 0:
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Root folder of Medical Folder should "
                                    f"contain subject folders, but no sub folder has been found. ")

    modalities = [f for f in path.glob("*/*") if f.is_dir()]
    if len(modalities) == 0:
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value} Subject folders for Medical Folder should "
                                    f"contain modalities as folders. Folder structure should be "
                                    f"root/<subjects>/<modalities>")

    return path

MedicalFolderController

MedicalFolderController(root=None)

Bases: MedicalFolderBase

Utility class to construct and verify Medical Folder datasets without knowledge of the experiment.

The purpose of this class is to enable key functionalities related to the MedicalFolderDataset at the time of dataset deployment, i.e. when the data is being added to the node's database.

Specifically, the MedicalFolderController class can be used to:

  - construct a MedicalFolderDataset with all available data modalities, without knowing which ones will be used as targets or features during an experiment
  - validate that the proper folder structure has been respected by the data managers preparing the data
  - identify which subjects have which modalities

Parameters:

Name Type Description Default
root str

Folder path to dataset. Defaults to None.

None
Source code in fedbiomed/common/data/_medical_datasets.py
def __init__(self, root: str = None):
    """Constructs MedicalFolderController

    Args:
        root: Folder path to dataset. Defaults to None.
    """
    super(MedicalFolderController, self).__init__(root=root)
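
A hedged usage sketch at dataset-deployment time; the root path, demographics file and index column are illustrative:

from fedbiomed.common.data import MedicalFolderController

controller = MedicalFolderController(root='/data/MedicalFolder_root')
modalities, _ = controller.modalities()
status = controller.subject_modality_status()   # which subjects have which modalities

dataset = controller.load_MedicalFolder(
    tabular_file='/data/MedicalFolder_root/demographics.csv',
    index_col='FOLDER_NAME')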

Functions

load_MedicalFolder
load_MedicalFolder(tabular_file=None, index_col=None)

Load Medical Folder dataset with given tabular_file and index_col

Parameters:

Name Type Description Default
tabular_file Union[str, Path]

File path to demographics data set

None
index_col Union[str, int]

Column index that represents subject folder names

None

Returns:

Type Description
MedicalFolderDataset

MedicalFolderDataset object

Raises:

Type Description
FedbiomedDatasetError

If Medical Folder dataset is not successfully loaded

Source code in fedbiomed/common/data/_medical_datasets.py
def load_MedicalFolder(self,
                       tabular_file: Union[str, Path] = None,
                       index_col: Union[str, int] = None) -> MedicalFolderDataset:
    """ Load Medical Folder dataset with given tabular_file and index_col

    Args:
        tabular_file: File path to demographics data set
        index_col: Column index that represents subject folder names

    Returns:
        MedicalFolderDataset object

    Raises:
        FedbiomedDatasetError: If Medical Folder dataset is not successfully loaded
    """
    if self._root is None:
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Can not load Medical Folder dataset without "
                                    f"declaring root directory. Please set root or build MedicalFolderController "
                                    f"with by providing `root` argument use")

    modalities, _ = self.modalities()

    try:
        dataset = MedicalFolderDataset(root=self._root,
                                       tabular_file=tabular_file,
                                       index_col=index_col,
                                       data_modalities=modalities,
                                       target_modalities=modalities)
    except FedbiomedError as e:
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Can not create Medical Folder dataset. {e}")

    if self._dlp is not None:
        dataset.set_dlp(self._dlp)
    return dataset
subject_modality_status
subject_modality_status(index=None)

Scans subjects and checks which modalities exist for each subject

Parameters:

Name Type Description Default
index Union[List, Series]

Array-like index that comes from reference csv file of Medical Folder dataset. It represents subject folder names. Defaults to None.

None

Returns: Modality status for each subject that indicates which modalities are available

Source code in fedbiomed/common/data/_medical_datasets.py
def subject_modality_status(self, index: Union[List, pd.Series] = None) -> Dict:
    """Scans subjects and checks which modalities are existing for each subject

    Args:
        index: Array-like index that comes from reference csv file of Medical Folder dataset. It represents subject
            folder names. Defaults to None.
    Returns:
        Modality status for each subject that indicates which modalities are available
    """

    modalities, _ = self.modalities()
    subjects = self.subjects_with_imaging_data_folders()
    modality_status = {"columns": [*modalities], "data": [], "index": []}

    if index is not None:
        _, missing_subjects, missing_entries = self.available_subjects(subjects_from_index=index)
        modality_status["columns"].extend(["in_folder", "in_index"])

    for subject in subjects:
        modality_report = self.is_modalities_existing(subject, modalities)
        status_list = [status for status in modality_report]
        if index is not None:
            status_list.append(False if subject in missing_subjects else True)
            status_list.append(False if subject in missing_entries else True)

        modality_status["data"].append(status_list)
        modality_status["index"].append(subject)

    return modality_status
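
The returned dictionary follows a split-like layout (columns, data, index), so it can be loaded directly into a pandas DataFrame for inspection. A sketch, reusing the controller from the previous example:

import pandas as pd

status = controller.subject_modality_status()

# One row per subject, one boolean column per modality (plus in_folder and
# in_index when an index is passed).
report = pd.DataFrame(data=status["data"],
                      index=status["index"],
                      columns=status["columns"])
print(report)
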

MedicalFolderDataset

MedicalFolderDataset(root, data_modalities='T1', transform=None, target_modalities='label', target_transform=None, demographics_transform=None, tabular_file=None, index_col=None)

Bases: Dataset, MedicalFolderBase

Torch dataset following the Medical Folder Structure.

The Medical Folder structure is loosely inspired by the BIDS standard [1]. It should respect the following pattern:

└─ MedicalFolder_root/
    └─ demographics.csv
    └─ sub-01/
        ├─ T1/
        │  └─ sub-01_xxx.nii.gz
        └─ T2/
            ├─ sub-01_xxx.nii.gz
where the first-level subfolders of the root correspond to the subjects, and each subject's folder contains subfolders for each imaging modality. Images should be in Nifti format, with either the .nii or .nii.gz extensions. Finally, within the root folder there should also be a demographics file containing at least one index column with the names of the subject folders. This column will be used to explore the data and load the images. The demographics file may contain additional information about each subject and will be loaded alongside the images by our framework.

[1] https://bids.neuroimaging.io/

Parameters:

Name Type Description Default
root Union[str, PathLike, Path]

Root folder containing all the subject directories.

required
data_modalities (str, Iterable)

Modality or modalities to be used as data sources.

'T1'
transform Union[Callable, Dict[str, Callable]]

A function or dict of function transform(s) that preprocess each data source.

None
target_modalities Optional[Union[str, Iterable[str]]]

(str, Iterable): Modality or modalities to be used as target sources.

'label'
target_transform Union[Callable, Dict[str, Callable]]

A function or dict of function transform(s) that preprocess each target source.

None
demographics_transform Optional[Callable]

Optional function that preprocesses the demographics data loaded from the tabular file.

None
tabular_file Union[str, PathLike, Path, None]

Path to a CSV or Excel file containing the demographic information from the patients.

None
index_col Union[int, str, None]

Column name in the tabular file containing the subject ids which must match the folder names.

None
Source code in fedbiomed/common/data/_medical_datasets.py
def __init__(self,
             root: Union[str, PathLike, Path],
             data_modalities: Optional[Union[str, Iterable[str]]] = 'T1',
             transform: Union[Callable, Dict[str, Callable]] = None,
             target_modalities: Optional[Union[str, Iterable[str]]] = 'label',
             target_transform: Union[Callable, Dict[str, Callable]] = None,
             demographics_transform: Optional[Callable] = None,
             tabular_file: Union[str, PathLike, Path, None] = None,
             index_col: Union[int, str, None] = None,
             ):
    """Constructor for class `MedicalFolderDataset`.

    Args:
        root: Root folder containing all the subject directories.
        data_modalities (str, Iterable): Modality or modalities to be used as data sources.
        transform: A function or dict of function transform(s) that preprocess each data source.
        target_modalities: (str, Iterable): Modality or modalities to be used as target sources.
        target_transform: A function or dict of function transform(s) that preprocess each target source.
        demographics_transform: Optional function that preprocesses the demographics data loaded from the tabular file.
        tabular_file: Path to a CSV or Excel file containing the demographic information from the patients.
        index_col: Column name in the tabular file containing the subject ids which must match the folder names.
    """
    super(MedicalFolderDataset, self).__init__(root=root)

    self._tabular_file = tabular_file
    self._index_col = index_col

    self._data_modalities = [data_modalities] if isinstance(data_modalities, str) else data_modalities
    self._target_modalities = [target_modalities] if isinstance(target_modalities, str) else target_modalities

    self._transform = self._check_and_reformat_transforms(transform, data_modalities)
    self._target_transform = self._check_and_reformat_transforms(target_transform, target_modalities)
    self._demographics_transform = demographics_transform if demographics_transform is not None else lambda x: {}

    # Image loader
    self._reader = Compose([
        LoadImage(ITKReader(), image_only=True),
        ToTensor()
    ])
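
A minimal instantiation sketch, assuming a hypothetical root containing T1 and label modality folders plus a participants.csv demographics file with a participant_id column; transform and target_transform also accept a dict keyed by modality when several modalities are requested:

from fedbiomed.common.data import MedicalFolderDataset

dataset = MedicalFolderDataset(
    root="/data/medical_folder",                           # hypothetical path
    data_modalities="T1",
    target_modalities="label",
    tabular_file="/data/medical_folder/participants.csv",  # hypothetical path
    index_col="participant_id",                            # hypothetical column name
)
print(len(dataset.subject_folders()), "complete subjects found")
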

Attributes

ALLOWED_EXTENSIONS class-attribute instance-attribute
ALLOWED_EXTENSIONS = ['.nii', '.nii.gz']
demographics cached property
demographics

Loads tabular data file (supports excel, csv, tsv and colon separated value files).

index_col property writable
index_col

Getter/setter of the column containing folder's name (in the tabular file)

subjects_has_all_modalities property
subjects_has_all_modalities

Gets only the subjects that have all required modalities

subjects_registered_in_demographics cached property
subjects_registered_in_demographics

Gets only the subjects that are present in the demographics file.

tabular_file property writable
tabular_file

Functions

get_nontransformed_item
get_nontransformed_item(item)
Source code in fedbiomed/common/data/_medical_datasets.py
def get_nontransformed_item(self, item):
    # For the first item retrieve complete subject folders
    subjects = self.subject_folders()

    if not subjects:
        # case where subjects is an empty list (subject folders have not been found)
        raise FedbiomedDatasetError(
            f"{ErrorNumbers.FB613.value}: Cannot find complete subject folders with all the modalities")
    # Get subject folder
    subject_folder = subjects[item]

    # Load data modalities
    data = self.load_images(subject_folder, modalities=self._data_modalities)

    # Load target modalities
    targets = self.load_images(subject_folder, modalities=self._target_modalities)

    # Demographics
    demographics = self._get_from_demographics(subject_id=subject_folder.name)
    return (data, demographics), targets
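
A short sketch of the nested structure of the returned item, reusing the dataset from the instantiation sketch above:

# The first element groups image tensors and demographics, the second holds
# the target modalities.
(data, demographics), targets = dataset.get_nontransformed_item(0)
print(list(data.keys()))     # data modalities, e.g. ['T1']
print(list(targets.keys()))  # target modalities, e.g. ['label']
print(demographics)          # demographics entry for this subject, if any
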
load_images
load_images(subject_folder, modalities)

Loads modality images in given subject folder

Parameters:

Name Type Description Default
subject_folder Path

Subject folder where modalities are stored

required
modalities list

List of available modalities

required

Returns:

Type Description
Dict[str, Tensor]

Subject image data as a dictionary where keys represent each modality.

Source code in fedbiomed/common/data/_medical_datasets.py
def load_images(self, subject_folder: Path, modalities: list) -> Dict[str, torch.Tensor]:
    """Loads modality images in given subject folder

    Args:
        subject_folder: Subject folder where modalities are stored
        modalities: List of available modalities

    Returns:
        Subject image data as a dictionary where keys represent each modality.
    """
    subject_data = {}

    for modality in modalities:
        modality_folder = self._subject_modality_folder(subject_folder, modality)
        image_folder = subject_folder.joinpath(modality_folder)
        nii_files = [p.resolve() for p in image_folder.glob("**/*")
                     if ''.join(p.suffixes) in self.ALLOWED_EXTENSIONS]

        # Load the first, we assume there is going to be a single image per modality for now.
        img_path = nii_files[0]
        img = self._reader(img_path)
        subject_data[modality] = img

    return subject_data
set_dataset_parameters
set_dataset_parameters(parameters)

Sets dataset parameters.

Parameters:

Name Type Description Default
parameters dict

Parameters to initialize

required

Raises:

Type Description
FedbiomedDatasetError

If given parameters are not of dict type

Source code in fedbiomed/common/data/_medical_datasets.py
def set_dataset_parameters(self, parameters: dict):
    """Sets dataset parameters.

    Args:
        parameters: Parameters to initialize

    Raises:
        FedbiomedDatasetError: If given parameters are not of `dict` type
    """
    if not isinstance(parameters, dict):
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Expected type for `parameters` is `dict, "
                                    f"but got {type(parameters)}`")

    for key, value in parameters.items():
        if hasattr(self, key):
            setattr(self, key, value)
        else:
            raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Trying to set non existing attribute '{key}'")
shape
shape()

Retrieves shape information for modalities and demographics csv

Source code in fedbiomed/common/data/_medical_datasets.py
def shape(self) -> dict:
    """Retrieves shape information for modalities and demographics csv"""

    # Get all modalities
    data_modalities = list(set(self._data_modalities))
    target_modalities = list(set(self._target_modalities))
    modalities = list(set(self._data_modalities + self._target_modalities))
    (image, _), targets = self.get_nontransformed_item(0)

    result = {modality: list(image[modality].shape) for modality in data_modalities}

    result.update({modality: list(targets[modality].shape) for modality in target_modalities})
    num_modalities = len(modalities)
    demographics_shape = self.demographics.shape if self.demographics is not None else None
    result.update({"demographics": demographics_shape, "num_modalities": num_modalities})

    return result
subject_folders
subject_folders()

Retrieves the folder names of only those subjects that have all requested modalities

Returns:

Type Description
List[Path]

List of subject directories that have all requested modalities

Source code in fedbiomed/common/data/_medical_datasets.py
def subject_folders(self) -> List[Path]:
    """Retrieves subject folder names of only those who have their complete modalities

    Returns:
        List of subject directories that have all requested modalities
    """

    # If demographics are present
    if self._tabular_file and self._index_col is not None:
        complete_subject_folders = self.subjects_registered_in_demographics
    else:
        complete_subject_folders = self.subjects_has_all_modalities

    return [self._root.joinpath(folder) for folder in complete_subject_folders]

MedicalFolderLoadingBlockTypes

MedicalFolderLoadingBlockTypes(*args)

Bases: DataLoadingBlockTypes, Enum

Source code in fedbiomed/common/constants.py
def __init__(self, *args):
    cls = self.__class__
    if not isinstance(self.value, str):
        raise ValueError("all fields of DataLoadingBlockTypes subclasses"
                         " must be of str type")
    if any(self.value == e.value for e in cls):
        a = self.name
        e = cls(self.value).name
        raise ValueError(
            f"duplicate values not allowed in DataLoadingBlockTypes and "
            f"its subclasses: {a} --> {e}")

Attributes

MODALITIES_TO_FOLDERS class-attribute instance-attribute
MODALITIES_TO_FOLDERS = 'modalities_to_folders'

NIFTIFolderDataset

NIFTIFolderDataset(root, transform=None, target_transform=None)

Bases: Dataset

A Generic class for loading NIFTI Images using the folder structure as the target classes' labels.

Supported formats:

- NIFTI and compressed NIFTI files: .nii, .nii.gz

This is a Dataset useful in classification tasks. Its usage is quite simple, similar to torchvision.datasets.ImageFolder. Images must be contained in first level sub-folders (level 2+ sub-folders are ignored) that describe the target class they belong to (target class label is the name of the folder).

nifti_dataset_root_folder
├── control_group
│   ├── subject_1.nii
│   └── subject_2.nii
│   └── ...
└── disease_group
    ├── subject_3.nii
    └── subject_4.nii
    └── ...

In this example, there are 4 samples (one from each *.nii file) and 2 target classes, with labels control_group and disease_group. subject_1.nii has class label control_group, subject_3.nii has class label disease_group, etc.

Parameters:

Name Type Description Default
root Union[str, PathLike, Path]

folder where the data is located.

required
transform Union[Callable, None]

transforms to be applied on data.

None
target_transform Union[Callable, None]

transforms to be applied on target indexes.

None

Raises:

Type Description
FedbiomedDatasetError

bad argument type

FedbiomedDatasetError

bad root path

Source code in fedbiomed/common/data/_medical_datasets.py
def __init__(self, root: Union[str, PathLike, Path],
             transform: Union[Callable, None] = None,
             target_transform: Union[Callable, None] = None
             ):
    """Constructor of the class

    Args:
        root: folder where the data is located.
        transform: transforms to be applied on data.
        target_transform: transforms to be applied on target indexes.

    Raises:
        FedbiomedDatasetError: bad argument type
        FedbiomedDatasetError: bad root path
    """
    # check parameters type
    for tr, trname in ((transform, 'transform'), (target_transform, 'target_transform')):
        if not callable(tr) and tr is not None:
            raise FedbiomedDatasetError(f"{ErrorNumbers.FB612.value}: Parameter {trname} has incorrect "
                                        f"type {type(tr)}, cannot create dataset.")

    if not isinstance(root, str) and not isinstance(root, PathLike) and not isinstance(root, Path):
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB612.value}: Parameter `root` has incorrect type "
                                    f"{type(root)}, cannot create dataset.")

    # initialize object variables
    self._files = []
    self._class_labels = []
    self._targets = []

    try:
        self._root_dir = Path(root).expanduser()
    except RuntimeError as e:
        raise FedbiomedDatasetError(
            f"{ErrorNumbers.FB612.value}: Cannot expand path {root}, error message is: {e}")

    self._transform = transform
    self._target_transform = target_transform
    self._reader = Compose([
        LoadImage(ITKReader(), image_only=True),
        ToTensor()
    ])

    self._explore_root_folder()
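
A minimal usage sketch, assuming NIFTI files laid out as in the example tree above under the hypothetical path /data/nifti_dataset, and assuming the usual (image, target index) item structure suggested by the labels() and target_transform descriptions:

from fedbiomed.common.data import NIFTIFolderDataset

dataset = NIFTIFolderDataset(root="/data/nifti_dataset")

print(dataset.labels())               # e.g. ['control_group', 'disease_group']
print(len(dataset.files()), "samples found")

# Each item is assumed to be an (image_tensor, target_index) pair, where the
# target index points into dataset.labels().
image, target = dataset[0]
print(image.shape, dataset.labels()[int(target)])
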

Functions

files
files()

Retrieves the paths to the sample images.

Gives the same order as when retrieving the sample images (e.g. self.files[0] is the path to self.__getitem__[0])

Returns:

Type Description
List[Path]

List of the absolute paths to the sample images

Source code in fedbiomed/common/data/_medical_datasets.py
def files(self) -> List[Path]:
    """Retrieves the paths to the sample images.

    Gives the same order as when retrieving the sample images (e.g. `self.files[0]`
    is the path to `self.__getitem__[0]`)

    Returns:
        List of the absolute paths to the sample images
    """
    return self._files
labels
labels()

Retrieves the labels of the target classes.

Target label index is the index of the corresponding label in this list.

Returns:

Type Description
List[str]

List of the labels of the target classes.

Source code in fedbiomed/common/data/_medical_datasets.py
def labels(self) -> List[str]:
    """Retrieves the labels of the target classes.

    Target label index is the index of the corresponding label in this list.

    Returns:
        List of the labels of the target classes.
    """
    return self._class_labels

NPDataLoader

NPDataLoader(dataset, target, batch_size=1, shuffle=False, random_seed=None, drop_last=False)

DataLoader for a Numpy dataset.

This data loader encapsulates a dataset composed of numpy arrays and presents an Iterable interface. One design principle was to try to make the interface as similar as possible to a torch.DataLoader.

Attributes:

Name Type Description
_dataset

(np.ndarray) a 2d array of features

_target

(np.ndarray) an optional array of target values

_batch_size

(int) the number of elements in one batch

_shuffle

(bool) if True, shuffle the data at the beginning of every epoch

_drop_last

(bool) if True, drop the last batch if it does not contain batch_size elements

_rng

(np.random.Generator) the random number generator for shuffling

Parameters:

Name Type Description Default
dataset ndarray

2D Numpy array

required
target ndarray

Numpy array of target values

required
batch_size int

batch size for each iteration

1
shuffle bool

shuffle before iteration

False
random_seed Optional[int]

an optional integer to set the numpy random seed for shuffling. If it equals None, then no attempt will be made to set the random seed.

None
drop_last bool

whether to drop the last batch in case it does not fill the whole batch size

False
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def __init__(self,
             dataset: np.ndarray,
             target: np.ndarray,
             batch_size: int = 1,
             shuffle: bool = False,
             random_seed: Optional[int] = None,
             drop_last: bool = False):
    """Construct numpy data loader

    Args:
        dataset: 2D Numpy array
        target: Numpy array of target values
        batch_size: batch size for each iteration
        shuffle: shuffle before iteration
        random_seed: an optional integer to set the numpy random seed for shuffling. If it equals
            None, then no attempt will be made to set the random seed.
        drop_last: whether to drop the last batch in case it does not fill the whole batch size
    """

    if not isinstance(dataset, np.ndarray) or not isinstance(target, np.ndarray):
        msg = f"{ErrorNumbers.FB609.value}. Wrong input type for `dataset` or `target` in NPDataLoader. " \
              f"Expected type np.ndarray for both, instead got {type(dataset)} and" \
              f"{type(target)} respectively."
        logger.error(msg)
        raise FedbiomedTypeError(msg)

    # If the researcher gave a 1-dimensional dataset, we expand it to 2 dimensions
    if dataset.ndim == 1:
        dataset = dataset[:, np.newaxis]

    # If the researcher gave a 1-dimensional target, we expand it to 2 dimensions
    if target.ndim == 1:
        target = target[:, np.newaxis]

    if dataset.ndim != 2 or target.ndim != 2:
        raise FedbiomedValueError(
            f"{ErrorNumbers.FB609.value}. Wrong shape for `dataset` or `target` in "
            f"NPDataLoader. Expected 2-dimensional arrays, instead got {dataset.ndim}- "
            f"dimensional and {target.ndim}-dimensional arrays respectively.")

    if len(dataset) != len(target):
        raise FedbiomedValueError(
            f"{ErrorNumbers.FB609.value}. Inconsistent length for `dataset` and `target` "
            f"in NPDataLoader. Expected same length, instead got len(dataset)={len(dataset)}, "
            f"len(target)={len(target)}")

    if not isinstance(batch_size, int) or batch_size <= 0:
        raise FedbiomedValueError(
            f"{ErrorNumbers.FB609.value}. Wrong value for `batch_size` parameter of "
            f"NPDataLoader. Expected a non-zero positive integer, instead got value {batch_size}.")

    if random_seed is not None and not isinstance(random_seed, int):
        raise FedbiomedTypeError(
            f"{ErrorNumbers.FB609.value}. Wrong type for `random_seed` parameter of "
            f"NPDataLoader. Expected int or None, instead got {type(random_seed)}.")

    self._dataset = dataset
    self._target = target
    self._batch_size = batch_size
    self._shuffle = shuffle
    self._drop_last = drop_last
    self._rng = np.random.default_rng(random_seed)
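
A minimal iteration sketch. The assumption that each batch is a (features, target) pair of arrays follows from the torch.DataLoader analogy stated above and is not spelled out in this section:

import numpy as np
from fedbiomed.common.data import NPDataLoader

X = np.random.randn(10, 3)   # 10 samples, 3 features
y = np.random.randn(10)      # 1-d target, expanded to 2-d internally

loader = NPDataLoader(dataset=X, target=y, batch_size=4,
                      shuffle=True, random_seed=42, drop_last=True)

print(loader.n_remainder_samples())  # 10 % 4 == 2 samples dropped per epoch

# Assumed iteration contract: each batch yields (features, target) arrays.
for batch_x, batch_y in loader:
    print(batch_x.shape, batch_y.shape)
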

Attributes

dataset property
dataset

Returns the encapsulated dataset

This needs to be a property to harmonize the API with torch.DataLoader, enabling us to write generic code for both DataLoaders.

target property
target

Returns the array of target values

This has been made a property to have a homogeneous interface with the dataset property above.

Functions

batch_size
batch_size()

Returns the batch size

Source code in fedbiomed/common/data/_sklearn_data_manager.py
def batch_size(self) -> int:
    """Returns the batch size"""
    return self._batch_size
drop_last
drop_last()

Returns the boolean drop_last attribute

Source code in fedbiomed/common/data/_sklearn_data_manager.py
def drop_last(self) -> bool:
    """Returns the boolean drop_last attribute"""
    return self._drop_last
n_remainder_samples
n_remainder_samples()

Returns the remainder of the division between dataset length and batch size.

Source code in fedbiomed/common/data/_sklearn_data_manager.py
def n_remainder_samples(self) -> int:
    """Returns the remainder of the division between dataset length and batch size."""
    return len(self._dataset) % self._batch_size
rng
rng()

Returns the random number generator

Source code in fedbiomed/common/data/_sklearn_data_manager.py
def rng(self) -> np.random.Generator:
    """Returns the random number generator"""
    return self._rng
shuffle
shuffle()

Returns the boolean shuffle attribute

Source code in fedbiomed/common/data/_sklearn_data_manager.py
def shuffle(self) -> bool:
    """Returns the boolean shuffle attribute"""
    return self._shuffle

SerializationValidation

SerializationValidation()

Provide Validation capabilities for serializing/deserializing a [DataLoadingBlock] or [DataLoadingPlan].

When a developer inherits from [DataLoadingBlock] to define a custom loading block, they are required to call the _serialization_validator.update_validation_scheme function with a dictionary argument containing the rules to validate all the additional fields that will be used in the serialization of their loading block.

These rules must follow the syntax explained in the SchemeValidator class.

For example

    class MyLoadingBlock(DataLoadingBlock):
        def __init__(self):
            self.my_custom_data = {}
            self._serialization_validator.update_validation_scheme({
                'custom_data': {
                    'rules': [dict, ...any other rules],
                    'required': True
                }
            })
        def serialize(self):
            serialized = super().serialize()
            serialized.update({'custom_data': self.my_custom_data})
            return serialized

Attributes:

Name Type Description
_validation_scheme

(dict) an extensible set of rules to validate the DataLoadingBlock metadata.

Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self):
    self._validation_scheme = {}

Functions

dlb_default_scheme classmethod
dlb_default_scheme()

The dictionary of default validation rules for a serialized [DataLoadingBlock].

Source code in fedbiomed/common/data/_data_loading_plan.py
@classmethod
def dlb_default_scheme(cls) -> Dict:
    """The dictionary of default validation rules for a serialized [DataLoadingBlock]."""
    return {
        'loading_block_class': {
            'rules': [str, cls._identifier_validation_hook],
            'required': True,
        },
        'loading_block_module': {
            'rules': [str, cls._identifier_validation_hook],
            'required': True,
        },
        'dlb_id': {
            'rules': [str, cls._serial_id_validation_hook],
            'required': True,
        },
    }
dlp_default_scheme classmethod
dlp_default_scheme()

The dictionary of default validation rules for a serialized [DataLoadingPlan].

Source code in fedbiomed/common/data/_data_loading_plan.py
@classmethod
def dlp_default_scheme(cls) -> Dict:
    """The dictionary of default validation rules for a serialized [DataLoadingPlan]."""
    return {
        'dlp_id': {
            'rules': [str],
            'required': True,
        },
        'dlp_name': {
            'rules': [str],
            'required': True,
        },
        'target_dataset_type': {
            'rules': [str, cls._target_dataset_type_validator],
            'required': True,
        },
        'loading_blocks': {
            'rules': [dict, cls._loading_blocks_types_validator],
            'required': True
        },
        'key_paths': {
            'rules': [dict, cls._key_paths_validator],
            'required': True
        }
    }
update_validation_scheme
update_validation_scheme(new_scheme)

Updates the validation scheme.

Parameters:

Name Type Description Default
new_scheme dict

(dict) new dict of rules

required
Source code in fedbiomed/common/data/_data_loading_plan.py
def update_validation_scheme(self, new_scheme: dict) -> None:
    """Updates the validation scheme.

    Args:
        new_scheme: (dict) new dict of rules
    """
    self._validation_scheme.update(new_scheme)
validate
validate(dlb_metadata, exception_type, only_required=True)

Validate a dict of dlb_metadata according to the _validation_scheme.

Parameters:

Name Type Description Default
dlb_metadata dict)

the [DataLoadingBlock] metadata, as returned by serialize or as loaded from the node database.

required
exception_type Type[FedbiomedError]

the type of the exception to be raised when validation fails.

required
only_required bool)

see SchemeValidator.populate_with_defaults

True

Raises: exception_type: if the validation fails.

Source code in fedbiomed/common/data/_data_loading_plan.py
def validate(self,
             dlb_metadata: Dict,
             exception_type: Type[FedbiomedError],
             only_required: bool = True) -> None:
    """Validate a dict of dlb_metadata according to the _validation_scheme.

    Args:
        dlb_metadata (dict) : the [DataLoadingBlock] metadata, as returned by serialize or as loaded from the
            node database.
        exception_type (Type[FedbiomedError]): the type of the exception to be raised when validation fails.
        only_required (bool) : see SchemeValidator.populate_with_defaults
    Raises:
        exception_type: if the validation fails.
    """
    try:
        sc = SchemeValidator(self._validation_scheme)
    except RuleError as e:
        msg = ErrorNumbers.FB614.value + f": {e}"
        logger.critical(msg)
        raise exception_type(msg)

    try:
        dlb_metadata = sc.populate_with_defaults(dlb_metadata,
                                                 only_required=only_required)
    except ValidatorError as e:
        msg = ErrorNumbers.FB614.value + f": {e}"
        logger.critical(msg)
        raise exception_type(msg)

    try:
        sc.validate(dlb_metadata)
    except ValidateError as e:
        msg = ErrorNumbers.FB614.value + f": {e}"
        logger.critical(msg)
        raise exception_type(msg)

SkLearnDataManager

SkLearnDataManager(inputs, target, **kwargs)

Bases: object

Wrapper for pd.DataFrame, pd.Series and np.ndarray datasets.

Manages datasets for scikit-learn based model training. Responsible for managing the inputs and target variables that have been provided in the training_data method of scikit-learn based training plans.

The loader arguments will be passed to the [fedbiomed.common.data.NPDataLoader] classes instantiated when split is called. They may include batch_size, shuffle, drop_last, and others. Please see the [fedbiomed.common.data.NPDataLoader] class for more details.

Parameters:

Name Type Description Default
inputs Union[ndarray, DataFrame, Series]

Independent variables (inputs, features) for model training

required
target Union[ndarray, DataFrame, Series]

Dependent variable/s (target) for model training and validation

required
**kwargs dict

Loader arguments

{}
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def __init__(self,
             inputs: Union[np.ndarray, pd.DataFrame, pd.Series],
             target: Union[np.ndarray, pd.DataFrame, pd.Series],
             **kwargs: dict):

    """ Construct a SkLearnDataManager from an array of inputs and an array of targets.

    The loader arguments will be passed to the [fedbiomed.common.data.NPDataLoader] classes instantiated
    when split is called. They may include batch_size, shuffle, drop_last, and others. Please see the
    [fedbiomed.common.data.NPDataLoader] class for more details.

    Args:
        inputs: Independent variables (inputs, features) for model training
        target: Dependent variable/s (target) for model training and validation
        **kwargs: Loader arguments
    """

    if not isinstance(inputs, (np.ndarray, pd.DataFrame, pd.Series)) or \
            not isinstance(target, (np.ndarray, pd.DataFrame, pd.Series)):
        msg = f"{ErrorNumbers.FB609.value}. Parameters `inputs` and `target` for " \
              f"initialization of {self.__class__.__name__} should be one of np.ndarray, pd.DataFrame, pd.Series"
        logger.error(msg)
        raise FedbiomedTypeError(msg)

    # Convert pd.DataFrame or pd.Series to np.ndarray for `inputs`
    if isinstance(inputs, (pd.DataFrame, pd.Series)):
        self._inputs = inputs.to_numpy()
    else:
        self._inputs = inputs

    # Convert pd.DataFrame or pd.Series to np.ndarray for `target`
    if isinstance(target, (pd.DataFrame, pd.Series)):
        self._target = target.to_numpy()
    else:
        self._target = target

    # Additional loader arguments
    self._loader_arguments = kwargs

    # Subset None means that train/validation split has not been performed
    self._subset_test: Union[Tuple[np.ndarray, np.ndarray], None] = None
    self._subset_train: Union[Tuple[np.ndarray, np.ndarray], None] = None

Functions

dataset
dataset()

Gets the entire registered dataset.

This method returns the whole dataset as is, without any split.

Returns:

Name Type Description
inputs ndarray

Input variables for model training

targets ndarray

Target variable for model training

Source code in fedbiomed/common/data/_sklearn_data_manager.py
def dataset(self) -> Tuple[np.ndarray, np.ndarray]:
    """Gets the entire registered dataset.

    This method returns the whole dataset as is, without any split.

    Returns:
         inputs: Input variables for model training
         targets: Target variable for model training
    """
    return self._inputs, self._target
split
split(test_ratio, test_batch_size)

Splits np.ndarray dataset into train and validation.

Parameters:

Name Type Description Default
test_ratio float

Ratio for the validation set partition. The rest of the samples will be used for training

required
test_batch_size int

Batch size for the validation data loader. If None or 0, the whole validation partition is loaded as a single batch

required

Raises:

Type Description
FedbiomedSkLearnDataManagerError

If the test_ratio is not between 0 and 1

Returns:

Name Type Description
train_loader NPDataLoader

NPDataLoader for the training partition

test_loader NPDataLoader

NPDataLoader for the validation partition

Source code in fedbiomed/common/data/_sklearn_data_manager.py
def split(self, test_ratio: float, test_batch_size: int) -> Tuple[NPDataLoader, NPDataLoader]:
    """Splits `np.ndarray` dataset into train and validation.

    Args:
         test_ratio: Ratio for the validation set partition. The rest of the samples will be used for training
         test_batch_size: Batch size for the validation data loader. If None or 0, the whole validation
            partition is loaded as a single batch

    Raises:
        FedbiomedSkLearnDataManagerError: If the `test_ratio` is not between 0 and 1

    Returns:
         train_loader: NPDataLoader for the training partition
         test_loader: NPDataLoader for the validation partition
    """
    if not isinstance(test_ratio, float):
        msg = f'{ErrorNumbers.FB609.value}: The argument `ratio` should be type `float` not {type(test_ratio)}'
        logger.error(msg)
        raise FedbiomedTypeError(msg)

    if test_ratio < 0. or test_ratio > 1.:
        msg = f'{ErrorNumbers.FB609.value}: The argument `ratio` should be equal or between 0 and 1, ' \
             f'not {test_ratio}'
        logger.error(msg)
        raise FedbiomedTypeError(msg)

    empty_subset = (np.array([]), np.array([]))

    if test_ratio <= 0.:
        self._subset_train = (self._inputs, self._target)
        self._subset_test = empty_subset
    elif test_ratio >= 1.:
        self._subset_train = empty_subset
        self._subset_test = (self._inputs, self._target)
    else:
        x_train, x_test, y_train, y_test = train_test_split(self._inputs, self._target, test_size=test_ratio)
        self._subset_test = (x_test, y_test)
        self._subset_train = (x_train, y_train)

    if not test_batch_size:
        test_batch_size = len(self._subset_test)

    return self._subset_loader(self._subset_train, **self._loader_arguments), \
        self._subset_loader(self._subset_test, batch_size=test_batch_size)
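
A minimal usage sketch. Loader arguments given at construction time (here batch_size and shuffle) are forwarded to the training NPDataLoader when split is called:

import numpy as np
from fedbiomed.common.data import SkLearnDataManager

X = np.random.randn(100, 4)
y = np.random.randn(100)

data_manager = SkLearnDataManager(inputs=X, target=y, batch_size=16, shuffle=True)

train_loader, test_loader = data_manager.split(test_ratio=0.25, test_batch_size=25)
print(len(train_loader.dataset), len(test_loader.dataset))  # 75 and 25 samples
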
subset_test
subset_test()

Gets Subset of dataset for validation partition.

Returns:

Name Type Description
test_inputs ndarray

Input variables of validation subset for model validation

test_target ndarray

Target variable of validation subset for model validation

Source code in fedbiomed/common/data/_sklearn_data_manager.py
def subset_test(self) -> Tuple[np.ndarray, np.ndarray]:
    """Gets Subset of dataset for validation partition.

    Returns:
        test_inputs: Input variables of validation subset for model validation
        test_target: Target variable of validation subset for model validation
    """
    return self._subset_test
subset_train
subset_train()

Gets Subset for train partition.

Returns:

Name Type Description
train_inputs ndarray

Input variables of training subset for model training

train_target ndarray

Target variable of training subset for model training

Source code in fedbiomed/common/data/_sklearn_data_manager.py
def subset_train(self) -> Tuple[np.ndarray, np.ndarray]:

    """Gets Subset for train partition.

    Returns:
        train_inputs: Input variables of training subset for model training
        train_target: Target variable of training subset for model training
    """

    return self._subset_train

TabularDataset

TabularDataset(inputs, target)

Bases: Dataset

Torch-based Dataset object to create a torch Dataset from given numpy or dataframe input and target variables

Parameters:

Name Type Description Default
inputs Union[ndarray, DataFrame, Series]

Input variables that will be passed to network

required
target Union[ndarray, DataFrame, Series]

Target variable for output layer

required

Raises:

Type Description
FedbiomedTorchDatasetError

If input variables and target variable do not have equal length/size

Source code in fedbiomed/common/data/_tabular_dataset.py
def __init__(self,
             inputs: Union[np.ndarray, pd.DataFrame, pd.Series],
             target: Union[np.ndarray, pd.DataFrame, pd.Series]):
    """Constructs PyTorch dataset object

    Args:
        inputs: Input variables that will be passed to network
        target: Target variable for output layer

    Raises:
        FedbiomedTorchDatasetError: If input variables and target variable do not have
            equal length/size
    """

    # Inputs and target variable should be converted to the torch tensors
    # PyTorch provides `from_numpy` function to convert numpy arrays to
    # torch tensor. Therefore, if the arguments `inputs` and `target` are
    # instance one of `pd.DataFrame` or `pd.Series`, they should be converted to
    # numpy arrays
    if isinstance(inputs, (pd.DataFrame, pd.Series)):
        self.inputs = inputs.to_numpy()
    elif isinstance(inputs, np.ndarray):
        self.inputs = inputs
    else:
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB610.value}: The argument `inputs` should be "
                                                f"an instance one of np.ndarray, pd.DataFrame or pd.Series")
    # Configuring self.target attribute
    if isinstance(target, (pd.DataFrame, pd.Series)):
        self.target = target.to_numpy()
    elif isinstance(target, np.ndarray):
        self.target = target
    else:
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB610.value}: The argument `target` should be "
                                                f"an instance one of np.ndarray, pd.DataFrame or pd.Series")

    # The lengths should be equal
    if len(self.inputs) != len(self.target):
        raise FedbiomedDatasetError(f"{ErrorNumbers.FB610.value}: Length of input variables and target "
                                                f"variable does not match. Please make sure that they have "
                                                f"equal size while creating the method `training_data` of "
                                                f"TrainingPlan")

    # Convert `inputs` and `target` to Torch floats
    self.inputs = from_numpy(self.inputs).float()
    self.target = from_numpy(self.target).float()
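
A minimal usage sketch. Since TabularDataset is a torch Dataset, it can be wrapped in a standard torch DataLoader; the usual __len__/__getitem__ protocol is assumed here, as it is not reproduced in this section:

import numpy as np
from torch.utils.data import DataLoader
from fedbiomed.common.data import TabularDataset

X = np.random.randn(100, 4)
y = np.random.randint(0, 2, size=100)

dataset = TabularDataset(inputs=X, target=y)

loader = DataLoader(dataset, batch_size=32, shuffle=True)
batch_inputs, batch_target = next(iter(loader))
print(batch_inputs.shape, batch_target.shape)  # e.g. torch.Size([32, 4]) torch.Size([32])
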

Attributes

inputs instance-attribute
inputs = float()
target instance-attribute
target = float()

Functions

get_dataset_type staticmethod
get_dataset_type()
Source code in fedbiomed/common/data/_tabular_dataset.py
@staticmethod
def get_dataset_type() -> DatasetTypes:
    return DatasetTypes.TABULAR

TorchDataManager

TorchDataManager(dataset, **kwargs)

Bases: object

Wrapper for PyTorch Dataset to manage loading operations for validation and train.

Parameters:

Name Type Description Default
dataset Dataset

Dataset object for torch.utils.data.DataLoader

required
**kwargs dict

Arguments for PyTorch DataLoader

{}

Raises:

Type Description
FedbiomedTorchDataManagerError

If the argument dataset is not an instance of torch.utils.data.Dataset

Source code in fedbiomed/common/data/_torch_data_manager.py
def __init__(self, dataset: Dataset, **kwargs: dict):
    """Construct  of class

    Args:
        dataset: Dataset object for torch.utils.data.DataLoader
        **kwargs: Arguments for PyTorch `DataLoader`

    Raises:
        FedbiomedTorchDataManagerError: If the argument `dataset` is not an instance of `torch.utils.data.Dataset`
    """

    # TorchDataManager should get `dataset` argument as an instance of torch.utils.data.Dataset
    if not isinstance(dataset, Dataset):
        raise FedbiomedTorchDataManagerError(
            f"{ErrorNumbers.FB608.value}: The attribute `dataset` should an instance "
            f"of `torch.utils.data.Dataset`, please use `Dataset` as parent class for"
            f"your custom torch dataset object")

    self._dataset = dataset
    self._loader_arguments = kwargs
    self._subset_test: Union[Subset, None] = None
    self._subset_train: Union[Subset, None] = None

Attributes

dataset property
dataset

Gets dataset.

Returns:

Type Description
Dataset

PyTorch dataset instance

Functions

load_all_samples
load_all_samples()

Loading all samples as PyTorch DataLoader without splitting.

Returns:

Type Description
DataLoader

DataLoader for the entire dataset. DataLoader arguments will be retrieved from the **kwargs defined while initializing the class

Source code in fedbiomed/common/data/_torch_data_manager.py
def load_all_samples(self) -> DataLoader:
    """Loading all samples as PyTorch DataLoader without splitting.

    Returns:
        DataLoader for the entire dataset. `DataLoader` arguments will be retrieved from the `**kwargs`
            defined while initializing the class
    """
    return self._create_torch_data_loader(self._dataset, **self._loader_arguments)
split
split(test_ratio, test_batch_size)

Splitting PyTorch Dataset into train and validation.

Parameters:

Name Type Description Default
test_ratio float

Split ratio for the validation set. The rest of the samples will be used for training

required
test_batch_size Union[int, None]

Batch size for the validation DataLoader. If None or 0, the whole validation subset is loaded as a single batch

required

Raises: FedbiomedTorchDataManagerError: If the ratio is of the wrong type or not between 0 and 1

Returns:

Name Type Description
train_loader Union[DataLoader, None]

DataLoader for training subset. None if the test_ratio is 1

test_loader Union[DataLoader, None]

DataLoader for validation subset. None if the test_ratio is 0

Source code in fedbiomed/common/data/_torch_data_manager.py
def split(self, test_ratio: float, test_batch_size: Union[int, None]) -> Tuple[Union[DataLoader, None], Union[DataLoader, None]]:
    """ Splitting PyTorch Dataset into train and validation.

    Args:
         test_ratio: Split ratio for the validation set. The rest of the samples will be used for training
         test_batch_size: Batch size for the validation DataLoader. If None or 0, the whole validation
            subset is loaded as a single batch
    Raises:
        FedbiomedTorchDataManagerError: If the ratio is of the wrong type or not between 0 and 1

    Returns:
         train_loader: DataLoader for training subset. `None` if the `test_ratio` is `1`
         test_loader: DataLoader for validation subset. `None` if the `test_ratio` is `0`
    """

    # Check the argument `ratio` is of type `float`
    if not isinstance(test_ratio, (float, int)):
        raise FedbiomedTorchDataManagerError(f'{ErrorNumbers.FB608.value}: The argument `ratio` should be '
                                             f'type `float` or `int` not {type(test_ratio)}')

    # Check ratio is valid for splitting
    if test_ratio < 0 or test_ratio > 1:
        raise FedbiomedTorchDataManagerError(f'{ErrorNumbers.FB608.value}: The argument `ratio` should be '
                                             f'equal or between 0 and 1, not {test_ratio}')

    # If `Dataset` has proper data attribute
    # try to get shape from self.data
    if not hasattr(self._dataset, '__len__'):
        raise FedbiomedTorchDataManagerError(f"{ErrorNumbers.FB608.value}: Can not get number of samples from "
                                             f"{str(self._dataset)} without `__len__`.  Please make sure "
                                             f"that `__len__` method has been added to custom dataset. "
                                             f"This method should return total number of samples.")

    try:
        samples = len(self._dataset)
    except AttributeError as e:
        raise FedbiomedTorchDataManagerError(f"{ErrorNumbers.FB608.value}: Can not get number of samples from "
                                             f"{str(self._dataset)} due to undefined attribute, {str(e)}")
    except TypeError as e:
        raise FedbiomedTorchDataManagerError(f"{ErrorNumbers.FB608.value}: Can not get number of samples from "
                                             f"{str(self._dataset)}, {str(e)}")

    # Calculate number of samples for train and validation subsets
    test_samples = math.floor(samples * test_ratio)
    train_samples = samples - test_samples

    self._subset_train, self._subset_test = random_split(self._dataset, [train_samples, test_samples])

    if not test_batch_size:
        test_batch_size = len(self._subset_test)

    loaders = (self._subset_loader(self._subset_train, **self._loader_arguments),
               self._subset_loader(self._subset_test, batch_size=test_batch_size))

    return loaders
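
A minimal usage sketch with a TabularDataset as the wrapped torch Dataset; batch_size and shuffle are loader arguments forwarded to the training DataLoader:

import numpy as np
from fedbiomed.common.data import TabularDataset, TorchDataManager

X = np.random.randn(100, 4)
y = np.random.randn(100)

dataset = TabularDataset(inputs=X, target=y)
data_manager = TorchDataManager(dataset, batch_size=16, shuffle=True)

train_loader, test_loader = data_manager.split(test_ratio=0.2, test_batch_size=None)
print(len(data_manager.subset_train()), len(data_manager.subset_test()))  # 80 and 20 samples
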
subset_test
subset_test()

Gets validation subset of the dataset.

Returns:

Type Description
Subset

Validation subset

Source code in fedbiomed/common/data/_torch_data_manager.py
def subset_test(self) -> Subset:
    """Gets validation subset of the dataset.

    Returns:
        Validation subset
    """

    return self._subset_test
subset_train
subset_train()

Gets train subset of the dataset.

Returns:

Type Description
Subset

Train subset

Source code in fedbiomed/common/data/_torch_data_manager.py
def subset_train(self) -> Subset:
    """Gets train subset of the dataset.

    Returns:
        Train subset
    """
    return self._subset_train
to_sklearn
to_sklearn()

Converts PyTorch Dataset to sklearn data manager of Fed-BioMed.

Returns:

Type Description
SkLearnDataManager

Data manager to use in SkLearn base training plans

Source code in fedbiomed/common/data/_torch_data_manager.py
def to_sklearn(self) -> SkLearnDataManager:
    """Converts PyTorch `Dataset` to sklearn data manager of Fed-BioMed.

    Returns:
        Data manager to use in SkLearn base training plans
    """

    loader = self._create_torch_data_loader(self._dataset, batch_size=len(self._dataset))
    # Iterate over samples and get input variable and target variable
    inputs = next(iter(loader))[0].numpy()
    target = next(iter(loader))[1].numpy()

    return SkLearnDataManager(inputs=inputs, target=target, **self._loader_arguments)
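
Building on the data_manager from the previous sketch, the conversion hands the full dataset over to the numpy-based manager:

# Converts the wrapped torch Dataset into numpy arrays under the hood,
# reusing the loader arguments given to the TorchDataManager.
sklearn_dm = data_manager.to_sklearn()
inputs, target = sklearn_dm.dataset()
print(inputs.shape, target.shape)
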

Functions

discover_flamby_datasets

discover_flamby_datasets()

Automatically discover the available Flamby datasets based on the contents of the flamby.datasets module.

Returns:

Type Description
Dict[int, str]

A dictionary {index: dataset_name} where index is an int and dataset_name is the name of a flamby module corresponding to a dataset, represented as str. To import said module one must prepend the correct path: import flamby.datasets.dataset_name.

Source code in fedbiomed/common/data/_flamby_dataset.py
def discover_flamby_datasets() -> Dict[int, str]:
    """Automatically discover the available Flamby datasets based on the contents of the flamby.datasets module.

    Returns:
        a dictionary {index: dataset_name} where index is an int and dataset_name is the name of a flamby module
        corresponding to a dataset, represented as str. To import said module one must prepend with the correct
        path: `import flamby.datasets.dataset_name`.

    """
    dataset_list = [name for _, name, ispkg in pkgutil.iter_modules(flamby_datasets_module.__path__) if ispkg]
    return {i: name for i, name in enumerate(dataset_list)}
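
A minimal usage sketch; the resulting mapping depends on which Flamby dataset packages are installed in the environment:

from fedbiomed.common.data import discover_flamby_datasets

available = discover_flamby_datasets()
for index, name in available.items():
    # e.g. 0 -> fed_heart_disease, 1 -> fed_ixi (depends on the installation)
    print(index, name)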