Classes that simplify imports from fedbiomed.common.data
Classes
DataLoadingBlock
DataLoadingBlock()
Bases: ABC
The building blocks of a DataLoadingPlan.
A DataLoadingBlock describes an intermediary layer between the researcher and the node's filesystem. It allows the node to specify a customization in the way data is "perceived" by the data loaders during training.
A DataLoadingBlock is identified by its type_id attribute. Thus, this attribute should be unique among all DataLoadingBlockTypes in the same DataLoadingPlan. Moreover, we may test equality between a DataLoadingBlock and a string by checking its type_id, as a means of easily testing whether a DataLoadingBlock is contained in a collection.
Correct usage of this class requires creating ad-hoc subclasses. The DataLoadingBlock class is not intended to be instantiated directly.
Subclasses of DataLoadingBlock must respect the following conditions (see the sketch after this list):
- implement a default constructor
- the implemented constructor must call super().__init__()
- extend the serialize(self) and the deserialize(self, load_from: dict) functions
- both serialize and deserialize must call super's serialize and deserialize respectively
- the deserialize function must always return self
- the serialize function must update the dict returned by super's serialize
- implement an apply function that takes arbitrary arguments and applies the logic of the loading_block
- update the _validation_scheme to define rules for all new fields returned by the serialize function
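A minimal sketch of a subclass following these rules. The PrefixBlock class, its prefix field and the validation-scheme entry are illustrative, not part of Fed-BioMed; check SerializationValidation for the exact format expected by update_validation_scheme.
```python
from fedbiomed.common.data import DataLoadingBlock


class PrefixBlock(DataLoadingBlock):
    """Hypothetical block that prepends a configurable prefix to a file name."""

    def __init__(self):
        super().__init__()
        self.prefix = ""
        # Assumed scheme format ({field: {'rules': [...], 'required': ...}});
        # verify against SerializationValidation before relying on it.
        self._serialization_validator.update_validation_scheme(
            {'prefix': {'rules': [str], 'required': True}})

    def serialize(self) -> dict:
        ret = super().serialize()              # start from super's dict ...
        ret.update({'prefix': self.prefix})    # ... and update it with the new field
        return ret

    def deserialize(self, load_from: dict):
        super().deserialize(load_from)         # call super first
        self.prefix = load_from['prefix']
        return self                            # deserialize must return self

    def apply(self, filename: str) -> str:
        # The loading block's logic: prepend the configured prefix.
        return f"{self.prefix}{filename}"
```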
Attributes:
Name | Type | Description |
---|---|---|
__serialization_id | str | identifies one serialized instance of the DataLoadingBlock |
Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self):
self.__serialization_id = 'serialized_dlb_' + str(uuid.uuid4())
self._serialization_validator = SerializationValidation()
self._serialization_validator.update_validation_scheme(SerializationValidation.dlb_default_scheme())
Functions
apply abstractmethod
apply(*args, **kwargs)
Abstract method representing an application of the DataLoadingBlock
Source code in fedbiomed/common/data/_data_loading_plan.py
@abstractmethod
def apply(self, *args, **kwargs):
"""Abstract method representing an application of the DataLoadingBlock
"""
pass
deserialize
deserialize(load_from)
Reconstruct the DataLoadingBlock from a serialized version.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
load_from | dict | a dictionary as obtained by the serialize function. | required |
Returns: the self instance
Source code in fedbiomed/common/data/_data_loading_plan.py
def deserialize(self, load_from: dict) -> TDataLoadingBlock:
"""Reconstruct the DataLoadingBlock from a serialized version.
Args:
load_from (dict): a dictionary as obtained by the serialize function.
Returns:
the self instance
"""
self._serialization_validator.validate(load_from, FedbiomedLoadingBlockValueError)
self.__serialization_id = load_from['dlb_id']
return self
get_serialization_id
get_serialization_id()
Expose serialization id as read-only
Source code in fedbiomed/common/data/_data_loading_plan.py
def get_serialization_id(self):
"""Expose serialization id as read-only"""
return self.__serialization_id
instantiate_class staticmethod
instantiate_class(loading_block)
Instantiate one DataLoadingBlock object of the type defined in the arguments.
Uses the loading_block_module and loading_block_class fields of the loading_block argument to identify the type of DataLoadingBlock to be instantiated, then calls its default constructor. Note that this function does not call deserialize.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
loading_block | dict | DataLoadingBlock metadata in the format returned by the serialize function. | required |
Returns: A default-constructed instance of a DataLoadingBlock of the type defined in the metadata.
Raises: FedbiomedLoadingBlockError: if the instantiation process raised any exception.
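A short sketch of the intended round trip: because instantiate_class only default-constructs the block, deserialize must be called separately to restore its state. MapperBlock is used here purely as an illustration.
```python
from fedbiomed.common.data import DataLoadingBlock, MapperBlock

block = MapperBlock()
block.map = {'CT': 'ct_scans'}

metadata = block.serialize()                              # module, class, dlb_id (plus 'map')
restored = DataLoadingBlock.instantiate_class(metadata)   # default-constructed, state not restored
restored = restored.deserialize(metadata)                 # now the mapping is back
assert restored.apply('CT') == 'ct_scans'
```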
Source code in fedbiomed/common/data/_data_loading_plan.py
@staticmethod
def instantiate_class(loading_block: dict) -> TDataLoadingBlock:
"""Instantiate one [DataLoadingBlock][fedbiomed.common.data._data_loading_plan.DataLoadingBlock]
object of the type defined in the arguments.
Uses the `loading_block_module` and `loading_block_class` fields of the loading_block argument to
identify the type of [DataLoadingBlock][fedbiomed.common.data._data_loading_plan.DataLoadingBlock]
to be instantiated, then calls its default constructor.
Note that this function **does not call deserialize**.
Args:
loading_block (dict): [DataLoadingBlock][fedbiomed.common.data._data_loading_plan.DataLoadingBlock]
metadata in the format returned by the serialize function.
Returns:
A default-constructed instance of a
[DataLoadingBlock][fedbiomed.common.data._data_loading_plan.DataLoadingBlock]
of the type defined in the metadata.
Raises:
FedbiomedLoadingBlockError: if the instantiation process raised any exception.
"""
try:
dlb_module = import_module(loading_block['loading_block_module'])
dlb = eval(f"dlb_module.{loading_block['loading_block_class']}()")
except Exception as e:
msg = f"{ErrorNumbers.FB614.value}: could not instantiate DataLoadingBlock from the following metadata: " +\
f"{loading_block} because of {type(e).__name__}: {e}"
logger.debug(msg)
raise FedbiomedLoadingBlockError(msg)
return dlb
instantiate_key staticmethod
instantiate_key(key_module, key_classname, loading_block_key_str)
Imports and loads the DataLoadingBlockTypes key corresponding to the passed arguments
Parameters:
Name | Type | Description | Default |
---|---|---|---|
key_module | str | name of the module where the DataLoadingBlockTypes subclass is defined | required |
key_classname | str | name of the DataLoadingBlockTypes subclass | required |
loading_block_key_str | str | string value of the key to be instantiated | required |
Raises:
Type | Description |
---|---|
FedbiomedDataLoadingPlanError | if the key could not be imported and instantiated from the given module, class name and key string. |
Returns:
Name | Type | Description |
---|---|---|
DataLoadingBlockTypes | DataLoadingBlockTypes | the loading block key corresponding to loading_block_key_str. |
Source code in fedbiomed/common/data/_data_loading_plan.py
@staticmethod
def instantiate_key(key_module: str, key_classname: str, loading_block_key_str: str) -> DataLoadingBlockTypes:
"""Imports and loads [DataLoadingBlockTypes][fedbiomed.common.constants.DataLoadingBlockTypes]
regarding the passed arguments
Args:
key_module (str): _description_
key_classname (str): _description_
loading_block_key_str (str): _description_
Raises:
FedbiomedDataLoadingPlanError: _description_
Returns:
DataLoadingBlockTypes: _description_
"""
try:
keys = import_module(key_module)
loading_block_key = eval(f"keys.{key_classname}('{loading_block_key_str}')")
except Exception as e:
msg = f"{ErrorNumbers.FB615.value} Error deserializing loading block key " + \
f"{loading_block_key_str} with path {key_module}.{key_classname} " + \
f"because of {type(e).__name__}: {e}"
logger.debug(msg)
raise FedbiomedDataLoadingPlanError(msg)
return loading_block_key
serialize
serialize()
Serializes the class in a format similar to json.
Returns:
Type | Description |
---|---|
dict | a dictionary of key-value pairs sufficient for reconstructing the DataLoadingBlock. |
Source code in fedbiomed/common/data/_data_loading_plan.py
def serialize(self) -> dict:
"""Serializes the class in a format similar to json.
Returns:
a dictionary of key-value pairs sufficient for reconstructing
the DataLoadingBlock.
"""
return dict(
loading_block_class=self.__class__.__qualname__,
loading_block_module=self.__module__,
dlb_id=self.__serialization_id
)
DataLoadingPlan
DataLoadingPlan(*args, **kwargs)
Bases: Dict[DataLoadingBlockTypes, DataLoadingBlock]
Customizations to the way the data is loaded and presented for training.
A DataLoadingPlan is a dictionary of {name: DataLoadingBlock} pairs. Each DataLoadingBlock represents a customization to the way data is loaded and presented to the researcher. These customizations are defined by the node, but they operate on a Dataset class, which is defined by the library and instantiated by the researcher.
To exploit this functionality, a Dataset must be modified to accept the customizations provided by the DataLoadingPlan. To simplify this process, we provide the DataLoadingPlanMixin class below.
The DataLoadingPlan class should be instantiated directly, no subclassing is needed. The DataLoadingPlan is a dict, and exposes the same interface as a dict.
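A minimal usage sketch. MyLoadingBlockTypes is a hypothetical key enum defined only for this example (it mirrors the FlambyLoadingBlockTypes pattern documented further down); MapperBlock is documented later on this page.
```python
from enum import Enum

from fedbiomed.common.constants import DataLoadingBlockTypes
from fedbiomed.common.data import DataLoadingPlan, MapperBlock


class MyLoadingBlockTypes(DataLoadingBlockTypes, Enum):
    # Hypothetical key: values must be unique strings across all DataLoadingBlockTypes.
    MODALITIES_TO_FOLDERS = 'my_modalities_to_folders'


dlb = MapperBlock()
dlb.map = {'T1': 'T1_weighted'}

dlp = DataLoadingPlan()                                # behaves like a dict of {key: block}
dlp[MyLoadingBlockTypes.MODALITIES_TO_FOLDERS] = dlb
dlp.desc = "Rename modality folders"

serialized_dlp, serialized_blocks = dlp.serialize()
restored = DataLoadingPlan().deserialize(serialized_dlp, serialized_blocks)
```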
Attributes:
Name | Type | Description |
---|---|---|
dlp_id | str | a unique plan id (auto-generated) |
desc | str | an optional user-friendly short description |
target_dataset_type | DatasetTypes | the type of dataset targeted by this DataLoadingPlan |
Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self, *args, **kwargs):
super(DataLoadingPlan, self).__init__(*args, **kwargs)
self.dlp_id = 'dlp_' + str(uuid.uuid4())
self.desc = ""
self.target_dataset_type = DatasetTypes.NONE
self._serialization_validation = SerializationValidation()
self._serialization_validation.update_validation_scheme(SerializationValidation.dlp_default_scheme())
Attributes
desc instance-attribute
desc = ''
dlp_id instance-attribute
dlp_id = 'dlp_' + str(uuid4())
target_dataset_type instance-attribute
target_dataset_type = NONE
Functions
deserialize
deserialize(serialized_dlp, serialized_loading_blocks)
Reconstruct the DataLoadingPlan from a serialized version.
Calling this function will clear the contained DataLoadingBlockTypes.
This function may not be used to "update" nor to "append to" a DataLoadingPlan.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
serialized_dlp | dict | a dictionary of data loading plan metadata, as obtained from the first output of the serialize function | required |
serialized_loading_blocks | List[dict] | a list of dictionaries of loading_block metadata, as obtained from the second output of the serialize function | required |
Returns: the self instance
Source code in fedbiomed/common/data/_data_loading_plan.py
def deserialize(self, serialized_dlp: dict, serialized_loading_blocks: List[dict]) -> TDataLoadingPlan:
"""Reconstruct the DataLoadingPlan][fedbiomed.common.data._data_loading_plan.DataLoadingPlan] from a serialized version.
!!! warning "Calling this function will *clear* the contained [DataLoadingBlockTypes]."
This function may not be used to "update" nor to "append to"
a [DataLoadingPlan][fedbiomed.common.data._data_loading_plan.DataLoadingPlan].
Args:
serialized_dlp: a dictionary of data loading plan metadata, as obtained from the first output of the
serialize function
serialized_loading_blocks: a list of dictionaries of loading_block metadata, as obtained from the
second output of the serialize function
Returns:
the self instance
"""
self._serialization_validation.validate(serialized_dlp, FedbiomedDataLoadingPlanValueError)
self.clear()
self.dlp_id = serialized_dlp['dlp_id']
self.desc = serialized_dlp['dlp_name']
self.target_dataset_type = DatasetTypes(serialized_dlp['target_dataset_type'])
for loading_block_key_str, dlb_id in serialized_dlp['loading_blocks'].items():
key_module, key_classname = serialized_dlp['key_paths'][loading_block_key_str]
loading_block_key = DataLoadingBlock.instantiate_key(key_module, key_classname, loading_block_key_str)
loading_block = next(filter(lambda x: x['dlb_id'] == dlb_id,
serialized_loading_blocks))
dlb = DataLoadingBlock.instantiate_class(loading_block)
self[loading_block_key] = dlb.deserialize(loading_block)
return self
infer_dataset_type staticmethod
infer_dataset_type(dataset)
Infer the type of a given dataset.
This function provides the mapping between a dataset's class and the DatasetTypes enum. If the dataset exposes the correct interface (i.e. the get_dataset_type method) then it directly calls that, otherwise it tries to apply some heuristics to guess the type of dataset.
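A short sketch of the heuristic, assuming torchvision is installed; the download path is illustrative, and DatasetTypes is assumed to be importable from fedbiomed.common.constants.
```python
from torchvision.datasets import MNIST

from fedbiomed.common.constants import DatasetTypes
from fedbiomed.common.data import DataLoadingPlan

mnist = MNIST(root='/tmp/mnist', download=True)
# MNIST exposes no get_dataset_type method, so the class-name heuristic is used.
assert DataLoadingPlan.infer_dataset_type(mnist) == DatasetTypes.DEFAULT
```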
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | Any | the dataset whose type we want to infer. | required |
Returns: a DatasetTypes enum element which identifies the type of the dataset.
Raises: FedbiomedDataLoadingPlanValueError: if the dataset does not have a get_dataset_type method and moreover the type could not be guessed.
Source code in fedbiomed/common/data/_data_loading_plan.py
@staticmethod
def infer_dataset_type(dataset: Any) -> DatasetTypes:
"""Infer the type of a given dataset.
This function provides the mapping between a dataset's class and the DatasetTypes enum. If the dataset exposes
the correct interface (i.e. the get_dataset_type method) then it directly calls that, otherwise it tries to
apply some heuristics to guess the type of dataset.
Args:
dataset: the dataset whose type we want to infer.
Returns:
a DatasetTypes enum element which identifies the type of the dataset.
Raises:
FedbiomedDataLoadingPlanValueError: if the dataset does not have a `get_dataset_type` method and moreover
the type could not be guessed.
"""
if hasattr(dataset, 'get_dataset_type'):
return dataset.get_dataset_type()
elif dataset.__class__.__name__ == 'ImageFolder':
# ImageFolder could be both an images type or mednist. Try to identify mednist with some heuristic.
if hasattr(dataset, 'classes') and \
all([x in dataset.classes for x in ['AbdomenCT', 'BreastMRI', 'CXR', 'ChestCT', 'Hand', 'HeadCT']]):
return DatasetTypes.MEDNIST
else:
return DatasetTypes.IMAGES
elif dataset.__class__.__name__ == 'MNIST':
return DatasetTypes.DEFAULT
msg = f"{ErrorNumbers.FB615.value} Trying to infer dataset type of {dataset} is not supported " + \
f"for datasets of type {dataset.__class__.__qualname__}"
logger.debug(msg)
raise FedbiomedDataLoadingPlanValueError(msg)
serialize
serialize()
Serializes the class in a format similar to json.
Returns:
Type | Description |
---|---|
Tuple[dict, List] | a tuple sufficient for reconstructing the DataLoadingPlan. It includes: a dictionary of key-value pairs with the DataLoadingPlan parameters, and a list of dicts containing the data for reconstructing all the DataLoadingBlocks of the DataLoadingPlan. |
Source code in fedbiomed/common/data/_data_loading_plan.py
def serialize(self) -> Tuple[dict, List]:
"""Serializes the class in a format similar to json.
Returns:
a tuple sufficient for reconstructing the DataLoading plan. It includes:
- a dictionary of key-value pairs with the
[DataLoadingPlan][fedbiomed.common.data._data_loading_plan.DataLoadingPlan] parameters.
- a list of dict containing the data for reconstruction all the DataLoadingBlock
of the [DataLoadingPlan][fedbiomed.common.data._data_loading_plan.DataLoadingPlan]
"""
return dict(
dlp_id=self.dlp_id,
dlp_name=self.desc,
target_dataset_type=self.target_dataset_type.value,
loading_blocks={key.value: dlb.get_serialization_id() for key, dlb in self.items()},
key_paths={key.value: (f"{key.__module__}", f"{key.__class__.__qualname__}") for key in self.keys()}
), [dlb.serialize() for dlb in self.values()]
DataLoadingPlanMixin
DataLoadingPlanMixin()
Utility class to enable DLP functionality in a dataset.
Any Dataset class that inherits from DataLoadingPlanMixin will have the basic tools necessary to support a DataLoadingPlan. Typically, the logic of each specific DataLoadingBlock in the DataLoadingPlan will be implemented in the form of hooks that are called within the Dataset's implementation using the helper function apply_dlb defined below.
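A minimal sketch of a Dataset using the mixin. MyDataset is hypothetical, and MyLoadingBlockTypes refers to the hypothetical key enum from the DataLoadingPlan example above; the loading block is invoked through apply_dlb so that samples pass through unchanged when no DataLoadingPlan is set.
```python
from torch.utils.data import Dataset

from fedbiomed.common.data import DataLoadingPlanMixin


class MyDataset(DataLoadingPlanMixin, Dataset):
    def __init__(self, samples):
        super().__init__()          # initializes self._dlp to None via the mixin
        self._samples = samples

    def __len__(self):
        return len(self._samples)

    def __getitem__(self, idx):
        name = self._samples[idx]
        # Returns the block's output, or `name` unchanged when no DLP is set
        # or the key is absent. MyLoadingBlockTypes is the hypothetical enum above.
        return self.apply_dlb(name, MyLoadingBlockTypes.MODALITIES_TO_FOLDERS, name)
```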
Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self):
self._dlp = None
Functions
apply_dlb
apply_dlb(default_ret_value, dlb_key, *args, **kwargs)
Apply one DataLoadingBlock identified by its key.
Note that we want to easily support the case where the DataLoadingPlan is not activated, or the requested loading block is not contained in the DataLoadingPlan. This is achieved by providing a default return value to be returned when the above conditions are met. Hence, most of the calls to apply_dlb will look like this:
value = self.apply_dlb(value, 'my-loading-block', my_apply_args)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
default_ret_value | Any | the value to be returned in case that the dlp functionality is not required | required |
dlb_key | DataLoadingBlockTypes | the key of the DataLoadingBlock to be applied | required |
*args | Optional[Any] | forwarded to the DataLoadingBlock's apply function | () |
**kwargs | Optional[Any] | forwarded to the DataLoadingBlock's apply function | {} |
Returns: the output of the DataLoadingBlock's apply function, or the default_ret_value when dlp is None or it does not contain the requested loading block
Source code in fedbiomed/common/data/_data_loading_plan.py
def apply_dlb(self, default_ret_value: Any, dlb_key: DataLoadingBlockTypes,
*args: Optional[Any], **kwargs: Optional[Any]) -> Any:
"""Apply one DataLoadingBlock identified by its key.
Note that we want to easily support the case where the DataLoadingPlan
is not activated, or the requested loading block is not contained in the
DataLoadingPlan. This is achieved by providing a default return value
to be returned when the above conditions are met. Hence, most of the
calls to apply_dlb will look like this:
```
value = self.apply_dlb(value, 'my-loading-block', my_apply_args)
```
This will ensure that value is not changed if the DataLoadingPlan is
not active.
Args:
default_ret_value: the value to be returned in case that the dlp
functionality is not required
dlb_key: the key of the DataLoadingBlock to be applied
*args: forwarded to the DataLoadingBlock's apply function
**kwargs: forwarded to the DataLoadingBlock's apply function
Returns:
the output of the DataLoadingBlock's apply function, or
the default_ret_value when dlp is None or it does not contain
the requested loading block
"""
if not isinstance(dlb_key, DataLoadingBlockTypes):
raise FedbiomedDataLoadingPlanValueError(f"Key {dlb_key} is not of enum type DataLoadingBlockTypes"
f" in DataLoadingPlanMixin.apply_dlb")
if self._dlp is not None and dlb_key in self._dlp:
return self._dlp[dlb_key].apply(*args, **kwargs)
else:
return default_ret_value
clear_dlp
clear_dlp()
Clears the Data Loading Plan by resetting it to None.
Source code in fedbiomed/common/data/_data_loading_plan.py
def clear_dlp(self):
self._dlp = None
set_dlp
set_dlp(dlp)
Sets the dlp if the target dataset type is appropriate
Source code in fedbiomed/common/data/_data_loading_plan.py
def set_dlp(self, dlp: DataLoadingPlan):
"""Sets the dlp if the target dataset type is appropriate"""
if not isinstance(dlp, DataLoadingPlan):
msg = f"{ErrorNumbers.FB615.value} Trying to set a DataLoadingPlan but the argument is of type " + \
f"{type(dlp).__name__}"
logger.debug(msg)
raise FedbiomedDataLoadingPlanValueError(msg)
dataset_type = DataLoadingPlan.infer_dataset_type(self) # `self` here will refer to the Dataset instance
if dlp.target_dataset_type != DatasetTypes.NONE and dataset_type != dlp.target_dataset_type:
raise FedbiomedDataLoadingPlanValueError(f"Trying to set {dlp} on dataset of type {dataset_type.value} but "
f"the target type is {dlp.target_dataset_type}")
elif dlp.target_dataset_type == DatasetTypes.NONE:
dlp.target_dataset_type = dataset_type
self._dlp = dlp
DataManager
DataManager(dataset, target=None, **kwargs)
Bases: object
Factory class that builds different data loaders/datasets based on the type of the dataset. The argument dataset should be provided as a torch.utils.data.Dataset object to be used in PyTorch training.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | Union[ndarray, DataFrame, Series, Dataset] | Dataset object. It can be an instance, PyTorch Dataset or Tuple. | required |
target | Union[ndarray, DataFrame, Series] | Target variable or variables. | None |
**kwargs | dict | Additional parameters that are going to be used for data loader | {} |
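Two hedged construction sketches, one per supported input style; the array shapes and the loader arguments (batch_size, shuffle) are illustrative.
```python
import numpy as np

from fedbiomed.common.data import DataManager

# Tabular data: array-like inputs and target, plus keyword loader arguments.
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)
tabular_dm = DataManager(dataset=X, target=y, batch_size=16, shuffle=True)

# Image data: a torch.utils.data.Dataset instance, no separate target.
# image_dm = DataManager(dataset=my_torch_dataset, batch_size=16)
```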
Source code in fedbiomed/common/data/_data_manager.py
def __init__(self,
dataset: Union[np.ndarray, pd.DataFrame, pd.Series, Dataset],
target: Union[np.ndarray, pd.DataFrame, pd.Series] = None,
**kwargs: dict) -> None:
"""Constructor of DataManager,
Args:
dataset: Dataset object. It can be an instance, PyTorch Dataset or Tuple.
target: Target variable or variables.
**kwargs: Additional parameters that are going to be used for data loader
"""
# TODO: Improve datamanager for auto loading by given dataset_path and other information
# such as inputs variable indexes and target variables indexes
self._dataset = dataset
self._target = target
self._loader_arguments: Dict = kwargs
self._data_manager_instance = None
Functions
extend_loader_args
extend_loader_args(extension)
Extends the class's loader arguments
Extends the class's _loader_arguments attribute with additional key-value pairs from the extension argument. If a key already exists in _loader_arguments, it is not replaced.
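A small sketch of the non-overriding semantics (the values are illustrative):
```python
import numpy as np

from fedbiomed.common.data import DataManager

dm = DataManager(dataset=np.random.rand(10, 2), target=np.zeros(10), batch_size=16)
dm.extend_loader_args({'batch_size': 64, 'shuffle': True})
# 'batch_size' stays at 16 because it already existed; 'shuffle' is added.
```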
Parameters:
Name | Type | Description | Default |
---|---|---|---|
extension | Dict | the mapping used to extend the loader arguments | required |
Source code in fedbiomed/common/data/_data_manager.py
def extend_loader_args(self, extension: Dict):
"""Extends the class' loader arguments
Extends the class's `_loader_arguments` attribute with additional key-values from
the `extension` argument. If a key already exists in the `_loader_arguments`, then
it is not replaced.
Args:
extension: the mapping used to extend the loader arguments
"""
self._loader_arguments.update(
{key: value for key, value in extension.items() if key not in self._loader_arguments}
)
load
load(tp_type)
Loads the proper DataManager based on the given TrainingPlan type and the dataset and target attributes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tp_type | TrainingPlans | Enumeration instance of TrainingPlans that stands for type of training plan. | required |
Raises:
Type | Description |
---|---|
FedbiomedDataManagerError | If requested DataManager does not match with given arguments. |
Source code in fedbiomed/common/data/_data_manager.py
def load(self, tp_type: TrainingPlans):
"""Loads proper DataManager based on given TrainingPlan and
`dataset`, `target` attributes.
Args:
tp_type: Enumeration instance of TrainingPlans that stands for type of training plan.
Raises:
FedbiomedDataManagerError: If requested DataManager does not match with given arguments.
"""
# Training plan is type of TorcTrainingPlan
if tp_type == TrainingPlans.TorchTrainingPlan:
if self._target is None and isinstance(self._dataset, Dataset):
# Create Dataset for pytorch
self._data_manager_instance = TorchDataManager(dataset=self._dataset, **self._loader_arguments)
elif isinstance(self._dataset, (pd.DataFrame, pd.Series, np.ndarray)) and \
isinstance(self._target, (pd.DataFrame, pd.Series, np.ndarray)):
# If `dataset` and `target` attributes are array-like object
# create TabularDataset object to instantiate a TorchDataManager
torch_dataset = TabularDataset(inputs=self._dataset, target=self._target)
self._data_manager_instance = TorchDataManager(dataset=torch_dataset, **self._loader_arguments)
else:
raise FedbiomedDataManagerError(f"{ErrorNumbers.FB607.value}: Invalid arguments for torch based "
f"training plan, either provide the argument `dataset` as PyTorch "
f"Dataset instance, or provide `dataset` and `target` arguments as "
f"an instance one of pd.DataFrame, pd.Series or np.ndarray ")
elif tp_type == TrainingPlans.SkLearnTrainingPlan:
# Try to convert `torch.utils.Data.Dataset` to SkLearnBased dataset/datamanager
if self._target is None and isinstance(self._dataset, Dataset):
torch_data_manager = TorchDataManager(dataset=self._dataset)
try:
self._data_manager_instance = torch_data_manager.to_sklearn()
except Exception as e:
raise FedbiomedDataManagerError(f"{ErrorNumbers.FB607.value}: PyTorch based `Dataset` object "
"has been instantiated with DataManager. An error occurred while"
"trying to convert torch.utils.data.Dataset to numpy based "
f"dataset: {str(e)}")
# For scikit-learn based training plans, the arguments `dataset` and `target` should be an instance
# one of `pd.DataFrame`, `pd.Series`, `np.ndarray`
elif isinstance(self._dataset, (pd.DataFrame, pd.Series, np.ndarray)) and \
isinstance(self._target, (pd.DataFrame, pd.Series, np.ndarray)):
# Create Dataset for SkLearn training plans
self._data_manager_instance = SkLearnDataManager(inputs=self._dataset, target=self._target,
**self._loader_arguments)
else:
raise FedbiomedDataManagerError(f"{ErrorNumbers.FB607.value}: The argument `dataset` and `target` "
f"should be instance of pd.DataFrame, pd.Series or np.ndarray ")
else:
raise FedbiomedDataManagerError(f"{ErrorNumbers.FB607.value}: Undefined training plan")
FlambyDataset
FlambyDataset()
Bases: DataLoadingPlanMixin, Dataset
A federated Flamby dataset.
A FlambyDataset is a wrapper around a flamby FedClass instance, adding functionalities and interfaces that are specific to Fed-BioMed.
A FlambyDataset is always created in an empty state, and it requires a DataLoadingPlan to be finalized to a correct state. The DataLoadingPlan must contain at least the following DataLoadingBlock key-value pair:
- FlambyLoadingBlockTypes.FLAMBY_DATASET_METADATA: FlambyDatasetMetadataBlock
The lifecycle of the DataLoadingPlan and the wrapped FedClass are tightly interlinked: when the DataLoadingPlan is set, the wrapped FedClass is initialized and instantiated. When the DataLoadingPlan is cleared, the wrapped FedClass is also cleared. Hence, an invariant of this class is that the self._dlp and self.__flamby_fed_class should always be either both None, or both set to some value.
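A hedged sketch of this lifecycle. The dataset name 'fed_ixi' and center id 0 are illustrative and require the corresponding flamby dataset to be installed locally; FlambyDatasetMetadataBlock is assumed to be exposed by fedbiomed.common.data (it is defined in fedbiomed/common/data/_flamby_dataset.py).
```python
from fedbiomed.common.constants import FlambyLoadingBlockTypes
from fedbiomed.common.data import DataLoadingPlan, FlambyDataset, FlambyDatasetMetadataBlock

metadata_block = FlambyDatasetMetadataBlock()
metadata_block.metadata = {'flamby_dataset_name': 'fed_ixi',   # illustrative values
                           'flamby_center_id': 0}

dlp = DataLoadingPlan({FlambyLoadingBlockTypes.FLAMBY_DATASET_METADATA: metadata_block})

dataset = FlambyDataset()       # created in an empty state
dataset.set_dlp(dlp)            # initializes and instantiates the wrapped FedClass
center = dataset.get_center_id()
```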
Attributes:
Name | Type | Description |
---|---|---|
_transform | | a transform function of type MonaiTransform or TorchTransform that will be applied to every sample when data is loaded. |
__flamby_fed_class | | a private instance of the wrapped Flamby FedClass |
Source code in fedbiomed/common/data/_flamby_dataset.py
def __init__(self):
super().__init__()
self.__flamby_fed_class = None
self._transform = None
Functions
clear_dlp
clear_dlp()
Clears dlp and automatically clears the FedClass
Tries to guarantee some semblance of integrity by also clearing the FedClass, since setting the dlp initializes it.
Source code in fedbiomed/common/data/_flamby_dataset.py
def clear_dlp(self):
"""Clears dlp and automatically clears the FedClass
Tries to guarantee some semblance of integrity by also clearing the FedClass, since setting the dlp
initializes it.
"""
super().clear_dlp()
self._clear()
get_center_id
get_center_id()
Returns the center id. Requires that the DataLoadingPlan has already been set.
Returns:
Type | Description |
---|---|
int | the center id (int). |
Raises: FedbiomedDatasetError: in one of the two scenarios below:
- if the data loading plan is not set or is malformed.
- if the wrapped FedClass is not initialized but the dlp exists.
Source code in fedbiomed/common/data/_flamby_dataset.py
@_check_fed_class_initialization_status(require_initialized=True,
require_uninitialized=False,
message="Flamby dataset is in an inconsistent state: a Data Loading Plan "
"is set but the wrapped FedClass was not initialized.")
@_requires_dlp
def get_center_id(self) -> int:
"""Returns the center id. Requires that the DataLoadingPlan has already been set.
Returns:
the center id (int).
Raises:
FedbiomedDatasetError: in one of the two scenarios below
- if the data loading plan is not set or is malformed.
- if the wrapped FedClass is not initialized but the dlp exists
"""
return self.apply_dlb(None, FlambyLoadingBlockTypes.FLAMBY_DATASET_METADATA)['flamby_center_id']
get_dataset_type staticmethod
get_dataset_type()
Returns the Flamby DatasetType
Source code in fedbiomed/common/data/_flamby_dataset.py
@staticmethod
def get_dataset_type() -> DatasetTypes:
"""Returns the Flamby DatasetType"""
return DatasetTypes.FLAMBY
get_flamby_fed_class
get_flamby_fed_class()
Returns the instance of the wrapped Flamby FedClass
Source code in fedbiomed/common/data/_flamby_dataset.py
def get_flamby_fed_class(self):
"""Returns the instance of the wrapped Flamby FedClass"""
return self.__flamby_fed_class
get_transform
get_transform()
Gets the transform attribute
Source code in fedbiomed/common/data/_flamby_dataset.py
def get_transform(self):
"""Gets the transform attribute"""
return self._transform
init_transform
init_transform(transform)
Initializes the transform attribute. Must be called before initialization of the wrapped FedClass.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
transform | Union[Compose, Compose] | a composed transform of type torchvision.transforms.Compose or monai.transforms.Compose | required |
Raises:
Type | Description |
---|---|
FedbiomedDatasetError | if the wrapped FedClass was already initialized. |
FedbiomedDatasetValueError | if the input is not of the correct type. |
Source code in fedbiomed/common/data/_flamby_dataset.py
@_check_fed_class_initialization_status(require_initialized=False,
require_uninitialized=True,
message="Calling init_transform is not allowed if the wrapped FedClass "
"has already been initialized. At your own risk, you may call "
"clear_dlp to reset the full FlambyDataset")
def init_transform(self, transform: Union[MonaiCompose, TorchCompose]) -> Union[MonaiCompose, TorchCompose]:
"""Initializes the transform attribute. Must be called before initialization of the wrapped FedClass.
Arguments:
transform: a composed transform of type torchvision.transforms.Compose or monai.transforms.Compose
Raises:
FedbiomedDatasetError: if the wrapped FedClass was already initialized.
FedbiomedDatasetValueError: if the input is not of the correct type.
"""
if not isinstance(transform, (MonaiCompose, TorchCompose)):
msg = f"{ErrorNumbers.FB618.value}. FlambyDataset transform must be of type " \
f"torchvision.transforms.Compose or monai.transforms.Compose"
logger.critical(msg)
raise FedbiomedDatasetValueError(msg)
self._transform = transform
return self._transform
set_dlp
set_dlp(dlp)
Sets the Data Loading Plan and ensures that the flamby_fed_class is initialized
Overrides the set_dlp function from the DataLoadingPlanMixin to make sure that self._init_flamby_fed_class is also called immediately after.
Source code in fedbiomed/common/data/_flamby_dataset.py
def set_dlp(self, dlp):
"""Sets the Data Loading Plan and ensures that the flamby_fed_class is initialized
Overrides the set_dlp function from the DataLoadingPlanMixin to make sure that self._init_flamby_fed_class
is also called immediately after.
"""
super().set_dlp(dlp)
try:
self._init_flamby_fed_class()
except FedbiomedDatasetError as e:
# clean up
super().clear_dlp()
raise FedbiomedDatasetError from e
shape
shape()
Returns the shape of the flamby_fed_class
Source code in fedbiomed/common/data/_flamby_dataset.py
@_check_fed_class_initialization_status(require_initialized=True,
require_uninitialized=False,
message="Cannot compute shape because FedClass was not initialized.")
def shape(self) -> List[int]:
"""Returns the shape of the flamby_fed_class"""
return [len(self)] + list(self.__getitem__(0)[0].shape)
FlambyDatasetMetadataBlock
FlambyDatasetMetadataBlock()
Bases: DataLoadingBlock
Metadata about a Flamby Dataset.
Includes information on:
- the identity of the type of flamby dataset (e.g. fed_ixi, fed_heart, etc.)
- the ID of the center of the flamby dataset
Source code in fedbiomed/common/data/_flamby_dataset.py
def __init__(self):
super().__init__()
self.metadata = {
"flamby_dataset_name": None,
"flamby_center_id": None
}
self._serialization_validator.update_validation_scheme(
FlambyDatasetMetadataBlock._extra_validation_scheme())
Attributes
metadata instance-attribute
metadata = {'flamby_dataset_name': None, 'flamby_center_id': None}
Functions
apply
apply()
Returns a dictionary of dataset metadata.
The metadata dictionary contains:
- flamby_dataset_name: (str) the name of the selected flamby dataset.
- flamby_center_id: (int) the center id selected at dataset add time.
Note that the flamby_dataset_name will be the same as the module name required to instantiate the FedClass. However, it will not contain the full module path, hence to properly import this module it must be prepended with flamby.datasets, for example import flamby.datasets.flamby_dataset_name.
Returns:
Type | Description |
---|---|
dict | this data loading block's metadata |
Source code in fedbiomed/common/data/_flamby_dataset.py
def apply(self) -> dict:
"""Returns a dictionary of dataset metadata.
The metadata dictionary contains:
- flamby_dataset_name: (str) the name of the selected flamby dataset.
- flamby_center_id: (int) the center id selected at dataset add time.
Note that the flamby_dataset_name will be the same as the module name required to instantiate the FedClass.
However, it will not contain the full module path, hence to properly import this module it must be
prepended with `flamby.datasets`, for example `import flamby.datasets.flamby_dataset_name`
Returns:
this data loading block's metadata
"""
if any([v is None for v in self.metadata.values()]):
msg = f"{ErrorNumbers.FB316}. Attempting to read Flamby dataset metadata, but " \
f"the {[k for k,v in self.metadata.items() if v is None]} keys were not previously set."
logger.critical(msg)
raise FedbiomedLoadingBlockError(msg)
return self.metadata
deserialize
deserialize(load_from)
Reconstruct the DataLoadingBlock from a serialized version.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
load_from | dict | a dictionary as obtained by the serialize function. | required |
Returns: the self instance
Source code in fedbiomed/common/data/_flamby_dataset.py
def deserialize(self, load_from: dict) -> DataLoadingBlock:
"""Reconstruct the DataLoadingBlock from a serialized version.
Args:
load_from: a dictionary as obtained by the serialize function.
Returns:
the self instance
"""
super().deserialize(load_from)
self.metadata['flamby_dataset_name'] = load_from['flamby_dataset_name']
self.metadata['flamby_center_id'] = load_from['flamby_center_id']
return self
serialize
serialize()
Serializes the class in a format similar to json.
Returns:
Type | Description |
---|---|
dict | a dictionary of key-value pairs sufficient for reconstructing the DataLoadingBlock. |
Source code in fedbiomed/common/data/_flamby_dataset.py
def serialize(self) -> dict:
"""Serializes the class in a format similar to json.
Returns:
a dictionary of key-value pairs sufficient for reconstructing
the DataLoadingBlock.
"""
ret = super().serialize()
ret.update({'flamby_dataset_name': self.metadata['flamby_dataset_name'],
'flamby_center_id': self.metadata['flamby_center_id']
})
return ret
FlambyLoadingBlockTypes
FlambyLoadingBlockTypes(*args)
Bases: DataLoadingBlockTypes, Enum
Additional DataLoadingBlockTypes specific to Flamby data
Source code in fedbiomed/common/constants.py
def __init__(self, *args):
cls = self.__class__
if not isinstance(self.value, str):
raise ValueError("all fields of DataLoadingBlockTypes subclasses"
" must be of str type")
if any(self.value == e.value for e in cls):
a = self.name
e = cls(self.value).name
raise ValueError(
f"duplicate values not allowed in DataLoadingBlockTypes and "
f"its subclasses: {a} --> {e}")
Attributes
FLAMBY_DATASET_METADATA class-attribute
instance-attribute
FLAMBY_DATASET_METADATA = 'flamby_dataset_metadata'
MapperBlock
MapperBlock()
Bases: DataLoadingBlock
A DataLoadingBlock for mapping values.
This DataLoadingBlock can be used whenever an "indirect mapping" is needed. For example, it can be used to implement a correspondence between a set of "logical" abstract names and a set of folder names on the filesystem.
The apply function of this DataLoadingBlock takes a "key" as input (a str) and returns the mapped value corresponding to map[key]. Note that while the constructor of this class sets a value for type_id, developers are recommended to set a more meaningful value that better speaks to their application.
Multiple instances of this loading_block may be used in the same DataLoadingPlan, provided that they are given different type_id via the constructor.
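A small sketch of the "indirect mapping" use case described above; the folder names are illustrative.
```python
from fedbiomed.common.data import MapperBlock

modalities_to_folders = MapperBlock()
modalities_to_folders.map = {'T1': 'T1_weighted_images', 'label': 'segmentations'}

modalities_to_folders.apply('T1')      # returns 'T1_weighted_images'
# An unknown key raises FedbiomedLoadingBlockError.
```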
Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self):
super(MapperBlock, self).__init__()
self.map = {}
self._serialization_validator.update_validation_scheme(MapperBlock._extra_validation_scheme())
Attributes
map instance-attribute
map = {}
Functions
apply
apply(key)
Returns the value mapped to the key, if it exists.
Raises:
Type | Description |
---|---|
FedbiomedLoadingBlockError | if map is not a dict or the key does not exist. |
Source code in fedbiomed/common/data/_data_loading_plan.py
def apply(self, key):
"""Returns the value mapped to the key, if it exists.
Raises:
FedbiomedLoadingBlockError: if map is not a dict or the key does not exist.
"""
if not isinstance(self.map, dict) or key not in self.map:
msg = f"{ErrorNumbers.FB614.value} Mapper block error: no key '{key}' in mapping dictionary"
logger.debug(msg)
raise FedbiomedLoadingBlockError(msg)
return self.map[key]
deserialize
deserialize(load_from)
Reconstruct the DataLoadingBlock from a serialized version.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
load_from | dict | a dictionary as obtained by the serialize function. | required |
Returns: the self instance
Source code in fedbiomed/common/data/_data_loading_plan.py
def deserialize(self, load_from: dict) -> DataLoadingBlock:
"""Reconstruct the [DataLoadingBlock][fedbiomed.common.data._data_loading_plan.DataLoadingBlock]
from a serialized version.
Args:
load_from (dict): a dictionary as obtained by the serialize function.
Returns:
the self instance
"""
super(MapperBlock, self).deserialize(load_from)
self.map = load_from['map']
return self
serialize
serialize()
Serializes the class in a format similar to json.
Returns:
Type | Description |
---|---|
dict | a dictionary of key-value pairs sufficient for reconstructing the DataLoadingBlock. |
Source code in fedbiomed/common/data/_data_loading_plan.py
def serialize(self) -> dict:
"""Serializes the class in a format similar to json.
Returns:
a dictionary of key-value pairs sufficient for reconstructing
the [DataLoadingBlock][fedbiomed.common.data._data_loading_plan.DataLoadingBlock].
"""
ret = super(MapperBlock, self).serialize()
ret.update({'map': self.map})
return ret
MedicalFolderBase
MedicalFolderBase(root=None)
Bases: DataLoadingPlanMixin
Controller class for Medical Folder dataset.
Contains methods to validate the MedicalFolder folder hierarchy and extract folder-based metadata information such as modalities, number of subjects, etc.
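A hedged inspection sketch; the root path is illustrative and must follow the root/<subjects>/<modalities> structure.
```python
from fedbiomed.common.data import MedicalFolderBase

base = MedicalFolderBase(root='/data/medical_folder')      # illustrative path

subjects = base.subjects_with_imaging_data_folders()
modalities, modality_folders = base.modalities()
complete = base.complete_subjects(subjects, modalities)
```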
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root | Union[str, Path, None] | path to Medical Folder root folder. | None |
Source code in fedbiomed/common/data/_medical_datasets.py
def __init__(self, root: Union[str, Path, None] = None):
"""Constructs MedicalFolderBase
Args:
root: path to Medical Folder root folder.
"""
super(MedicalFolderBase, self).__init__()
if root is not None:
root = self.validate_MedicalFolder_root_folder(root)
self._root = root
Attributes
default_modality_names class-attribute
instance-attribute
default_modality_names = ['T1', 'T2', 'label']
root property
writable
root
Root property of MedicalFolderController
Functions
available_subjects
available_subjects(subjects_from_index, subjects_from_folder=None)
Checks missing subject folders and missing entries in demographics
Parameters:
Name | Type | Description | Default |
---|---|---|---|
subjects_from_index | Union[list, Series] | Given subject folder names in demographics | required |
subjects_from_folder | list | List of subject folder names to get intersection of given subject_from_index | None |
Returns:
Name | Type | Description |
---|---|---|
available_subjects | list[str] | subjects that have an imaging data folder and are also present in the demographics file |
missing_subject_folders | list[str] | subjects that are in the demographics file but do not have an imaging data folder |
missing_entries | list[str] | subjects that have an imaging data folder but are not present in the demographics file |
Source code in fedbiomed/common/data/_medical_datasets.py
def available_subjects(self,
subjects_from_index: Union[list, pd.Series],
subjects_from_folder: list = None) -> tuple[list[str], list[str], list[str]]:
"""Checks missing subject folders and missing entries in demographics
Args:
subjects_from_index: Given subject folder names in demographics
subjects_from_folder: List of subject folder names to get intersection of given subject_from_index
Returns:
available_subjects: subjects that have an imaging data folder and are also present in the demographics file
missing_subject_folders: subjects that are in the demographics file but do not have an imaging data folder
missing_entries: subjects that have an imaging data folder but are not present in the demographics file
"""
# Select all subject folders if it is not given
if subjects_from_folder is None:
subjects_from_folder = self.subjects_with_imaging_data_folders()
# Missing subject that will cause warnings
missing_subject_folders = list(set(subjects_from_index) - set(subjects_from_folder))
# Missing entries that will cause errors
missing_entries = list(set(subjects_from_folder) - set(subjects_from_index))
# Intersection
available_subjects = list(set(subjects_from_index).intersection(set(subjects_from_folder)))
return available_subjects, missing_subject_folders, missing_entries
complete_subjects
complete_subjects(subjects, modalities)
Retrieves subjects that have given all the modalities.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
subjects | List[str] | List of subject folder names | required |
modalities | List[str] | List of required modalities | required |
Returns:
Type | Description |
---|---|
List[str] | List of subject folder names that have required modalities |
Source code in fedbiomed/common/data/_medical_datasets.py
def complete_subjects(self, subjects: List[str], modalities: List[str]) -> List[str]:
"""Retrieves subjects that have given all the modalities.
Args:
subjects: List of subject folder names
modalities: List of required modalities
Returns:
List of subject folder names that have required modalities
"""
return [subject for subject in subjects if all(self.is_modalities_existing(subject, modalities))]
demographics_column_names staticmethod
demographics_column_names(path)
Source code in fedbiomed/common/data/_medical_datasets.py
@staticmethod
def demographics_column_names(path: Union[str, Path]):
return MedicalFolderBase.read_demographics(path).columns.values
get_dataset_type staticmethod
get_dataset_type()
Source code in fedbiomed/common/data/_medical_datasets.py
@staticmethod
def get_dataset_type() -> DatasetTypes:
return DatasetTypes.MEDICAL_FOLDER
is_modalities_existing
is_modalities_existing(subject, modalities)
Checks whether given modalities exist in the subject directory
Parameters:
Name | Type | Description | Default |
---|---|---|---|
subject | str | Subject ID or subject folder name | required |
modalities | List[str] | List of modalities to check | required |
Returns:
Type | Description |
---|---|
List[bool] | List of bool values indicating, for each modality, whether it exists in the subject directory. |
Raises:
Type | Description |
---|---|
FedbiomedDatasetError | bad argument type |
Source code in fedbiomed/common/data/_medical_datasets.py
def is_modalities_existing(self, subject: str, modalities: List[str]) -> List[bool]:
"""Checks whether given modalities exists in the subject directory
Args:
subject: Subject ID or subject folder name
modalities: List of modalities to check
Returns:
List of `bool` that represents whether modality is existing respectively for each of modality.
Raises:
FedbiomedDatasetError: bad argument type
"""
if not isinstance(subject, str):
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Expected string for subject folder/ID, "
f"but got {type(subject)}")
if not isinstance(modalities, list):
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Expected a list for modalities, "
f"but got {type(modalities)}")
if not all([type(m) is str for m in modalities]):
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Expected a list of string for modalities, "
f"but some modalities are "
f"{' '.join([ str(type(m) for m in modalities if type(m) != str)])}")
are_modalities_existing = list()
for modality in modalities:
modality_folder = self._subject_modality_folder(subject, modality)
are_modalities_existing.append(bool(modality_folder) and
self._root.joinpath(subject, modality_folder).is_dir())
return are_modalities_existing
modalities
modalities()
Gets all modalities based either on all possible candidates or those provided by the DataLoadingPlan.
Returns:
Type | Description |
---|---|
list | List of unique available modalities |
list | List of all encountered modality folders in each subject folder, appearing once per folder |
Source code in fedbiomed/common/data/_medical_datasets.py
def modalities(self) -> Tuple[list, list]:
"""Gets all modalities based either on all possible candidates or those provided by the DataLoadingPlan.
Returns:
List of unique available modalities
List of all encountered modality folders in each subject folder, appearing once per folder
"""
modality_candidates, modality_folders_list = self.modalities_candidates_from_subfolders()
if self._dlp is not None and MedicalFolderLoadingBlockTypes.MODALITIES_TO_FOLDERS in self._dlp:
modalities = list(self._dlp[MedicalFolderLoadingBlockTypes.MODALITIES_TO_FOLDERS].map.keys())
return modalities, modality_folders_list
else:
return modality_candidates, modality_folders_list
modalities_candidates_from_subfolders
modalities_candidates_from_subfolders()
Gets all possible modality folders under root directory
Returns:
Type | Description |
---|---|
list | List of unique available modality folders appearing at least once |
list | List of all encountered modality folders in each subject folder, appearing once per folder |
Source code in fedbiomed/common/data/_medical_datasets.py
def modalities_candidates_from_subfolders(self) -> Tuple[list, list]:
""" Gets all possible modality folders under root directory
Returns:
List of unique available modality folders appearing at least once
List of all encountered modality folders in each subject folder, appearing once per folder
"""
# Accept only folders that don't start with "." and "_"
modalities = [f.name for f in self._root.glob("*/*") if f.is_dir() and not f.name.startswith((".", "_"))]
return sorted(list(set(modalities))), modalities
read_demographics staticmethod
read_demographics(path, index_col=None)
Read demographics tabular file for Medical Folder dataset
Raises:
Type | Description |
---|---|
FedbiomedDatasetError | bad file format |
Source code in fedbiomed/common/data/_medical_datasets.py
@staticmethod
def read_demographics(path: Union[str, Path], index_col: Optional[int] = None):
""" Read demographics tabular file for Medical Folder dataset
Raises:
FedbiomedDatasetError: bad file format
"""
path = Path(path)
if not path.is_file() or path.suffix.lower() not in [".csv", ".tsv"]:
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Demographics should be CSV or TSV files")
return pd.read_csv(path, index_col=index_col, engine='python')
subjects_with_imaging_data_folders
subjects_with_imaging_data_folders()
Retrieves subject folder names under Medical Folder root directory.
Returns:
Type | Description |
---|---|
List[str] | subject folder names under Medical Folder root directory. |
Source code in fedbiomed/common/data/_medical_datasets.py
def subjects_with_imaging_data_folders(self) -> List[str]:
"""Retrieves subject folder names under Medical Folder root directory.
Returns:
subject folder names under Medical Folder root directory.
"""
return [f.name for f in self._root.iterdir() if f.is_dir() and not f.name.startswith(".")]
validate_MedicalFolder_root_folder staticmethod
validate_MedicalFolder_root_folder(path)
Validates Medical Folder root directory by checking folder structure
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | Union[str, Path] | path to root directory | required |
Returns:
Type | Description |
---|---|
Path | Path to root folder of Medical Folder dataset |
Raises:
Type | Description |
---|---|
FedbiomedDatasetError | If path is not an instance of str or pathlib.Path, or if path is not a directory. |
Source code in fedbiomed/common/data/_medical_datasets.py
@staticmethod
def validate_MedicalFolder_root_folder(path: Union[str, Path]) -> Path:
""" Validates Medical Folder root directory by checking folder structure
Args:
path: path to root directory
Returns:
Path to root folder of Medical Folder dataset
Raises:
FedbiomedDatasetError: - If path is not an instance of `str` or `pathlib.Path`
- If path is not a directory
"""
if not isinstance(path, (Path, str)):
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: The argument root should an instance of "
f"`Path` or `str`, but got {type(path)}")
if not isinstance(path, Path):
path = Path(path)
path = Path(path).expanduser().resolve()
if not path.exists():
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Folder or file {path} not found on system")
if not path.is_dir():
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Root for Medical Folder dataset "
f"should be a directory.")
directories = [f for f in path.iterdir() if f.is_dir()]
if len(directories) == 0:
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Root folder of Medical Folder should "
f"contain subject folders, but no sub folder has been found. ")
modalities = [f for f in path.glob("*/*") if f.is_dir()]
if len(modalities) == 0:
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value} Subject folders for Medical Folder should "
f"contain modalities as folders. Folder structure should be "
f"root/<subjects>/<modalities>")
return path
MedicalFolderController
MedicalFolderController(root=None)
Bases: MedicalFolderBase
Utility class to construct and verify Medical Folder datasets without knowledge of the experiment.
The purpose of this class is to enable key functionalities related to the MedicalFolderDataset at the time of dataset deployment, i.e. when the data is being added to the node's database.
Specifically, the MedicalFolderController class can be used to (see the sketch after this list):
- construct a MedicalFolderDataset with all available data modalities, without knowing which ones will be used as targets or features during an experiment
- validate that the proper folder structure has been respected by the data managers preparing the data
- identify which subjects have which modalities
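A hedged sketch of the controller at dataset-deployment time; the paths and the index_col value are illustrative.
```python
from fedbiomed.common.data import MedicalFolderController

controller = MedicalFolderController(root='/data/medical_folder')     # illustrative path

# Which modalities exist for each subject folder?
status = controller.subject_modality_status()

# Build the dataset with all available modalities and an optional demographics file.
dataset = controller.load_MedicalFolder(
    tabular_file='/data/medical_folder/demographics.csv',
    index_col='subject_id')
```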
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root | str | Folder path to dataset. Defaults to None. | None |
Source code in fedbiomed/common/data/_medical_datasets.py
def __init__(self, root: str = None):
"""Constructs MedicalFolderController
Args:
root: Folder path to dataset. Defaults to None.
"""
super(MedicalFolderController, self).__init__(root=root)
Functions
load_MedicalFolder
load_MedicalFolder(tabular_file=None, index_col=None)
Load Medical Folder dataset with given tabular_file and index_col
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tabular_file | Union[str, Path] | File path to demographics data set | None |
index_col | Union[str, int] | Column index that represents subject folder names | None |
Returns:
Type | Description |
---|---|
MedicalFolderDataset | MedicalFolderDataset object |
Raises:
Type | Description |
---|---|
FedbiomedDatasetError | If Medical Folder dataset is not successfully loaded |
Source code in fedbiomed/common/data/_medical_datasets.py
def load_MedicalFolder(self,
tabular_file: Union[str, Path] = None,
index_col: Union[str, int] = None) -> MedicalFolderDataset:
""" Load Medical Folder dataset with given tabular_file and index_col
Args:
tabular_file: File path to demographics data set
index_col: Column index that represents subject folder names
Returns:
MedicalFolderDataset object
Raises:
FedbiomedDatasetError: If Medical Folder dataset is not successfully loaded
"""
if self._root is None:
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Can not load Medical Folder dataset without "
f"declaring root directory. Please set root or build MedicalFolderController "
f"with by providing `root` argument use")
modalities, _ = self.modalities()
try:
dataset = MedicalFolderDataset(root=self._root,
tabular_file=tabular_file,
index_col=index_col,
data_modalities=modalities,
target_modalities=modalities)
except FedbiomedError as e:
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Can not create Medical Folder dataset. {e}")
if self._dlp is not None:
dataset.set_dlp(self._dlp)
return dataset
subject_modality_status
subject_modality_status(index=None)
Scans subjects and checks which modalities are existing for each subject
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | Union[List, Series] | Array-like index that comes from reference csv file of Medical Folder dataset. It represents subject folder names. Defaults to None. | None |
Returns: Modality status for each subject that indicates which modalities are available
Source code in fedbiomed/common/data/_medical_datasets.py
def subject_modality_status(self, index: Union[List, pd.Series] = None) -> Dict:
"""Scans subjects and checks which modalities are existing for each subject
Args:
index: Array-like index that comes from reference csv file of Medical Folder dataset. It represents subject
folder names. Defaults to None.
Returns:
Modality status for each subject that indicates which modalities are available
"""
modalities, _ = self.modalities()
subjects = self.subjects_with_imaging_data_folders()
modality_status = {"columns": [*modalities], "data": [], "index": []}
if index is not None:
_, missing_subjects, missing_entries = self.available_subjects(subjects_from_index=index)
modality_status["columns"].extend(["in_folder", "in_index"])
for subject in subjects:
modality_report = self.is_modalities_existing(subject, modalities)
status_list = [status for status in modality_report]
if index is not None:
status_list.append(False if subject in missing_subjects else True)
status_list.append(False if subject in missing_entries else True)
modality_status["data"].append(status_list)
modality_status["index"].append(subject)
return modality_status
MedicalFolderDataset
MedicalFolderDataset(root, data_modalities='T1', transform=None, target_modalities='label', target_transform=None, demographics_transform=None, tabular_file=None, index_col=None)
Bases: Dataset, MedicalFolderBase
Torch dataset following the Medical Folder Structure.
The Medical Folder structure is loosely inspired by the BIDS standard [1]. It should respect the following pattern:
└─ MedicalFolder_root/
└─ demographics.csv
└─ sub-01/
├─ T1/
│ └─ sub-01_xxx.nii.gz
└─ T2/
├─ sub-01_xxx.nii.gz
[1] https://bids.neuroimaging.io/
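A hedged construction sketch following the structure above; the paths, modality names and index column are illustrative, and items are assumed to follow the structure returned by get_nontransformed_item.
```python
from fedbiomed.common.data import MedicalFolderDataset

dataset = MedicalFolderDataset(
    root='/data/MedicalFolder_root',                            # illustrative path
    data_modalities=['T1', 'T2'],
    target_modalities='label',
    tabular_file='/data/MedicalFolder_root/demographics.csv',
    index_col='subject_id',
)

# Items are ((data, demographics), targets), with data/targets keyed by modality.
(data, demographics), targets = dataset[0]
```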
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root | Union[str, PathLike, Path] | Root folder containing all the subject directories. | required |
data_modalities | (str, Iterable) | Modality or modalities to be used as data sources. | 'T1' |
transform | Union[Callable, Dict[str, Callable]] | A function or dict of function transform(s) that preprocess each data source. | None |
target_modalities | Optional[Union[str, Iterable[str]]] | Modality or modalities to be used as target sources. | 'label' |
target_transform | Union[Callable, Dict[str, Callable]] | A function or dict of function transform(s) that preprocess each target source. | None |
demographics_transform | Optional[Callable] | TODO | None |
tabular_file | Union[str, PathLike, Path, None] | Path to a CSV or Excel file containing the demographic information from the patients. | None |
index_col | Union[int, str, None] | Column name in the tabular file containing the subject ids which must match the folder names. | None |
Source code in fedbiomed/common/data/_medical_datasets.py
def __init__(self,
root: Union[str, PathLike, Path],
data_modalities: Optional[Union[str, Iterable[str]]] = 'T1',
transform: Union[Callable, Dict[str, Callable]] = None,
target_modalities: Optional[Union[str, Iterable[str]]] = 'label',
target_transform: Union[Callable, Dict[str, Callable]] = None,
demographics_transform: Optional[Callable] = None,
tabular_file: Union[str, PathLike, Path, None] = None,
index_col: Union[int, str, None] = None,
):
"""Constructor for class `MedicalFolderDataset`.
Args:
root: Root folder containing all the subject directories.
data_modalities (str, Iterable): Modality or modalities to be used as data sources.
transform: A function or dict of function transform(s) that preprocess each data source.
target_modalities: (str, Iterable): Modality or modalities to be used as target sources.
target_transform: A function or dict of function transform(s) that preprocess each target source.
demographics_transform: TODO
tabular_file: Path to a CSV or Excel file containing the demographic information from the patients.
index_col: Column name in the tabular file containing the subject ids which mush match the folder names.
"""
super(MedicalFolderDataset, self).__init__(root=root)
self._tabular_file = tabular_file
self._index_col = index_col
self._data_modalities = [data_modalities] if isinstance(data_modalities, str) else data_modalities
self._target_modalities = [target_modalities] if isinstance(target_modalities, str) else target_modalities
self._transform = self._check_and_reformat_transforms(transform, data_modalities)
self._target_transform = self._check_and_reformat_transforms(target_transform, target_modalities)
self._demographics_transform = demographics_transform if demographics_transform is not None else lambda x: {}
# Image loader
self._reader = Compose([
LoadImage(ITKReader(), image_only=True),
ToTensor()
])
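As an illustrative sketch (not part of the library documentation), a dataset following the structure above might be instantiated as follows. The root path, demographics file and index column are hypothetical and must match your own Medical Folder layout.

from fedbiomed.common.data import MedicalFolderDataset

# Hypothetical paths and column name: adapt them to your own Medical Folder layout.
dataset = MedicalFolderDataset(
    root='/data/MedicalFolder_root',
    data_modalities=['T1', 'T2'],
    target_modalities='label',
    tabular_file='/data/MedicalFolder_root/demographics.csv',
    index_col='subject_id',
)

# Items are served as ((data, demographics), target), with data keyed by modality.
(data, demographics), target = dataset.get_nontransformed_item(0)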
Attributes
ALLOWED_EXTENSIONS class-attribute
instance-attribute
ALLOWED_EXTENSIONS = ['.nii', '.nii.gz']
demographics cached
property
demographics
Loads tabular data file (supports excel, csv, tsv and colon separated value files).
index_col property
writable
index_col
Getter/setter of the column containing folder's name (in the tabular file)
subjects_has_all_modalities property
subjects_has_all_modalities
Gets only the subjects that have all required modalities
subjects_registered_in_demographics cached
property
subjects_registered_in_demographics
Gets only the subjects that are present in the demographics file.
tabular_file property
writable
tabular_file
Functions
get_nontransformed_item
get_nontransformed_item(item)
Source code in fedbiomed/common/data/_medical_datasets.py
def get_nontransformed_item(self, item):
# For the first item retrieve complete subject folders
subjects = self.subject_folders()
if not subjects:
# case where subjects is an empty list (subject folders have not been found)
raise FedbiomedDatasetError(
f"{ErrorNumbers.FB613.value}: Cannot find complete subject folders with all the modalities")
# Get subject folder
subject_folder = subjects[item]
# Load data modalities
data = self.load_images(subject_folder, modalities=self._data_modalities)
# Load target modalities
targets = self.load_images(subject_folder, modalities=self._target_modalities)
# Demographics
demographics = self._get_from_demographics(subject_id=subject_folder.name)
return (data, demographics), targets
load_images
load_images(subject_folder, modalities)
Loads modality images in given subject folder
Parameters:
Name | Type | Description | Default |
---|---|---|---|
subject_folder | Path | Subject folder where modalities are stored | required |
modalities | list | List of available modalities | required |
Returns:
Type | Description |
---|---|
Dict[str, Tensor] | Subject image data as a dictionary where keys represent each modality. |
Source code in fedbiomed/common/data/_medical_datasets.py
def load_images(self, subject_folder: Path, modalities: list) -> Dict[str, torch.Tensor]:
"""Loads modality images in given subject folder
Args:
subject_folder: Subject folder where modalities are stored
modalities: List of available modalities
Returns:
Subject image data as a dictionary where keys represent each modality.
"""
subject_data = {}
for modality in modalities:
modality_folder = self._subject_modality_folder(subject_folder, modality)
image_folder = subject_folder.joinpath(modality_folder)
nii_files = [p.resolve() for p in image_folder.glob("**/*")
if ''.join(p.suffixes) in self.ALLOWED_EXTENSIONS]
# Load the first, we assume there is going to be a single image per modality for now.
img_path = nii_files[0]
img = self._reader(img_path)
subject_data[modality] = img
return subject_data
set_dataset_parameters
set_dataset_parameters(parameters)
Sets dataset parameters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
parameters | dict | Parameters to initialize | required |
Raises:
Type | Description |
---|---|
FedbiomedDatasetError | If given parameters are not of `dict` type |
Source code in fedbiomed/common/data/_medical_datasets.py
def set_dataset_parameters(self, parameters: dict):
"""Sets dataset parameters.
Args:
parameters: Parameters to initialize
Raises:
FedbiomedDatasetError: If given parameters are not of `dict` type
"""
if not isinstance(parameters, dict):
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Expected type for `parameters` is `dict`, "
f"but got {type(parameters)}")
for key, value in parameters.items():
if hasattr(self, key):
setattr(self, key, value)
else:
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Trying to set non existing attribute '{key}'")
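For instance, the writable tabular_file and index_col attributes documented above could be set after construction; the values below are hypothetical.

dataset.set_dataset_parameters({
    'tabular_file': '/data/MedicalFolder_root/demographics.csv',  # hypothetical path
    'index_col': 'subject_id',                                    # hypothetical column name
})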
shape
shape()
Retrieves shape information for modalities and demographics csv
Source code in fedbiomed/common/data/_medical_datasets.py
def shape(self) -> dict:
"""Retrieves shape information for modalities and demographics csv"""
# Get all modalities
data_modalities = list(set(self._data_modalities))
target_modalities = list(set(self._target_modalities))
modalities = list(set(self._data_modalities + self._target_modalities))
(image, _), targets = self.get_nontransformed_item(0)
result = {modality: list(image[modality].shape) for modality in data_modalities}
result.update({modality: list(targets[modality].shape) for modality in target_modalities})
num_modalities = len(modalities)
demographics_shape = self.demographics.shape if self.demographics is not None else None
result.update({"demographics": demographics_shape, "num_modalities": num_modalities})
return result
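For illustration only, the returned dictionary might look as follows for a dataset with T1 data and label targets; the actual image shapes and demographics size depend entirely on the data.

shapes = dataset.shape()
# Possible (hypothetical) result:
# {'T1': [256, 256, 150], 'label': [256, 256, 150],
#  'demographics': (10, 4), 'num_modalities': 2}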
subject_folders
subject_folders()
Retrieves the folder names of only those subjects that have all requested modalities
Returns:
Type | Description |
---|---|
List[Path] | List of subject directories that have all requested modalities |
Source code in fedbiomed/common/data/_medical_datasets.py
def subject_folders(self) -> List[Path]:
"""Retrieves the folder names of only those subjects that have all requested modalities
Returns:
List of subject directories that have all requested modalities
"""
# If demographics are present
if self._tabular_file and self._index_col is not None:
complete_subject_folders = self.subjects_registered_in_demographics
else:
complete_subject_folders = self.subjects_has_all_modalities
return [self._root.joinpath(folder) for folder in complete_subject_folders]
MedicalFolderLoadingBlockTypes
MedicalFolderLoadingBlockTypes(*args)
Bases: DataLoadingBlockTypes, Enum
Source code in fedbiomed/common/constants.py
def __init__(self, *args):
cls = self.__class__
if not isinstance(self.value, str):
raise ValueError("all fields of DataLoadingBlockTypes subclasses"
" must be of str type")
if any(self.value == e.value for e in cls):
a = self.name
e = cls(self.value).name
raise ValueError(
f"duplicate values not allowed in DataLoadingBlockTypes and "
f"its subclasses: {a} --> {e}")
Attributes
MODALITIES_TO_FOLDERS class-attribute
instance-attribute
MODALITIES_TO_FOLDERS = 'modalities_to_folders'
NIFTIFolderDataset
NIFTIFolderDataset(root, transform=None, target_transform=None)
Bases: Dataset
A Generic class for loading NIFTI Images using the folder structure as the target classes' labels.
Supported formats: NIFTI and compressed NIFTI files (.nii, .nii.gz).
This is a Dataset useful in classification tasks. Its usage is quite simple, similar to torchvision.datasets.ImageFolder. Images must be contained in first-level sub-folders (level 2+ sub-folders are ignored) that describe the target class they belong to (the target class label is the name of the folder).
nifti_dataset_root_folder
├── control_group
│ ├── subject_1.nii
│ └── subject_2.nii
│ └── ...
└── disease_group
├── subject_3.nii
└── subject_4.nii
└── ...
In this example, there are 4 samples (one from each *.nii file) and 2 target classes, with labels control_group and disease_group. subject_1.nii has class label control_group, subject_3.nii has class label disease_group, etc.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root | Union[str, PathLike, Path] | folder where the data is located. | required |
transform | Union[Callable, None] | transforms to be applied on data. | None |
target_transform | Union[Callable, None] | transforms to be applied on target indexes. | None |
Raises:
Type | Description |
---|---|
FedbiomedDatasetError | bad argument type |
FedbiomedDatasetError | bad root path |
Source code in fedbiomed/common/data/_medical_datasets.py
def __init__(self, root: Union[str, PathLike, Path],
transform: Union[Callable, None] = None,
target_transform: Union[Callable, None] = None
):
"""Constructor of the class
Args:
root: folder where the data is located.
transform: transforms to be applied on data.
target_transform: transforms to be applied on target indexes.
Raises:
FedbiomedDatasetError: bad argument type
FedbiomedDatasetError: bad root path
"""
# check parameters type
for tr, trname in ((transform, 'transform'), (target_transform, 'target_transform')):
if not callable(tr) and tr is not None:
raise FedbiomedDatasetError(f"{ErrorNumbers.FB612.value}: Parameter {trname} has incorrect "
f"type {type(tr)}, cannot create dataset.")
if not isinstance(root, str) and not isinstance(root, PathLike) and not isinstance(root, Path):
raise FedbiomedDatasetError(f"{ErrorNumbers.FB612.value}: Parameter `root` has incorrect type "
f"{type(root)}, cannot create dataset.")
# initialize object variables
self._files = []
self._class_labels = []
self._targets = []
try:
self._root_dir = Path(root).expanduser()
except RuntimeError as e:
raise FedbiomedDatasetError(
f"{ErrorNumbers.FB612.value}: Cannot expand path {root}, error message is: {e}")
self._transform = transform
self._target_transform = target_transform
self._reader = Compose([
LoadImage(ITKReader(), image_only=True),
ToTensor()
])
self._explore_root_folder()
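A minimal usage sketch, assuming a root folder organized as in the example above; the path is hypothetical and only the documented accessors are used.

from fedbiomed.common.data import NIFTIFolderDataset

dataset = NIFTIFolderDataset(root='/data/nifti_dataset_root_folder')  # hypothetical path

print(dataset.labels())      # e.g. ['control_group', 'disease_group']
print(len(dataset.files()))  # number of discovered .nii / .nii.gz samples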
Functions
files
files()
Retrieves the paths to the sample images.
Gives the same order as when retrieving the sample images (e.g. self.files[0] is the path to self.__getitem__[0]).
Returns:
Type | Description |
---|---|
List[Path] | List of the absolute paths to the sample images |
Source code in fedbiomed/common/data/_medical_datasets.py
def files(self) -> List[Path]:
"""Retrieves the paths to the sample images.
Gives the same order as when retrieving the sample images (e.g. `self.files[0]`
is the path to `self.__getitem__[0]`)
Returns:
List of the absolute paths to the sample images
"""
return self._files
labels
labels()
Retrieves the labels of the target classes.
Target label index is the index of the corresponding label in this list.
Returns:
Type | Description |
---|---|
List[str] | List of the labels of the target classes. |
Source code in fedbiomed/common/data/_medical_datasets.py
def labels(self) -> List[str]:
"""Retrieves the labels of the target classes.
Target label index is the index of the corresponding label in this list.
Returns:
List of the labels of the target classes.
"""
return self._class_labels
NPDataLoader
NPDataLoader(dataset, target, batch_size=1, shuffle=False, random_seed=None, drop_last=False)
DataLoader for a Numpy dataset.
This data loader encapsulates a dataset composed of numpy arrays and presents an Iterable interface. One design principle was to try to make the interface as similar as possible to a torch.DataLoader.
Attributes:
Name | Type | Description |
---|---|---|
_dataset | (np.ndarray) a 2d array of features | |
_target | (np.ndarray) an optional array of target values | |
_batch_size | (int) the number of elements in one batch | |
_shuffle | (bool) if True, shuffle the data at the beginning of every epoch | |
_drop_last | (bool) if True, drop the last batch if it does not contain batch_size elements | |
_rng | (np.random.Generator) the random number generator for shuffling |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | ndarray | 2D Numpy array | required |
target | ndarray | Numpy array of target values | required |
batch_size | int | batch size for each iteration | 1 |
shuffle | bool | shuffle before iteration | False |
random_seed | Optional[int] | an optional integer to set the numpy random seed for shuffling. If it equals None, then no attempt will be made to set the random seed. | None |
drop_last | bool | whether to drop the last batch in case it does not fill the whole batch size | False |
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def __init__(self,
dataset: np.ndarray,
target: np.ndarray,
batch_size: int = 1,
shuffle: bool = False,
random_seed: Optional[int] = None,
drop_last: bool = False):
"""Construct numpy data loader
Args:
dataset: 2D Numpy array
target: Numpy array of target values
batch_size: batch size for each iteration
shuffle: shuffle before iteration
random_seed: an optional integer to set the numpy random seed for shuffling. If it equals
None, then no attempt will be made to set the random seed.
drop_last: whether to drop the last batch in case it does not fill the whole batch size
"""
if not isinstance(dataset, np.ndarray) or not isinstance(target, np.ndarray):
msg = f"{ErrorNumbers.FB609.value}. Wrong input type for `dataset` or `target` in NPDataLoader. " \
f"Expected type np.ndarray for both, instead got {type(dataset)} and " \
f"{type(target)} respectively."
logger.error(msg)
raise FedbiomedTypeError(msg)
# If the researcher gave a 1-dimensional dataset, we expand it to 2 dimensions
if dataset.ndim == 1:
dataset = dataset[:, np.newaxis]
# If the researcher gave a 1-dimensional target, we expand it to 2 dimensions
if target.ndim == 1:
target = target[:, np.newaxis]
if dataset.ndim != 2 or target.ndim != 2:
raise FedbiomedValueError(
f"{ErrorNumbers.FB609.value}. Wrong shape for `dataset` or `target` in "
f"NPDataLoader. Expected 2-dimensional arrays, instead got {dataset.ndim}- "
f"dimensional and {target.ndim}-dimensional arrays respectively.")
if len(dataset) != len(target):
raise FedbiomedValueError(
f"{ErrorNumbers.FB609.value}. Inconsistent length for `dataset` and `target` "
f"in NPDataLoader. Expected same length, instead got len(dataset)={len(dataset)}, "
f"len(target)={len(target)}")
if not isinstance(batch_size, int) or batch_size <= 0:
raise FedbiomedValueError(
f"{ErrorNumbers.FB609.value}. Wrong value for `batch_size` parameter of "
f"NPDataLoader. Expected a non-zero positive integer, instead got value {batch_size}.")
if random_seed is not None and not isinstance(random_seed, int):
raise FedbiomedTypeError(
f"{ErrorNumbers.FB609.value}. Wrong type for `random_seed` parameter of "
f"NPDataLoader. Expected int or None, instead got {type(random_seed)}.")
self._dataset = dataset
self._target = target
self._batch_size = batch_size
self._shuffle = shuffle
self._drop_last = drop_last
self._rng = np.random.default_rng(random_seed)
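A minimal sketch of constructing the loader and reading its properties. The arrays below are illustrative, and the (features, target) batch format during iteration is assumed by analogy with torch.DataLoader, as described above.

import numpy as np
from fedbiomed.common.data import NPDataLoader

X = np.random.rand(10, 3)            # 10 samples, 3 features
y = np.arange(10)                    # 1-d target, expanded to 2-d by the loader

loader = NPDataLoader(dataset=X, target=y, batch_size=4,
                      shuffle=True, random_seed=42, drop_last=False)

print(loader.batch_size())           # 4
print(loader.n_remainder_samples())  # 10 % 4 == 2
for features, target in loader:      # iterate mini-batches, torch.DataLoader style
    pass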
Attributes
dataset property
dataset
Returns the encapsulated dataset
This needs to be a property to harmonize the API with torch.DataLoader, enabling us to write generic code for both DataLoaders.
target property
target
Returns the array of target values
This has been made a property to have a homogeneous interface with the dataset property above.
Functions
batch_size
batch_size()
Returns the batch size
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def batch_size(self) -> int:
"""Returns the batch size"""
return self._batch_size
drop_last
drop_last()
Returns the boolean drop_last attribute
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def drop_last(self) -> bool:
"""Returns the boolean drop_last attribute"""
return self._drop_last
n_remainder_samples
n_remainder_samples()
Returns the remainder of the division between dataset length and batch size.
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def n_remainder_samples(self) -> int:
"""Returns the remainder of the division between dataset length and batch size."""
return len(self._dataset) % self._batch_size
rng
rng()
Returns the random number generator
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def rng(self) -> np.random.Generator:
"""Returns the random number generator"""
return self._rng
shuffle
shuffle()
Returns the boolean shuffle attribute
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def shuffle(self) -> bool:
"""Returns the boolean shuffle attribute"""
return self._shuffle
SerializationValidation
SerializationValidation()
Provide Validation capabilities for serializing/deserializing a [DataLoadingBlock] or [DataLoadingPlan].
When a developer inherits from [DataLoadingBlock] to define a custom loading block, they are required to call the _serialization_validator.update_validation_scheme
function with a dictionary argument containing the rules to validate all the additional fields that will be used in the serialization of their loading block.
These rules must follow the syntax explained in the SchemeValidator class.
For example
class MyLoadingBlock(DataLoadingBlock):
    def __init__(self):
        super().__init__()
        self.my_custom_data = {}
        self._serialization_validator.update_validation_scheme({
            'custom_data': {
                'rules': [dict],  # ... any other rules
                'required': True
            }
        })

    def serialize(self):
        serialized = super().serialize()
        serialized.update({'custom_data': self.my_custom_data})
        return serialized

    def deserialize(self, load_from: dict):
        super().deserialize(load_from)
        self.my_custom_data = load_from['custom_data']
        return self
Attributes:
Name | Type | Description |
---|---|---|
_validation_scheme | (dict) an extensible set of rules to validate the DataLoadingBlock metadata. |
Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self):
self._validation_scheme = {}
Functions
dlb_default_scheme classmethod
dlb_default_scheme()
The dictionary of default validation rules for a serialized [DataLoadingBlock].
Source code in fedbiomed/common/data/_data_loading_plan.py
@classmethod
def dlb_default_scheme(cls) -> Dict:
"""The dictionary of default validation rules for a serialized [DataLoadingBlock]."""
return {
'loading_block_class': {
'rules': [str, cls._identifier_validation_hook],
'required': True,
},
'loading_block_module': {
'rules': [str, cls._identifier_validation_hook],
'required': True,
},
'dlb_id': {
'rules': [str, cls._serial_id_validation_hook],
'required': True,
},
}
dlp_default_scheme classmethod
dlp_default_scheme()
The dictionary of default validation rules for a serialized [DataLoadingPlan].
Source code in fedbiomed/common/data/_data_loading_plan.py
@classmethod
def dlp_default_scheme(cls) -> Dict:
"""The dictionary of default validation rules for a serialized [DataLoadingPlan]."""
return {
'dlp_id': {
'rules': [str],
'required': True,
},
'dlp_name': {
'rules': [str],
'required': True,
},
'target_dataset_type': {
'rules': [str, cls._target_dataset_type_validator],
'required': True,
},
'loading_blocks': {
'rules': [dict, cls._loading_blocks_types_validator],
'required': True
},
'key_paths': {
'rules': [dict, cls._key_paths_validator],
'required': True
}
}
update_validation_scheme
update_validation_scheme(new_scheme)
Updates the validation scheme.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
new_scheme | dict | New dict of rules | required |
Source code in fedbiomed/common/data/_data_loading_plan.py
def update_validation_scheme(self, new_scheme: dict) -> None:
"""Updates the validation scheme.
Args:
new_scheme: (dict) new dict of rules
"""
self._validation_scheme.update(new_scheme)
validate
validate(dlb_metadata, exception_type, only_required=True)
Validate a dict of dlb_metadata according to the _validation_scheme.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dlb_metadata | dict | the [DataLoadingBlock] metadata, as returned by serialize or as loaded from the node database. | required |
exception_type | Type[FedbiomedError] | the type of the exception to be raised when validation fails. | required |
only_required | bool | see SchemeValidator.populate_with_defaults | True |
Raises: exception_type: if the validation fails.
Source code in fedbiomed/common/data/_data_loading_plan.py
def validate(self,
dlb_metadata: Dict,
exception_type: Type[FedbiomedError],
only_required: bool = True) -> None:
"""Validate a dict of dlb_metadata according to the _validation_scheme.
Args:
dlb_metadata (dict) : the [DataLoadingBlock] metadata, as returned by serialize or as loaded from the
node database.
exception_type (Type[FedbiomedError]): the type of the exception to be raised when validation fails.
only_required (bool) : see SchemeValidator.populate_with_defaults
Raises:
exception_type: if the validation fails.
"""
try:
sc = SchemeValidator(self._validation_scheme)
except RuleError as e:
msg = ErrorNumbers.FB614.value + f": {e}"
logger.critical(msg)
raise exception_type(msg)
try:
dlb_metadata = sc.populate_with_defaults(dlb_metadata,
only_required=only_required)
except ValidatorError as e:
msg = ErrorNumbers.FB614.value + f": {e}"
logger.critical(msg)
raise exception_type(msg)
try:
sc.validate(dlb_metadata)
except ValidateError as e:
msg = ErrorNumbers.FB614.value + f": {e}"
logger.critical(msg)
raise exception_type(msg)
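As a sketch of how validation is typically wired up, the default block scheme can be used to check a metadata dictionary. The metadata values and the exception import path below are assumptions made for illustration, not part of the documented API.

import uuid

from fedbiomed.common.data import SerializationValidation
from fedbiomed.common.exceptions import FedbiomedLoadingBlockValueError  # assumed import path

validator = SerializationValidation()
validator.update_validation_scheme(SerializationValidation.dlb_default_scheme())

dlb_metadata = {  # illustrative metadata, shaped like the output of DataLoadingBlock.serialize()
    'loading_block_class': 'MyLoadingBlock',
    'loading_block_module': 'mypackage.loading_blocks',
    'dlb_id': 'serialized_dlb_' + str(uuid.uuid4()),
}
validator.validate(dlb_metadata, FedbiomedLoadingBlockValueError)  # raises on invalid metadata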
SkLearnDataManager
SkLearnDataManager(inputs, target, **kwargs)
Bases: object
Wrapper for pd.DataFrame, pd.Series and np.ndarray datasets.
Manages datasets for scikit-learn based model training. Responsible for managing the inputs and target variables that have been provided in training_data
of scikit-learn based training plans.
The loader arguments will be passed to the [fedbiomed.common.data.NPDataLoader] classes instantiated when split is called. They may include batch_size, shuffle, drop_last, and others. Please see the [fedbiomed.common.data.NPDataLoader] class for more details.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | Union[ndarray, DataFrame, Series] | Independent variables (inputs, features) for model training | required |
target | Union[ndarray, DataFrame, Series] | Dependent variable/s (target) for model training and validation | required |
**kwargs | dict | Loader arguments | {} |
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def __init__(self,
inputs: Union[np.ndarray, pd.DataFrame, pd.Series],
target: Union[np.ndarray, pd.DataFrame, pd.Series],
**kwargs: dict):
""" Construct a SkLearnDataManager from an array of inputs and an array of targets.
The loader arguments will be passed to the [fedbiomed.common.data.NPDataLoader] classes instantiated
when split is called. They may include batch_size, shuffle, drop_last, and others. Please see the
[fedbiomed.common.data.NPDataLoader] class for more details.
Args:
inputs: Independent variables (inputs, features) for model training
target: Dependent variable/s (target) for model training and validation
**kwargs: Loader arguments
"""
if not isinstance(inputs, (np.ndarray, pd.DataFrame, pd.Series)) or \
not isinstance(target, (np.ndarray, pd.DataFrame, pd.Series)):
msg = f"{ErrorNumbers.FB609.value}. Parameters `inputs` and `target` for " \
f"initialization of {self.__class__.__name__} should be one of np.ndarray, pd.DataFrame, pd.Series"
logger.error(msg)
raise FedbiomedTypeError(msg)
# Convert pd.DataFrame or pd.Series to np.ndarray for `inputs`
if isinstance(inputs, (pd.DataFrame, pd.Series)):
self._inputs = inputs.to_numpy()
else:
self._inputs = inputs
# Convert pd.DataFrame or pd.Series to np.ndarray for `target`
if isinstance(target, (pd.DataFrame, pd.Series)):
self._target = target.to_numpy()
else:
self._target = target
# Additional loader arguments
self._loader_arguments = kwargs
# Subset None means that train/validation split has not been performed
self._subset_test: Union[Tuple[np.ndarray, np.ndarray], None] = None
self._subset_train: Union[Tuple[np.ndarray, np.ndarray], None] = None
Functions
dataset
dataset()
Gets the entire registered dataset.
This method returns the whole dataset as is, without any split.
Returns:
Name | Type | Description |
---|---|---|
inputs | ndarray | Input variables for model training |
targets | ndarray | Target variable for model training |
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def dataset(self) -> Tuple[np.ndarray, np.ndarray]:
"""Gets the entire registered dataset.
This method returns the whole dataset as is, without any split.
Returns:
inputs: Input variables for model training
targets: Target variable for model training
"""
return self._inputs, self._target
split
split(test_ratio, test_batch_size)
Splits the np.ndarray dataset into train and validation partitions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
test_ratio | float | Ratio for validation set partition. Rest of the samples will be used for training | required |
test_batch_size | int | Batch size for the validation data loader. If 0 or None, the whole validation partition is loaded as a single batch. | required |
Raises:
Type | Description |
---|---|
FedbiomedSkLearnDataManagerError | If the `test_ratio` is not a float between 0 and 1 |
Returns:
Name | Type | Description |
---|---|---|
train_loader | NPDataLoader | NPDataLoader for the training partition |
test_loader | NPDataLoader | NPDataLoader for the validation partition |
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def split(self, test_ratio: float, test_batch_size: int) -> Tuple[NPDataLoader, NPDataLoader]:
"""Splits `np.ndarray` dataset into train and validation.
Args:
test_ratio: Ratio for validation set partition. Rest of the samples will be used for training
Raises:
FedbiomedSkLearnDataManagerError: If the `test_ratio` is not between 0 and 1
Returns:
train_loader: NPDataLoader for the training partition
test_loader: NPDataLoader for the validation partition
"""
if not isinstance(test_ratio, float):
msg = f'{ErrorNumbers.FB609.value}: The argument `ratio` should be type `float` not {type(test_ratio)}'
logger.error(msg)
raise FedbiomedTypeError(msg)
if test_ratio < 0. or test_ratio > 1.:
msg = f'{ErrorNumbers.FB609.value}: The argument `ratio` should be equal or between 0 and 1, ' \
f'not {test_ratio}'
logger.error(msg)
raise FedbiomedTypeError(msg)
empty_subset = (np.array([]), np.array([]))
if test_ratio <= 0.:
self._subset_train = (self._inputs, self._target)
self._subset_test = empty_subset
elif test_ratio >= 1.:
self._subset_train = empty_subset
self._subset_test = (self._inputs, self._target)
else:
x_train, x_test, y_train, y_test = train_test_split(self._inputs, self._target, test_size=test_ratio)
self._subset_test = (x_test, y_test)
self._subset_train = (x_train, y_train)
if not test_batch_size:
test_batch_size = len(self._subset_test)
return self._subset_loader(self._subset_train, **self._loader_arguments), \
self._subset_loader(self._subset_test, batch_size=test_batch_size)
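A minimal sketch of the typical flow; the arrays and loader arguments are illustrative.

import numpy as np
from fedbiomed.common.data import SkLearnDataManager

X = np.random.rand(100, 5)             # features
y = np.random.randint(0, 2, size=100)  # targets

# Loader arguments (batch_size, shuffle, ...) are forwarded to NPDataLoader at split time.
manager = SkLearnDataManager(inputs=X, target=y, batch_size=8, shuffle=True)
train_loader, test_loader = manager.split(test_ratio=0.2, test_batch_size=None)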
subset_test
subset_test()
Gets Subset of dataset for validation partition.
Returns:
Name | Type | Description |
---|---|---|
test_inputs | ndarray | Input variables of validation subset for model validation |
test_target | ndarray | Target variable of validation subset for model validation |
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def subset_test(self) -> Tuple[np.ndarray, np.ndarray]:
"""Gets Subset of dataset for validation partition.
Returns:
test_inputs: Input variables of validation subset for model validation
test_target: Target variable of validation subset for model validation
"""
return self._subset_test
subset_train
subset_train()
Gets Subset for train partition.
Returns:
Name | Type | Description |
---|---|---|
train_inputs | ndarray | Input variables of training subset for model training |
train_target | ndarray | Target variable of training subset for model training |
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def subset_train(self) -> Tuple[np.ndarray, np.ndarray]:
"""Gets Subset for train partition.
Returns:
train_inputs: Input variables of training subset for model training
train_target: Target variable of training subset for model training
"""
return self._subset_train
TabularDataset
TabularDataset(inputs, target)
Bases: Dataset
Torch-based Dataset object for creating a torch Dataset from numpy array or pandas DataFrame/Series input and target variables.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | Union[ndarray, DataFrame, Series] | Input variables that will be passed to network | required |
target | Union[ndarray, DataFrame, Series] | Target variable for output layer | required |
Raises:
Type | Description |
---|---|
FedbiomedTorchDatasetError | If input variables and target variable do not have equal length/size |
Source code in fedbiomed/common/data/_tabular_dataset.py
def __init__(self,
inputs: Union[np.ndarray, pd.DataFrame, pd.Series],
target: Union[np.ndarray, pd.DataFrame, pd.Series]):
"""Constructs PyTorch dataset object
Args:
inputs: Input variables that will be passed to network
target: Target variable for output layer
Raises:
FedbiomedTorchDatasetError: If input variables and target variable do not have
equal length/size
"""
# Inputs and target variable should be converted to the torch tensors
# PyTorch provides `from_numpy` function to convert numpy arrays to
# torch tensor. Therefore, if the arguments `inputs` and `target` are
# instance one of `pd.DataFrame` or `pd.Series`, they should be converted to
# numpy arrays
if isinstance(inputs, (pd.DataFrame, pd.Series)):
self.inputs = inputs.to_numpy()
elif isinstance(inputs, np.ndarray):
self.inputs = inputs
else:
raise FedbiomedDatasetError(f"{ErrorNumbers.FB610.value}: The argument `inputs` should be "
f"an instance of one of np.ndarray, pd.DataFrame or pd.Series")
# Configuring self.target attribute
if isinstance(target, (pd.DataFrame, pd.Series)):
self.target = target.to_numpy()
elif isinstance(target, np.ndarray):
self.target = target
else:
raise FedbiomedDatasetError(f"{ErrorNumbers.FB610.value}: The argument `target` should be "
f"an instance of one of np.ndarray, pd.DataFrame or pd.Series")
# The lengths should be equal
if len(self.inputs) != len(self.target):
raise FedbiomedDatasetError(f"{ErrorNumbers.FB610.value}: Length of input variables and target "
f"variable does not match. Please make sure that they have "
f"equal size while creating the method `training_data` of "
f"TrainingPlan")
# Convert `inputs` and `target` to Torch floats
self.inputs = from_numpy(self.inputs).float()
self.target = from_numpy(self.target).float()
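A small sketch of constructing the dataset from numpy arrays; the data is illustrative and only the documented attributes are inspected.

import numpy as np
from fedbiomed.common.data import TabularDataset

X = np.random.rand(50, 4)
y = np.random.rand(50, 1)

dataset = TabularDataset(inputs=X, target=y)
print(dataset.inputs.dtype, dataset.target.dtype)  # both converted to torch float tensors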
Attributes
inputs instance-attribute
inputs = float()
target instance-attribute
target = float()
Functions
get_dataset_type staticmethod
get_dataset_type()
Source code in fedbiomed/common/data/_tabular_dataset.py
@staticmethod
def get_dataset_type() -> DatasetTypes:
return DatasetTypes.TABULAR
TorchDataManager
TorchDataManager(dataset, **kwargs)
Bases: object
Wrapper for PyTorch Dataset to manage loading operations for validation and train.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | Dataset | Dataset object for torch.utils.data.DataLoader | required |
**kwargs | dict | Arguments for PyTorch | {} |
Raises:
Type | Description |
---|---|
FedbiomedTorchDataManagerError | If the argument `dataset` is not an instance of `torch.utils.data.Dataset` |
Source code in fedbiomed/common/data/_torch_data_manager.py
def __init__(self, dataset: Dataset, **kwargs: dict):
"""Construct of class
Args:
dataset: Dataset object for torch.utils.data.DataLoader
**kwargs: Arguments for PyTorch `DataLoader`
Raises:
FedbiomedTorchDataManagerError: If the argument `dataset` is not an instance of `torch.utils.data.Dataset`
"""
# TorchDataManager should get `dataset` argument as an instance of torch.utils.data.Dataset
if not isinstance(dataset, Dataset):
raise FedbiomedTorchDataManagerError(
f"{ErrorNumbers.FB608.value}: The attribute `dataset` should be an instance "
f"of `torch.utils.data.Dataset`, please use `Dataset` as parent class for "
f"your custom torch dataset object")
self._dataset = dataset
self._loader_arguments = kwargs
self._subset_test: Union[Subset, None] = None
self._subset_train: Union[Subset, None] = None
Functions
load_all_samples
load_all_samples()
Loading all samples as PyTorch DataLoader without splitting.
Returns:
Type | Description |
---|---|
DataLoader | Dataloader for entire datasets. |
Source code in fedbiomed/common/data/_torch_data_manager.py
def load_all_samples(self) -> DataLoader:
"""Loading all samples as PyTorch DataLoader without splitting.
Returns:
Dataloader for entire datasets. `DataLoader` arguments will be retrieved from the `**kwargs` which
is defined while initializing the class
"""
return self._create_torch_data_loader(self._dataset, **self._loader_arguments)
split
split(test_ratio, test_batch_size)
Splitting PyTorch Dataset into train and validation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
test_ratio | float | Split ratio for the validation set. Rest of the samples will be used for training | required |
test_batch_size | Union[int, None] | Batch size for the validation data loader. If None or 0, the whole validation subset is loaded as a single batch. | required |
Raises:
Type | Description |
---|---|
FedbiomedTorchDataManagerError | If `test_ratio` is not a float/int between 0 and 1 |
Returns:
Name | Type | Description |
---|---|---|
train_loader | Union[DataLoader, None] | DataLoader for training subset. |
test_loader | Union[DataLoader, None] | DataLoader for validation subset. |
Source code in fedbiomed/common/data/_torch_data_manager.py
def split(self, test_ratio: float, test_batch_size: Union[int, None]) -> Tuple[Union[DataLoader, None], Union[DataLoader, None]]:
""" Splitting PyTorch Dataset into train and validation.
Args:
test_ratio: Split ratio for validation set ratio. Rest of the samples will be used for training
Raises:
FedbiomedTorchDataManagerError: If `test_ratio` is not a float/int between 0 and 1
Returns:
train_loader: DataLoader for training subset. `None` if the `test_ratio` is `1`
test_loader: DataLoader for validation subset. `None` if the `test_ratio` is `0`
"""
# Check the argument `ratio` is of type `float`
if not isinstance(test_ratio, (float, int)):
raise FedbiomedTorchDataManagerError(f'{ErrorNumbers.FB608.value}: The argument `ratio` should be '
f'type `float` or `int` not {type(test_ratio)}')
# Check ratio is valid for splitting
if test_ratio < 0 or test_ratio > 1:
raise FedbiomedTorchDataManagerError(f'{ErrorNumbers.FB608.value}: The argument `ratio` should be '
f'equal or between 0 and 1, not {test_ratio}')
# Check that the dataset implements `__len__` so that the number of samples can be computed
if not hasattr(self._dataset, '__len__'):
raise FedbiomedTorchDataManagerError(f"{ErrorNumbers.FB608.value}: Can not get number of samples from "
f"{str(self._dataset)} without `__len__`. Please make sure "
f"that `__len__` method has been added to custom dataset. "
f"This method should return total number of samples.")
try:
samples = len(self._dataset)
except AttributeError as e:
raise FedbiomedTorchDataManagerError(f"{ErrorNumbers.FB608.value}: Can not get number of samples from "
f"{str(self._dataset)} due to undefined attribute, {str(e)}")
except TypeError as e:
raise FedbiomedTorchDataManagerError(f"{ErrorNumbers.FB608.value}: Can not get number of samples from "
f"{str(self._dataset)}, {str(e)}")
# Calculate number of samples for train and validation subsets
test_samples = math.floor(samples * test_ratio)
train_samples = samples - test_samples
self._subset_train, self._subset_test = random_split(self._dataset, [train_samples, test_samples])
if not test_batch_size:
test_batch_size = len(self._subset_test)
loaders = (self._subset_loader(self._subset_train, **self._loader_arguments),
self._subset_loader(self._subset_test, batch_size = test_batch_size))
return loaders
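A short sketch of wrapping a torch dataset and splitting it; the tensor dataset and loader arguments are illustrative.

import torch
from torch.utils.data import TensorDataset
from fedbiomed.common.data import TorchDataManager

dataset = TensorDataset(torch.rand(100, 5), torch.rand(100, 1))  # any torch Dataset works
# Keyword arguments are forwarded to torch.utils.data.DataLoader for the training loader.
manager = TorchDataManager(dataset, batch_size=16, shuffle=True)

train_loader, test_loader = manager.split(test_ratio=0.25, test_batch_size=None)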
subset_test
subset_test()
Gets validation subset of the dataset.
Returns:
Type | Description |
---|---|
Subset | Validation subset |
Source code in fedbiomed/common/data/_torch_data_manager.py
def subset_test(self) -> Subset:
"""Gets validation subset of the dataset.
Returns:
Validation subset
"""
return self._subset_test
subset_train
subset_train()
Gets train subset of the dataset.
Returns:
Type | Description |
---|---|
Subset | Train subset |
Source code in fedbiomed/common/data/_torch_data_manager.py
def subset_train(self) -> Subset:
"""Gets train subset of the dataset.
Returns:
Train subset
"""
return self._subset_train
to_sklearn
to_sklearn()
Converts a PyTorch Dataset to a Fed-BioMed sklearn data manager.
Returns:
Type | Description |
---|---|
SkLearnDataManager | Data manager to use in SkLearn base training plans |
Source code in fedbiomed/common/data/_torch_data_manager.py
def to_sklearn(self) -> SkLearnDataManager:
"""Converts PyTorch `Dataset` to sklearn data manager of Fed-BioMed.
Returns:
Data manager to use in SkLearn base training plans
"""
loader = self._create_torch_data_loader(self._dataset, batch_size=len(self._dataset))
# Iterate over samples and get input variable and target variable
inputs = next(iter(loader))[0].numpy()
target = next(iter(loader))[1].numpy()
return SkLearnDataManager(inputs=inputs, target=target, **self._loader_arguments)
Functions
discover_flamby_datasets
discover_flamby_datasets()
Automatically discover the available Flamby datasets based on the contents of the flamby.datasets module.
Returns:
Type | Description |
---|---|
Dict[int, str] | a dictionary {index: dataset_name} where index is an int and dataset_name is the name of a flamby module corresponding to a dataset, represented as str. To import said module one must prepend with the correct path: `import flamby.datasets.dataset_name`. |
Source code in fedbiomed/common/data/_flamby_dataset.py
def discover_flamby_datasets() -> Dict[int, str]:
"""Automatically discover the available Flamby datasets based on the contents of the flamby.datasets module.
Returns:
a dictionary {index: dataset_name} where index is an int and dataset_name is the name of a flamby module
corresponding to a dataset, represented as str. To import said module one must prepend with the correct
path: `import flamby.datasets.dataset_name`.
"""
dataset_list = [name for _, name, ispkg in pkgutil.iter_modules(flamby_datasets_module.__path__) if ispkg]
return {i: name for i, name in enumerate(dataset_list)}
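For illustration, the returned mapping can be used to import one of the discovered dataset modules. The dataset names shown in the comment are examples only, and flamby must be installed for the discovery to return anything.

import importlib
from fedbiomed.common.data import discover_flamby_datasets

available = discover_flamby_datasets()  # e.g. {0: 'fed_heart_disease', 1: 'fed_ixi', ...}
name = available[0]
module = importlib.import_module(f"flamby.datasets.{name}")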