Classes that simplify imports from fedbiomed.common.data
Classes
DataLoadingBlock
DataLoadingBlock()
Bases: ABC
The building blocks of a DataLoadingPlan.
A DataLoadingBlock describes an intermediary layer between the researcher and the node's filesystem. It allows the node to specify a customization in the way data is "perceived" by the data loaders during training.
A DataLoadingBlock is identified by its type_id attribute. Thus, this attribute should be unique among all DataLoadingBlockTypes in the same DataLoadingPlan. Moreover, we may test equality between a DataLoadingBlock and a string by checking its type_id, as a means of easily testing whether a DataLoadingBlock is contained in a collection.
Correct usage of this class requires creating ad-hoc subclasses. The DataLoadingBlock class is not intended to be instantiated directly.
Subclasses of DataLoadingBlock must respect the following conditions (see the sketch after this list):
- implement a default constructor
- the implemented constructor must call super().__init__()
- extend the serialize(self) and the deserialize(self, load_from: dict) functions
- both serialize and deserialize must call super's serialize and deserialize respectively
- the deserialize function must always return self
- the serialize function must update the dict returned by super's serialize
- implement an apply function that takes arbitrary arguments and applies the logic of the loading_block
- update the _validation_scheme to define rules for all new fields returned by the serialize function
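A minimal sketch of a subclass following these rules. The PrefixBlock class, its prefix field and the validation-scheme entry are illustrative, not part of Fed-BioMed; check SerializationValidation for the exact format expected by update_validation_scheme.
```python
from fedbiomed.common.data import DataLoadingBlock


class PrefixBlock(DataLoadingBlock):
    """Hypothetical block that prepends a configurable prefix to a file name."""

    def __init__(self):
        super().__init__()
        self.prefix = ""
        # Assumed scheme format ({field: {'rules': [...], 'required': ...}});
        # verify against SerializationValidation before relying on it.
        self._serialization_validator.update_validation_scheme(
            {'prefix': {'rules': [str], 'required': True}})

    def serialize(self) -> dict:
        ret = super().serialize()              # start from super's dict ...
        ret.update({'prefix': self.prefix})    # ... and update it with the new field
        return ret

    def deserialize(self, load_from: dict):
        super().deserialize(load_from)         # call super first
        self.prefix = load_from['prefix']
        return self                            # deserialize must return self

    def apply(self, filename: str) -> str:
        # The loading block's logic: prepend the configured prefix.
        return f"{self.prefix}{filename}"
```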
Attributes:
Name | Type | Description |
---|---|---|
__serialization_id | str | identifies one serialized instance of the DataLoadingBlock |
Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self):
self.__serialization_id = 'serialized_dlb_' + str(uuid.uuid4())
self._serialization_validator = SerializationValidation()
self._serialization_validator.update_validation_scheme(SerializationValidation.dlb_default_scheme())
Functions
apply abstractmethod
apply(*args, **kwargs)
Abstract method representing an application of the DataLoadingBlock
Source code in fedbiomed/common/data/_data_loading_plan.py
@abstractmethod
def apply(self, *args, **kwargs):
"""Abstract method representing an application of the DataLoadingBlock
"""
pass
deserialize
deserialize(load_from)
Reconstruct the DataLoadingBlock from a serialized version.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
load_from | dict | a dictionary as obtained by the serialize function. | required |
Returns: the self instance
Source code in fedbiomed/common/data/_data_loading_plan.py
def deserialize(self, load_from: dict) -> TDataLoadingBlock:
"""Reconstruct the DataLoadingBlock from a serialized version.
Args:
load_from (dict): a dictionary as obtained by the serialize function.
Returns:
the self instance
"""
self._serialization_validator.validate(load_from, FedbiomedLoadingBlockValueError)
self.__serialization_id = load_from['dlb_id']
return self
get_serialization_id
get_serialization_id()
Expose serialization id as read-only
Source code in fedbiomed/common/data/_data_loading_plan.py
def get_serialization_id(self):
"""Expose serialization id as read-only"""
return self.__serialization_id
instantiate_class staticmethod
instantiate_class(loading_block)
Instantiate one DataLoadingBlock object of the type defined in the arguments.
Uses the loading_block_module and loading_block_class fields of the loading_block argument to identify the type of DataLoadingBlock to be instantiated, then calls its default constructor. Note that this function does not call deserialize.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
loading_block | dict | DataLoadingBlock metadata in the format returned by the serialize function. | required |
Returns: A default-constructed instance of a DataLoadingBlock of the type defined in the metadata.
Raises: FedbiomedLoadingBlockError: if the instantiation process raised any exception.
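A short sketch of the intended round trip: because instantiate_class only default-constructs the block, deserialize must be called separately to restore its state. MapperBlock is used here purely as an illustration.
```python
from fedbiomed.common.data import DataLoadingBlock, MapperBlock

block = MapperBlock()
block.map = {'CT': 'ct_scans'}

metadata = block.serialize()                              # module, class, dlb_id (plus 'map')
restored = DataLoadingBlock.instantiate_class(metadata)   # default-constructed, state not restored
restored = restored.deserialize(metadata)                 # now the mapping is back
assert restored.apply('CT') == 'ct_scans'
```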
Source code in fedbiomed/common/data/_data_loading_plan.py
@staticmethod
def instantiate_class(loading_block: dict) -> TDataLoadingBlock:
"""Instantiate one [DataLoadingBlock][fedbiomed.common.data._data_loading_plan.DataLoadingBlock]
object of the type defined in the arguments.
Uses the `loading_block_module` and `loading_block_class` fields of the loading_block argument to
identify the type of [DataLoadingBlock][fedbiomed.common.data._data_loading_plan.DataLoadingBlock]
to be instantiated, then calls its default constructor.
Note that this function **does not call deserialize**.
Args:
loading_block (dict): [DataLoadingBlock][fedbiomed.common.data._data_loading_plan.DataLoadingBlock]
metadata in the format returned by the serialize function.
Returns:
A default-constructed instance of a
[DataLoadingBlock][fedbiomed.common.data._data_loading_plan.DataLoadingBlock]
of the type defined in the metadata.
Raises:
FedbiomedLoadingBlockError: if the instantiation process raised any exception.
"""
try:
dlb_module = import_module(loading_block['loading_block_module'])
dlb = eval(f"dlb_module.{loading_block['loading_block_class']}()")
except Exception as e:
msg = f"{ErrorNumbers.FB614.value}: could not instantiate DataLoadingBlock from the following metadata: " +\
f"{loading_block} because of {type(e).__name__}: {e}"
logger.debug(msg)
raise FedbiomedLoadingBlockError(msg)
return dlb
instantiate_key staticmethod
instantiate_key(key_module, key_classname, loading_block_key_str)
Imports and loads the DataLoadingBlockTypes key corresponding to the passed arguments
Parameters:
Name | Type | Description | Default |
---|---|---|---|
key_module | str | name of the module where the DataLoadingBlockTypes subclass is defined | required |
key_classname | str | name of the DataLoadingBlockTypes subclass | required |
loading_block_key_str | str | string value of the key to be instantiated | required |
Raises:
Type | Description |
---|---|
FedbiomedDataLoadingPlanError | if the key could not be imported and instantiated from the given module, class name and key string. |
Returns:
Name | Type | Description |
---|---|---|
DataLoadingBlockTypes | DataLoadingBlockTypes | the loading block key corresponding to loading_block_key_str. |
Source code in fedbiomed/common/data/_data_loading_plan.py
@staticmethod
def instantiate_key(key_module: str, key_classname: str, loading_block_key_str: str) -> DataLoadingBlockTypes:
"""Imports and loads [DataLoadingBlockTypes][fedbiomed.common.constants.DataLoadingBlockTypes]
regarding the passed arguments
Args:
key_module (str): _description_
key_classname (str): _description_
loading_block_key_str (str): _description_
Raises:
FedbiomedDataLoadingPlanError: _description_
Returns:
DataLoadingBlockTypes: _description_
"""
try:
keys = import_module(key_module)
loading_block_key = eval(f"keys.{key_classname}('{loading_block_key_str}')")
except Exception as e:
msg = f"{ErrorNumbers.FB615.value} Error deserializing loading block key " + \
f"{loading_block_key_str} with path {key_module}.{key_classname} " + \
f"because of {type(e).__name__}: {e}"
logger.debug(msg)
raise FedbiomedDataLoadingPlanError(msg)
return loading_block_key
serialize
serialize()
Serializes the class in a format similar to json.
Returns:
Type | Description |
---|---|
dict | a dictionary of key-value pairs sufficient for reconstructing the DataLoadingBlock. |
Source code in fedbiomed/common/data/_data_loading_plan.py
def serialize(self) -> dict:
"""Serializes the class in a format similar to json.
Returns:
a dictionary of key-value pairs sufficient for reconstructing
the DataLoadingBlock.
"""
return dict(
loading_block_class=self.__class__.__qualname__,
loading_block_module=self.__module__,
dlb_id=self.__serialization_id
)
DataLoadingPlan
DataLoadingPlan(*args, **kwargs)
Bases: Dict[DataLoadingBlockTypes, DataLoadingBlock]
Customizations to the way the data is loaded and presented for training.
A DataLoadingPlan is a dictionary of {name: DataLoadingBlock} pairs. Each DataLoadingBlock represents a customization to the way data is loaded and presented to the researcher. These customizations are defined by the node, but they operate on a Dataset class, which is defined by the library and instantiated by the researcher.
To exploit this functionality, a Dataset must be modified to accept the customizations provided by the DataLoadingPlan. To simplify this process, we provide the DataLoadingPlanMixin class below.
The DataLoadingPlan class should be instantiated directly, no subclassing is needed. The DataLoadingPlan is a dict, and exposes the same interface as a dict.
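A minimal usage sketch. MyLoadingBlockTypes is a hypothetical key enum defined only for this example (it mirrors the FlambyLoadingBlockTypes pattern documented further down); MapperBlock is documented later on this page.
```python
from enum import Enum

from fedbiomed.common.constants import DataLoadingBlockTypes
from fedbiomed.common.data import DataLoadingPlan, MapperBlock


class MyLoadingBlockTypes(DataLoadingBlockTypes, Enum):
    # Hypothetical key: values must be unique strings across all DataLoadingBlockTypes.
    MODALITIES_TO_FOLDERS = 'my_modalities_to_folders'


dlb = MapperBlock()
dlb.map = {'T1': 'T1_weighted'}

dlp = DataLoadingPlan()                                # behaves like a dict of {key: block}
dlp[MyLoadingBlockTypes.MODALITIES_TO_FOLDERS] = dlb
dlp.desc = "Rename modality folders"

serialized_dlp, serialized_blocks = dlp.serialize()
restored = DataLoadingPlan().deserialize(serialized_dlp, serialized_blocks)
```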
Attributes:
Name | Type | Description |
---|---|---|
dlp_id | str | a unique plan id (auto-generated) |
desc | str | an optional user-friendly short description |
target_dataset_type | DatasetTypes | the type of dataset targeted by this DataLoadingPlan |
Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self, *args, **kwargs):
super(DataLoadingPlan, self).__init__(*args, **kwargs)
self.dlp_id = 'dlp_' + str(uuid.uuid4())
self.desc = ""
self.target_dataset_type = DatasetTypes.NONE
self._serialization_validation = SerializationValidation()
self._serialization_validation.update_validation_scheme(SerializationValidation.dlp_default_scheme())
Attributes
desc instance-attribute
desc = ''
dlp_id instance-attribute
dlp_id = 'dlp_' + str(uuid4())
target_dataset_type instance-attribute
target_dataset_type = NONE
Functions
deserialize
deserialize(serialized_dlp, serialized_loading_blocks)
Reconstruct the DataLoadingPlan from a serialized version.
Calling this function will clear the contained DataLoadingBlockTypes.
This function may not be used to "update" nor to "append to" a DataLoadingPlan.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
serialized_dlp | dict | a dictionary of data loading plan metadata, as obtained from the first output of the serialize function | required |
serialized_loading_blocks | List[dict] | a list of dictionaries of loading_block metadata, as obtained from the second output of the serialize function | required |
Returns: the self instance
Source code in fedbiomed/common/data/_data_loading_plan.py
def deserialize(self, serialized_dlp: dict, serialized_loading_blocks: List[dict]) -> TDataLoadingPlan:
"""Reconstruct the DataLoadingPlan][fedbiomed.common.data._data_loading_plan.DataLoadingPlan] from a serialized version.
!!! warning "Calling this function will *clear* the contained [DataLoadingBlockTypes]."
This function may not be used to "update" nor to "append to"
a [DataLoadingPlan][fedbiomed.common.data._data_loading_plan.DataLoadingPlan].
Args:
serialized_dlp: a dictionary of data loading plan metadata, as obtained from the first output of the
serialize function
serialized_loading_blocks: a list of dictionaries of loading_block metadata, as obtained from the
second output of the serialize function
Returns:
the self instance
"""
self._serialization_validation.validate(serialized_dlp, FedbiomedDataLoadingPlanValueError)
self.clear()
self.dlp_id = serialized_dlp['dlp_id']
self.desc = serialized_dlp['dlp_name']
self.target_dataset_type = DatasetTypes(serialized_dlp['target_dataset_type'])
for loading_block_key_str, dlb_id in serialized_dlp['loading_blocks'].items():
key_module, key_classname = serialized_dlp['key_paths'][loading_block_key_str]
loading_block_key = DataLoadingBlock.instantiate_key(key_module, key_classname, loading_block_key_str)
loading_block = next(filter(lambda x: x['dlb_id'] == dlb_id,
serialized_loading_blocks))
dlb = DataLoadingBlock.instantiate_class(loading_block)
self[loading_block_key] = dlb.deserialize(loading_block)
return self
infer_dataset_type staticmethod
infer_dataset_type(dataset)
Infer the type of a given dataset.
This function provides the mapping between a dataset's class and the DatasetTypes enum. If the dataset exposes the correct interface (i.e. the get_dataset_type method) then it directly calls that, otherwise it tries to apply some heuristics to guess the type of dataset.
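A short sketch of the heuristic, assuming torchvision is installed; the download path is illustrative, and DatasetTypes is assumed to be importable from fedbiomed.common.constants.
```python
from torchvision.datasets import MNIST

from fedbiomed.common.constants import DatasetTypes
from fedbiomed.common.data import DataLoadingPlan

mnist = MNIST(root='/tmp/mnist', download=True)
# MNIST exposes no get_dataset_type method, so the class-name heuristic is used.
assert DataLoadingPlan.infer_dataset_type(mnist) == DatasetTypes.DEFAULT
```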
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | Any | the dataset whose type we want to infer. | required |
Returns: a DatasetTypes enum element which identifies the type of the dataset.
Raises: FedbiomedDataLoadingPlanValueError: if the dataset does not have a get_dataset_type method and moreover the type could not be guessed.
Source code in fedbiomed/common/data/_data_loading_plan.py
@staticmethod
def infer_dataset_type(dataset: Any) -> DatasetTypes:
"""Infer the type of a given dataset.
This function provides the mapping between a dataset's class and the DatasetTypes enum. If the dataset exposes
the correct interface (i.e. the get_dataset_type method) then it directly calls that, otherwise it tries to
apply some heuristics to guess the type of dataset.
Args:
dataset: the dataset whose type we want to infer.
Returns:
a DatasetTypes enum element which identifies the type of the dataset.
Raises:
FedbiomedDataLoadingPlanValueError: if the dataset does not have a `get_dataset_type` method and moreover
the type could not be guessed.
"""
if hasattr(dataset, 'get_dataset_type'):
return dataset.get_dataset_type()
elif dataset.__class__.__name__ == 'ImageFolder':
# ImageFolder could be both an images type or mednist. Try to identify mednist with some heuristic.
if hasattr(dataset, 'classes') and \
all([x in dataset.classes for x in ['AbdomenCT', 'BreastMRI', 'CXR', 'ChestCT', 'Hand', 'HeadCT']]):
return DatasetTypes.MEDNIST
else:
return DatasetTypes.IMAGES
elif dataset.__class__.__name__ == 'MNIST':
return DatasetTypes.DEFAULT
msg = f"{ErrorNumbers.FB615.value} Trying to infer dataset type of {dataset} is not supported " + \
f"for datasets of type {dataset.__class__.__qualname__}"
logger.debug(msg)
raise FedbiomedDataLoadingPlanValueError(msg)
serialize
serialize()
Serializes the class in a format similar to json.
Returns:
Type | Description |
---|---|
Tuple[dict, List] | a tuple sufficient for reconstructing the DataLoadingPlan. It includes: a dictionary of key-value pairs with the DataLoadingPlan parameters, and a list of dicts containing the data for reconstructing all the DataLoadingBlocks of the DataLoadingPlan. |
Source code in fedbiomed/common/data/_data_loading_plan.py
def serialize(self) -> Tuple[dict, List]:
"""Serializes the class in a format similar to json.
Returns:
a tuple sufficient for reconstructing the DataLoading plan. It includes:
- a dictionary of key-value pairs with the
[DataLoadingPlan][fedbiomed.common.data._data_loading_plan.DataLoadingPlan] parameters.
- a list of dict containing the data for reconstruction all the DataLoadingBlock
of the [DataLoadingPlan][fedbiomed.common.data._data_loading_plan.DataLoadingPlan]
"""
return dict(
dlp_id=self.dlp_id,
dlp_name=self.desc,
target_dataset_type=self.target_dataset_type.value,
loading_blocks={key.value: dlb.get_serialization_id() for key, dlb in self.items()},
key_paths={key.value: (f"{key.__module__}", f"{key.__class__.__qualname__}") for key in self.keys()}
), [dlb.serialize() for dlb in self.values()]
DataLoadingPlanMixin
DataLoadingPlanMixin()
Utility class to enable DLP functionality in a dataset.
Any Dataset class that inherits from DataLoadingPlanMixin will have the basic tools necessary to support a DataLoadingPlan. Typically, the logic of each specific DataLoadingBlock in the DataLoadingPlan will be implemented in the form of hooks that are called within the Dataset's implementation using the helper function apply_dlb defined below.
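A minimal sketch of a Dataset using the mixin. MyDataset is hypothetical, and MyLoadingBlockTypes refers to the hypothetical key enum from the DataLoadingPlan example above; the loading block is invoked through apply_dlb so that samples pass through unchanged when no DataLoadingPlan is set.
```python
from torch.utils.data import Dataset

from fedbiomed.common.data import DataLoadingPlanMixin


class MyDataset(DataLoadingPlanMixin, Dataset):
    def __init__(self, samples):
        super().__init__()          # initializes self._dlp to None via the mixin
        self._samples = samples

    def __len__(self):
        return len(self._samples)

    def __getitem__(self, idx):
        name = self._samples[idx]
        # Returns the block's output, or `name` unchanged when no DLP is set
        # or the key is absent. MyLoadingBlockTypes is the hypothetical enum above.
        return self.apply_dlb(name, MyLoadingBlockTypes.MODALITIES_TO_FOLDERS, name)
```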
Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self):
self._dlp = None
Functions
apply_dlb
apply_dlb(default_ret_value, dlb_key, *args, **kwargs)
Apply one DataLoadingBlock identified by its key.
Note that we want to easily support the case where the DataLoadingPlan is not activated, or the requested loading block is not contained in the DataLoadingPlan. This is achieved by providing a default return value to be returned when the above conditions are met. Hence, most of the calls to apply_dlb will look like this:
value = self.apply_dlb(value, 'my-loading-block', my_apply_args)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
default_ret_value | Any | the value to be returned in case that the dlp functionality is not required | required |
dlb_key | DataLoadingBlockTypes | the key of the DataLoadingBlock to be applied | required |
*args | Optional[Any] | forwarded to the DataLoadingBlock's apply function | () |
**kwargs | Optional[Any] | forwarded to the DataLoadingBlock's apply function | {} |
Returns: the output of the DataLoadingBlock's apply function, or the default_ret_value when dlp is None or it does not contain the requested loading block
Source code in fedbiomed/common/data/_data_loading_plan.py
def apply_dlb(self, default_ret_value: Any, dlb_key: DataLoadingBlockTypes,
*args: Optional[Any], **kwargs: Optional[Any]) -> Any:
"""Apply one DataLoadingBlock identified by its key.
Note that we want to easily support the case where the DataLoadingPlan
is not activated, or the requested loading block is not contained in the
DataLoadingPlan. This is achieved by providing a default return value
to be returned when the above conditions are met. Hence, most of the
calls to apply_dlb will look like this:
```
value = self.apply_dlb(value, 'my-loading-block', my_apply_args)
```
This will ensure that value is not changed if the DataLoadingPlan is
not active.
Args:
default_ret_value: the value to be returned in case that the dlp
functionality is not required
dlb_key: the key of the DataLoadingBlock to be applied
*args: forwarded to the DataLoadingBlock's apply function
**kwargs: forwarded to the DataLoadingBlock's apply function
Returns:
the output of the DataLoadingBlock's apply function, or
the default_ret_value when dlp is None or it does not contain
the requested loading block
"""
if not isinstance(dlb_key, DataLoadingBlockTypes):
raise FedbiomedDataLoadingPlanValueError(f"Key {dlb_key} is not of enum type DataLoadingBlockTypes"
f" in DataLoadingPlanMixin.apply_dlb")
if self._dlp is not None and dlb_key in self._dlp:
return self._dlp[dlb_key].apply(*args, **kwargs)
else:
return default_ret_value
clear_dlp
clear_dlp()
Clears the Data Loading Plan by resetting it to None.
Source code in fedbiomed/common/data/_data_loading_plan.py
def clear_dlp(self):
self._dlp = None
set_dlp
set_dlp(dlp)
Sets the dlp if the target dataset type is appropriate
Source code in fedbiomed/common/data/_data_loading_plan.py
def set_dlp(self, dlp: DataLoadingPlan):
"""Sets the dlp if the target dataset type is appropriate"""
if not isinstance(dlp, DataLoadingPlan):
msg = f"{ErrorNumbers.FB615.value} Trying to set a DataLoadingPlan but the argument is of type " + \
f"{type(dlp).__name__}"
logger.debug(msg)
raise FedbiomedDataLoadingPlanValueError(msg)
dataset_type = DataLoadingPlan.infer_dataset_type(self) # `self` here will refer to the Dataset instance
if dlp.target_dataset_type != DatasetTypes.NONE and dataset_type != dlp.target_dataset_type:
raise FedbiomedDataLoadingPlanValueError(f"Trying to set {dlp} on dataset of type {dataset_type.value} but "
f"the target type is {dlp.target_dataset_type}")
elif dlp.target_dataset_type == DatasetTypes.NONE:
dlp.target_dataset_type = dataset_type
self._dlp = dlp
DataManager
DataManager(dataset, target=None, **kwargs)
Bases: object
Factory class that builds different data loaders/datasets based on the type of the dataset. The argument dataset should be provided as a torch.utils.data.Dataset object to be used in PyTorch training.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | Union[ndarray, DataFrame, Series, Dataset] | Dataset object. It can be an instance, PyTorch Dataset or Tuple. | required |
target | Union[ndarray, DataFrame, Series] | Target variable or variables. | None |
**kwargs | dict | Additional parameters that are going to be used for data loader | {} |
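Two hedged construction sketches, one per supported input style; the array shapes and the loader arguments (batch_size, shuffle) are illustrative.
```python
import numpy as np

from fedbiomed.common.data import DataManager

# Tabular data: array-like inputs and target, plus keyword loader arguments.
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)
tabular_dm = DataManager(dataset=X, target=y, batch_size=16, shuffle=True)

# Image data: a torch.utils.data.Dataset instance, no separate target.
# image_dm = DataManager(dataset=my_torch_dataset, batch_size=16)
```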
Source code in fedbiomed/common/data/_data_manager.py
def __init__(self,
dataset: Union[np.ndarray, pd.DataFrame, pd.Series, Dataset],
target: Union[np.ndarray, pd.DataFrame, pd.Series] = None,
**kwargs: dict) -> None:
"""Constructor of DataManager,
Args:
dataset: Dataset object. It can be an instance, PyTorch Dataset or Tuple.
target: Target variable or variables.
**kwargs: Additional parameters that are going to be used for data loader
"""
# TODO: Improve datamanager for auto loading by given dataset_path and other information
# such as inputs variable indexes and target variables indexes
self._dataset = dataset
self._target = target
self._loader_arguments: Dict = kwargs
self._data_manager_instance = None
Functions
extend_loader_args
extend_loader_args(extension)
Extends the class's loader arguments
Extends the class's _loader_arguments attribute with additional key-value pairs from the extension argument. If a key already exists in _loader_arguments, it is not replaced.
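A small sketch of the non-overriding semantics (the values are illustrative):
```python
import numpy as np

from fedbiomed.common.data import DataManager

dm = DataManager(dataset=np.random.rand(10, 2), target=np.zeros(10), batch_size=16)
dm.extend_loader_args({'batch_size': 64, 'shuffle': True})
# 'batch_size' stays at 16 because it already existed; 'shuffle' is added.
```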
Parameters:
Name | Type | Description | Default |
---|---|---|---|
extension | Dict | the mapping used to extend the loader arguments | required |
Source code in fedbiomed/common/data/_data_manager.py
def extend_loader_args(self, extension: Dict):
"""Extends the class' loader arguments
Extends the class's `_loader_arguments` attribute with additional key-values from
the `extension` argument. If a key already exists in the `_loader_arguments`, then
it is not replaced.
Args:
extension: the mapping used to extend the loader arguments
"""
self._loader_arguments.update(
{key: value for key, value in extension.items() if key not in self._loader_arguments}
)
load
load(tp_type)
Loads the proper DataManager based on the given TrainingPlan type and the dataset and target attributes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tp_type | TrainingPlans | Enumeration instance of TrainingPlans that stands for type of training plan. | required |
Raises:
Type | Description |
---|---|
FedbiomedDataManagerError | If requested DataManager does not match with given arguments. |
Source code in fedbiomed/common/data/_data_manager.py
def load(self, tp_type: TrainingPlans):
"""Loads proper DataManager based on given TrainingPlan and
`dataset`, `target` attributes.
Args:
tp_type: Enumeration instance of TrainingPlans that stands for type of training plan.
Raises:
FedbiomedDataManagerError: If requested DataManager does not match with given arguments.
"""
# Training plan is type of TorcTrainingPlan
if tp_type == TrainingPlans.TorchTrainingPlan:
if self._target is None and isinstance(self._dataset, Dataset):
# Create Dataset for pytorch
self._data_manager_instance = TorchDataManager(dataset=self._dataset, **self._loader_arguments)
elif isinstance(self._dataset, (pd.DataFrame, pd.Series, np.ndarray)) and \
isinstance(self._target, (pd.DataFrame, pd.Series, np.ndarray)):
# If `dataset` and `target` attributes are array-like object
# create TabularDataset object to instantiate a TorchDataManager
torch_dataset = TabularDataset(inputs=self._dataset, target=self._target)
self._data_manager_instance = TorchDataManager(dataset=torch_dataset, **self._loader_arguments)
else:
raise FedbiomedDataManagerError(f"{ErrorNumbers.FB607.value}: Invalid arguments for torch based "
f"training plan, either provide the argument `dataset` as PyTorch "
f"Dataset instance, or provide `dataset` and `target` arguments as "
f"an instance one of pd.DataFrame, pd.Series or np.ndarray ")
elif tp_type == TrainingPlans.SkLearnTrainingPlan:
# Try to convert `torch.utils.Data.Dataset` to SkLearnBased dataset/datamanager
if self._target is None and isinstance(self._dataset, Dataset):
torch_data_manager = TorchDataManager(dataset=self._dataset)
try:
self._data_manager_instance = torch_data_manager.to_sklearn()
except Exception as e:
raise FedbiomedDataManagerError(f"{ErrorNumbers.FB607.value}: PyTorch based `Dataset` object "
"has been instantiated with DataManager. An error occurred while"
"trying to convert torch.utils.data.Dataset to numpy based "
f"dataset: {str(e)}")
# For scikit-learn based training plans, the arguments `dataset` and `target` should be an instance
# one of `pd.DataFrame`, `pd.Series`, `np.ndarray`
elif isinstance(self._dataset, (pd.DataFrame, pd.Series, np.ndarray)) and \
isinstance(self._target, (pd.DataFrame, pd.Series, np.ndarray)):
# Create Dataset for SkLearn training plans
self._data_manager_instance = SkLearnDataManager(inputs=self._dataset, target=self._target,
**self._loader_arguments)
else:
raise FedbiomedDataManagerError(f"{ErrorNumbers.FB607.value}: The argument `dataset` and `target` "
f"should be instance of pd.DataFrame, pd.Series or np.ndarray ")
else:
raise FedbiomedDataManagerError(f"{ErrorNumbers.FB607.value}: Undefined training plan")
FlambyDataset
FlambyDataset()
Bases: DataLoadingPlanMixin, Dataset
A federated Flamby dataset.
A FlambyDataset is a wrapper around a flamby FedClass instance, adding functionalities and interfaces that are specific to Fed-BioMed.
A FlambyDataset is always created in an empty state, and it requires a DataLoadingPlan to be finalized to a correct state. The DataLoadingPlan must contain at least the following DataLoadingBlock key-value pair:
- FlambyLoadingBlockTypes.FLAMBY_DATASET_METADATA: FlambyDatasetMetadataBlock
The lifecycle of the DataLoadingPlan and the wrapped FedClass are tightly interlinked: when the DataLoadingPlan is set, the wrapped FedClass is initialized and instantiated. When the DataLoadingPlan is cleared, the wrapped FedClass is also cleared. Hence, an invariant of this class is that the self._dlp and self.__flamby_fed_class should always be either both None, or both set to some value.
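A hedged sketch of this lifecycle. The dataset name 'fed_ixi' and center id 0 are illustrative and require the corresponding flamby dataset to be installed locally; FlambyDatasetMetadataBlock is assumed to be exposed by fedbiomed.common.data (it is defined in fedbiomed/common/data/_flamby_dataset.py).
```python
from fedbiomed.common.constants import FlambyLoadingBlockTypes
from fedbiomed.common.data import DataLoadingPlan, FlambyDataset, FlambyDatasetMetadataBlock

metadata_block = FlambyDatasetMetadataBlock()
metadata_block.metadata = {'flamby_dataset_name': 'fed_ixi',   # illustrative values
                           'flamby_center_id': 0}

dlp = DataLoadingPlan({FlambyLoadingBlockTypes.FLAMBY_DATASET_METADATA: metadata_block})

dataset = FlambyDataset()       # created in an empty state
dataset.set_dlp(dlp)            # initializes and instantiates the wrapped FedClass
center = dataset.get_center_id()
```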
Attributes:
Name | Type | Description |
---|---|---|
_transform | | a transform function of type MonaiTransform or TorchTransform that will be applied to every sample when data is loaded. |
__flamby_fed_class | | a private instance of the wrapped Flamby FedClass |
Source code in fedbiomed/common/data/_flamby_dataset.py
def __init__(self):
super().__init__()
self.__flamby_fed_class = None
self._transform = None
Functions
clear_dlp
clear_dlp()
Clears dlp and automatically clears the FedClass
Tries to guarantee some semblance of integrity by also clearing the FedClass, since setting the dlp initializes it.
Source code in fedbiomed/common/data/_flamby_dataset.py
def clear_dlp(self):
"""Clears dlp and automatically clears the FedClass
Tries to guarantee some semblance of integrity by also clearing the FedClass, since setting the dlp
initializes it.
"""
super().clear_dlp()
self._clear()
get_center_id
get_center_id()
Returns the center id. Requires that the DataLoadingPlan has already been set.
Returns:
Type | Description |
---|---|
int | the center id (int). |
Raises: FedbiomedDatasetError: in one of the two scenarios below:
- if the data loading plan is not set or is malformed.
- if the wrapped FedClass is not initialized but the dlp exists.
Source code in fedbiomed/common/data/_flamby_dataset.py
@_check_fed_class_initialization_status(require_initialized=True,
require_uninitialized=False,
message="Flamby dataset is in an inconsistent state: a Data Loading Plan "
"is set but the wrapped FedClass was not initialized.")
@_requires_dlp
def get_center_id(self) -> int:
"""Returns the center id. Requires that the DataLoadingPlan has already been set.
Returns:
the center id (int).
Raises:
FedbiomedDatasetError: in one of the two scenarios below
- if the data loading plan is not set or is malformed.
- if the wrapped FedClass is not initialized but the dlp exists
"""
return self.apply_dlb(None, FlambyLoadingBlockTypes.FLAMBY_DATASET_METADATA)['flamby_center_id']
get_dataset_type staticmethod
get_dataset_type()
Returns the Flamby DatasetType
Source code in fedbiomed/common/data/_flamby_dataset.py
@staticmethod
def get_dataset_type() -> DatasetTypes:
"""Returns the Flamby DatasetType"""
return DatasetTypes.FLAMBY
get_flamby_fed_class
get_flamby_fed_class()
Returns the instance of the wrapped Flamby FedClass
Source code in fedbiomed/common/data/_flamby_dataset.py
def get_flamby_fed_class(self):
"""Returns the instance of the wrapped Flamby FedClass"""
return self.__flamby_fed_class
get_transform
get_transform()
Gets the transform attribute
Source code in fedbiomed/common/data/_flamby_dataset.py
def get_transform(self):
"""Gets the transform attribute"""
return self._transform
init_transform
init_transform(transform)
Initializes the transform attribute. Must be called before initialization of the wrapped FedClass.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
transform | Union[Compose, Compose] | a composed transform of type torchvision.transforms.Compose or monai.transforms.Compose | required |
Raises:
Type | Description |
---|---|
FedbiomedDatasetError | if the wrapped FedClass was already initialized. |
FedbiomedDatasetValueError | if the input is not of the correct type. |
Source code in fedbiomed/common/data/_flamby_dataset.py
@_check_fed_class_initialization_status(require_initialized=False,
require_uninitialized=True,
message="Calling init_transform is not allowed if the wrapped FedClass "
"has already been initialized. At your own risk, you may call "
"clear_dlp to reset the full FlambyDataset")
def init_transform(self, transform: Union[MonaiCompose, TorchCompose]) -> Union[MonaiCompose, TorchCompose]:
"""Initializes the transform attribute. Must be called before initialization of the wrapped FedClass.
Arguments:
transform: a composed transform of type torchvision.transforms.Compose or monai.transforms.Compose
Raises:
FedbiomedDatasetError: if the wrapped FedClass was already initialized.
FedbiomedDatasetValueError: if the input is not of the correct type.
"""
if not isinstance(transform, (MonaiCompose, TorchCompose)):
msg = f"{ErrorNumbers.FB618.value}. FlambyDataset transform must be of type " \
f"torchvision.transforms.Compose or monai.transforms.Compose"
logger.critical(msg)
raise FedbiomedDatasetValueError(msg)
self._transform = transform
return self._transform
set_dlp
set_dlp(dlp)
Sets the Data Loading Plan and ensures that the flamby_fed_class is initialized
Overrides the set_dlp function from the DataLoadingPlanMixin to make sure that self._init_flamby_fed_class is also called immediately after.
Source code in fedbiomed/common/data/_flamby_dataset.py
def set_dlp(self, dlp):
"""Sets the Data Loading Plan and ensures that the flamby_fed_class is initialized
Overrides the set_dlp function from the DataLoadingPlanMixin to make sure that self._init_flamby_fed_class
is also called immediately after.
"""
super().set_dlp(dlp)
try:
self._init_flamby_fed_class()
except FedbiomedDatasetError as e:
# clean up
super().clear_dlp()
raise FedbiomedDatasetError from e
shape
shape()
Returns the shape of the flamby_fed_class
Source code in fedbiomed/common/data/_flamby_dataset.py
@_check_fed_class_initialization_status(require_initialized=True,
require_uninitialized=False,
message="Cannot compute shape because FedClass was not initialized.")
def shape(self) -> List[int]:
"""Returns the shape of the flamby_fed_class"""
return [len(self)] + list(self.__getitem__(0)[0].shape)
FlambyDatasetMetadataBlock
FlambyDatasetMetadataBlock()
Bases: DataLoadingBlock
Metadata about a Flamby Dataset.
Includes information on:
- the identity of the type of flamby dataset (e.g. fed_ixi, fed_heart, etc.)
- the ID of the center of the flamby dataset
Source code in fedbiomed/common/data/_flamby_dataset.py
def __init__(self):
super().__init__()
self.metadata = {
"flamby_dataset_name": None,
"flamby_center_id": None
}
self._serialization_validator.update_validation_scheme(
FlambyDatasetMetadataBlock._extra_validation_scheme())
Attributes
metadata instance-attribute
metadata = {'flamby_dataset_name': None, 'flamby_center_id': None}
Functions
apply
apply()
Returns a dictionary of dataset metadata.
The metadata dictionary contains:
- flamby_dataset_name: (str) the name of the selected flamby dataset.
- flamby_center_id: (int) the center id selected at dataset add time.
Note that the flamby_dataset_name will be the same as the module name required to instantiate the FedClass. However, it will not contain the full module path, hence to properly import this module it must be prepended with flamby.datasets, for example import flamby.datasets.flamby_dataset_name.
Returns:
Type | Description |
---|---|
dict | this data loading block's metadata |
Source code in fedbiomed/common/data/_flamby_dataset.py
def apply(self) -> dict:
"""Returns a dictionary of dataset metadata.
The metadata dictionary contains:
- flamby_dataset_name: (str) the name of the selected flamby dataset.
- flamby_center_id: (int) the center id selected at dataset add time.
Note that the flamby_dataset_name will be the same as the module name required to instantiate the FedClass.
However, it will not contain the full module path, hence to properly import this module it must be
prepended with `flamby.datasets`, for example `import flamby.datasets.flamby_dataset_name`
Returns:
this data loading block's metadata
"""
if any([v is None for v in self.metadata.values()]):
msg = f"{ErrorNumbers.FB316}. Attempting to read Flamby dataset metadata, but " \
f"the {[k for k,v in self.metadata.items() if v is None]} keys were not previously set."
logger.critical(msg)
raise FedbiomedLoadingBlockError(msg)
return self.metadata
deserialize
deserialize(load_from)
Reconstruct the DataLoadingBlock from a serialized version.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
load_from | dict | a dictionary as obtained by the serialize function. | required |
Returns: the self instance
Source code in fedbiomed/common/data/_flamby_dataset.py
def deserialize(self, load_from: dict) -> DataLoadingBlock:
"""Reconstruct the DataLoadingBlock from a serialized version.
Args:
load_from: a dictionary as obtained by the serialize function.
Returns:
the self instance
"""
super().deserialize(load_from)
self.metadata['flamby_dataset_name'] = load_from['flamby_dataset_name']
self.metadata['flamby_center_id'] = load_from['flamby_center_id']
return self
serialize
serialize()
Serializes the class in a format similar to json.
Returns:
Type | Description |
---|---|
dict | a dictionary of key-value pairs sufficient for reconstructing the DataLoadingBlock. |
Source code in fedbiomed/common/data/_flamby_dataset.py
def serialize(self) -> dict:
"""Serializes the class in a format similar to json.
Returns:
a dictionary of key-value pairs sufficient for reconstructing
the DataLoadingBlock.
"""
ret = super().serialize()
ret.update({'flamby_dataset_name': self.metadata['flamby_dataset_name'],
'flamby_center_id': self.metadata['flamby_center_id']
})
return ret
FlambyLoadingBlockTypes
FlambyLoadingBlockTypes(*args)
Bases: DataLoadingBlockTypes, Enum
Additional DataLoadingBlockTypes specific to Flamby data
Source code in fedbiomed/common/constants.py
def __init__(self, *args):
cls = self.__class__
if not isinstance(self.value, str):
raise ValueError("all fields of DataLoadingBlockTypes subclasses"
" must be of str type")
if any(self.value == e.value for e in cls):
a = self.name
e = cls(self.value).name
raise ValueError(
f"duplicate values not allowed in DataLoadingBlockTypes and "
f"its subclasses: {a} --> {e}")
Attributes
FLAMBY_DATASET_METADATA class-attribute
instance-attribute
FLAMBY_DATASET_METADATA = 'flamby_dataset_metadata'
MapperBlock
MapperBlock()
Bases: DataLoadingBlock
A DataLoadingBlock for mapping values.
This DataLoadingBlock can be used whenever an "indirect mapping" is needed. For example, it can be used to implement a correspondence between a set of "logical" abstract names and a set of folder names on the filesystem.
The apply function of this DataLoadingBlock takes a "key" as input (a str) and returns the mapped value corresponding to map[key]. Note that while the constructor of this class sets a value for type_id, developers are recommended to set a more meaningful value that better speaks to their application.
Multiple instances of this loading_block may be used in the same DataLoadingPlan, provided that they are given different type_id via the constructor.
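A small sketch of the "indirect mapping" use case described above; the folder names are illustrative.
```python
from fedbiomed.common.data import MapperBlock

modalities_to_folders = MapperBlock()
modalities_to_folders.map = {'T1': 'T1_weighted_images', 'label': 'segmentations'}

modalities_to_folders.apply('T1')      # returns 'T1_weighted_images'
# An unknown key raises FedbiomedLoadingBlockError.
```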
Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self):
super(MapperBlock, self).__init__()
self.map = {}
self._serialization_validator.update_validation_scheme(MapperBlock._extra_validation_scheme())
Attributes
map instance-attribute
map = {}
Functions
apply
apply(key)
Returns the value mapped to the key, if it exists.
Raises:
Type | Description |
---|---|
FedbiomedLoadingBlockError | if map is not a dict or the key does not exist. |
Source code in fedbiomed/common/data/_data_loading_plan.py
def apply(self, key):
"""Returns the value mapped to the key, if it exists.
Raises:
FedbiomedLoadingBlockError: if map is not a dict or the key does not exist.
"""
if not isinstance(self.map, dict) or key not in self.map:
msg = f"{ErrorNumbers.FB614.value} Mapper block error: no key '{key}' in mapping dictionary"
logger.debug(msg)
raise FedbiomedLoadingBlockError(msg)
return self.map[key]
deserialize
deserialize(load_from)
Reconstruct the DataLoadingBlock from a serialized version.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
load_from | dict | a dictionary as obtained by the serialize function. | required |
Returns: the self instance
Source code in fedbiomed/common/data/_data_loading_plan.py
def deserialize(self, load_from: dict) -> DataLoadingBlock:
"""Reconstruct the [DataLoadingBlock][fedbiomed.common.data._data_loading_plan.DataLoadingBlock]
from a serialized version.
Args:
load_from (dict): a dictionary as obtained by the serialize function.
Returns:
the self instance
"""
super(MapperBlock, self).deserialize(load_from)
self.map = load_from['map']
return self
serialize
serialize()
Serializes the class in a format similar to json.
Returns:
Type | Description |
---|---|
dict | a dictionary of key-value pairs sufficient for reconstructing the DataLoadingBlock. |
Source code in fedbiomed/common/data/_data_loading_plan.py
def serialize(self) -> dict:
"""Serializes the class in a format similar to json.
Returns:
a dictionary of key-value pairs sufficient for reconstructing
the [DataLoadingBlock][fedbiomed.common.data._data_loading_plan.DataLoadingBlock].
"""
ret = super(MapperBlock, self).serialize()
ret.update({'map': self.map})
return ret
MedicalFolderBase
MedicalFolderBase(root=None)
Bases: DataLoadingPlanMixin
Controller class for Medical Folder dataset.
Contains methods to validate the MedicalFolder folder hierarchy and extract folder-based metadata information such as modalities, number of subjects, etc.
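A hedged inspection sketch; the root path is illustrative and must follow the root/<subjects>/<modalities> structure.
```python
from fedbiomed.common.data import MedicalFolderBase

base = MedicalFolderBase(root='/data/medical_folder')      # illustrative path

subjects = base.subjects_with_imaging_data_folders()
modalities, modality_folders = base.modalities()
complete = base.complete_subjects(subjects, modalities)
```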
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root | Union[str, Path, None] | path to Medical Folder root folder. | None |
Source code in fedbiomed/common/data/_medical_datasets.py
def __init__(self, root: Union[str, Path, None] = None):
"""Constructs MedicalFolderBase
Args:
root: path to Medical Folder root folder.
"""
super(MedicalFolderBase, self).__init__()
if root is not None:
root = self.validate_MedicalFolder_root_folder(root)
self._root = root
Attributes
default_modality_names class-attribute
instance-attribute
default_modality_names = ['T1', 'T2', 'label']
root property
writable
root
Root property of MedicalFolderController
Functions
available_subjects
available_subjects(subjects_from_index, subjects_from_folder=None)
Checks missing subject folders and missing entries in demographics
Parameters:
Name | Type | Description | Default |
---|---|---|---|
subjects_from_index | Union[list, Series] | Given subject folder names in demographics | required |
subjects_from_folder | list | List of subject folder names to get intersection of given subject_from_index | None |
Returns:
Name | Type | Description |
---|---|---|
available_subjects | list[str] | subjects that have an imaging data folder and are also present in the demographics file |
missing_subject_folders | list[str] | subjects that are in the demographics file but do not have an imaging data folder |
missing_entries | list[str] | subjects that have an imaging data folder but are not present in the demographics file |
Source code in fedbiomed/common/data/_medical_datasets.py
def available_subjects(self,
subjects_from_index: Union[list, pd.Series],
subjects_from_folder: list = None) -> tuple[list[str], list[str], list[str]]:
"""Checks missing subject folders and missing entries in demographics
Args:
subjects_from_index: Given subject folder names in demographics
subjects_from_folder: List of subject folder names to get intersection of given subject_from_index
Returns:
available_subjects: subjects that have an imaging data folder and are also present in the demographics file
missing_subject_folders: subjects that are in the demographics file but do not have an imaging data folder
missing_entries: subjects that have an imaging data folder but are not present in the demographics file
"""
# Select all subject folders if it is not given
if subjects_from_folder is None:
subjects_from_folder = self.subjects_with_imaging_data_folders()
# Missing subject that will cause warnings
missing_subject_folders = list(set(subjects_from_index) - set(subjects_from_folder))
# Missing entries that will cause errors
missing_entries = list(set(subjects_from_folder) - set(subjects_from_index))
# Intersection
available_subjects = list(set(subjects_from_index).intersection(set(subjects_from_folder)))
return available_subjects, missing_subject_folders, missing_entries
complete_subjects
complete_subjects(subjects, modalities)
Retrieves subjects that have given all the modalities.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
subjects | List[str] | List of subject folder names | required |
modalities | List[str] | List of required modalities | required |
Returns:
Type | Description |
---|---|
List[str] | List of subject folder names that have required modalities |
Source code in fedbiomed/common/data/_medical_datasets.py
def complete_subjects(self, subjects: List[str], modalities: List[str]) -> List[str]:
"""Retrieves subjects that have given all the modalities.
Args:
subjects: List of subject folder names
modalities: List of required modalities
Returns:
List of subject folder names that have required modalities
"""
return [subject for subject in subjects if all(self.is_modalities_existing(subject, modalities))]
demographics_column_names staticmethod
demographics_column_names(path)
Source code in fedbiomed/common/data/_medical_datasets.py
@staticmethod
def demographics_column_names(path: Union[str, Path]):
return MedicalFolderBase.read_demographics(path).columns.values
get_dataset_type staticmethod
get_dataset_type()
Source code in fedbiomed/common/data/_medical_datasets.py
@staticmethod
def get_dataset_type() -> DatasetTypes:
return DatasetTypes.MEDICAL_FOLDER
is_modalities_existing
is_modalities_existing(subject, modalities)
Checks whether given modalities exist in the subject directory
Parameters:
Name | Type | Description | Default |
---|---|---|---|
subject | str | Subject ID or subject folder name | required |
modalities | List[str] | List of modalities to check | required |
Returns:
Type | Description |
---|---|
List[bool] | List of bool values indicating, for each modality, whether it exists in the subject directory. |
Raises:
Type | Description |
---|---|
FedbiomedDatasetError | bad argument type |
Source code in fedbiomed/common/data/_medical_datasets.py
def is_modalities_existing(self, subject: str, modalities: List[str]) -> List[bool]:
"""Checks whether given modalities exists in the subject directory
Args:
subject: Subject ID or subject folder name
modalities: List of modalities to check
Returns:
List of `bool` that represents whether modality is existing respectively for each of modality.
Raises:
FedbiomedDatasetError: bad argument type
"""
if not isinstance(subject, str):
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Expected string for subject folder/ID, "
f"but got {type(subject)}")
if not isinstance(modalities, list):
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Expected a list for modalities, "
f"but got {type(modalities)}")
if not all([type(m) is str for m in modalities]):
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Expected a list of string for modalities, "
f"but some modalities are "
f"{' '.join([ str(type(m) for m in modalities if type(m) != str)])}")
are_modalities_existing = list()
for modality in modalities:
modality_folder = self._subject_modality_folder(subject, modality)
are_modalities_existing.append(bool(modality_folder) and
self._root.joinpath(subject, modality_folder).is_dir())
return are_modalities_existing
modalities
modalities()
Gets all modalities based either on all possible candidates or those provided by the DataLoadingPlan.
Returns:
Type | Description |
---|---|
list | List of unique available modalities |
list | List of all encountered modality folders in each subject folder, appearing once per folder |
Source code in fedbiomed/common/data/_medical_datasets.py
def modalities(self) -> Tuple[list, list]:
"""Gets all modalities based either on all possible candidates or those provided by the DataLoadingPlan.
Returns:
List of unique available modalities
List of all encountered modality folders in each subject folder, appearing once per folder
"""
modality_candidates, modality_folders_list = self.modalities_candidates_from_subfolders()
if self._dlp is not None and MedicalFolderLoadingBlockTypes.MODALITIES_TO_FOLDERS in self._dlp:
modalities = list(self._dlp[MedicalFolderLoadingBlockTypes.MODALITIES_TO_FOLDERS].map.keys())
return modalities, modality_folders_list
else:
return modality_candidates, modality_folders_list
modalities_candidates_from_subfolders
modalities_candidates_from_subfolders()
Gets all possible modality folders under root directory
Returns:
Type | Description |
---|---|
list | List of unique available modality folders appearing at least once |
list | List of all encountered modality folders in each subject folder, appearing once per folder |
Source code in fedbiomed/common/data/_medical_datasets.py
def modalities_candidates_from_subfolders(self) -> Tuple[list, list]:
""" Gets all possible modality folders under root directory
Returns:
List of unique available modality folders appearing at least once
List of all encountered modality folders in each subject folder, appearing once per folder
"""
# Accept only folders that don't start with "." and "_"
modalities = [f.name for f in self._root.glob("*/*") if f.is_dir() and not f.name.startswith((".", "_"))]
return sorted(list(set(modalities))), modalities
read_demographics staticmethod
read_demographics(path, index_col=None)
Read demographics tabular file for Medical Folder dataset
Raises:
Type | Description |
---|---|
FedbiomedDatasetError | bad file format |
Source code in fedbiomed/common/data/_medical_datasets.py
@staticmethod
def read_demographics(path: Union[str, Path], index_col: Optional[int] = None):
""" Read demographics tabular file for Medical Folder dataset
Raises:
FedbiomedDatasetError: bad file format
"""
path = Path(path)
if not path.is_file() or path.suffix.lower() not in [".csv", ".tsv"]:
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Demographics should be CSV or TSV files")
return pd.read_csv(path, index_col=index_col, engine='python')
subjects_with_imaging_data_folders
subjects_with_imaging_data_folders()
Retrieves subject folder names under Medical Folder root directory.
Returns:
Type | Description |
---|---|
List[str] | subject folder names under Medical Folder root directory. |
Source code in fedbiomed/common/data/_medical_datasets.py
def subjects_with_imaging_data_folders(self) -> List[str]:
"""Retrieves subject folder names under Medical Folder root directory.
Returns:
subject folder names under Medical Folder root directory.
"""
return [f.name for f in self._root.iterdir() if f.is_dir() and not f.name.startswith(".")]
validate_MedicalFolder_root_folder staticmethod
validate_MedicalFolder_root_folder(path)
Validates Medical Folder root directory by checking folder structure
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | Union[str, Path] | path to root directory | required |
Returns:
Type | Description |
---|---|
Path | Path to root folder of Medical Folder dataset |
Raises:
Type | Description |
---|---|
FedbiomedDatasetError | If path is not an instance of str or pathlib.Path, or if path is not a directory. |
Source code in fedbiomed/common/data/_medical_datasets.py
@staticmethod
def validate_MedicalFolder_root_folder(path: Union[str, Path]) -> Path:
""" Validates Medical Folder root directory by checking folder structure
Args:
path: path to root directory
Returns:
Path to root folder of Medical Folder dataset
Raises:
FedbiomedDatasetError: - If path is not an instance of `str` or `pathlib.Path`
- If path is not a directory
"""
if not isinstance(path, (Path, str)):
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: The argument root should an instance of "
f"`Path` or `str`, but got {type(path)}")
if not isinstance(path, Path):
path = Path(path)
path = Path(path).expanduser().resolve()
if not path.exists():
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Folder or file {path} not found on system")
if not path.is_dir():
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Root for Medical Folder dataset "
f"should be a directory.")
directories = [f for f in path.iterdir() if f.is_dir()]
if len(directories) == 0:
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Root folder of Medical Folder should "
f"contain subject folders, but no sub folder has been found. ")
modalities = [f for f in path.glob("*/*") if f.is_dir()]
if len(modalities) == 0:
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value} Subject folders for Medical Folder should "
f"contain modalities as folders. Folder structure should be "
f"root/<subjects>/<modalities>")
return path
MedicalFolderController
MedicalFolderController(root=None)
Bases: MedicalFolderBase
Utility class to construct and verify Medical Folder datasets without knowledge of the experiment.
The purpose of this class is to enable key functionalities related to the MedicalFolderDataset at the time of dataset deployment, i.e. when the data is being added to the node's database.
Specifically, the MedicalFolderController class can be used to (see the sketch after this list):
- construct a MedicalFolderDataset with all available data modalities, without knowing which ones will be used as targets or features during an experiment
- validate that the proper folder structure has been respected by the data managers preparing the data
- identify which subjects have which modalities
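A hedged sketch of the controller at dataset-deployment time; the paths and the index_col value are illustrative.
```python
from fedbiomed.common.data import MedicalFolderController

controller = MedicalFolderController(root='/data/medical_folder')     # illustrative path

# Which modalities exist for each subject folder?
status = controller.subject_modality_status()

# Build the dataset with all available modalities and an optional demographics file.
dataset = controller.load_MedicalFolder(
    tabular_file='/data/medical_folder/demographics.csv',
    index_col='subject_id')
```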
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root | str | Folder path to dataset. Defaults to None. | None |
Source code in fedbiomed/common/data/_medical_datasets.py
def __init__(self, root: str = None):
"""Constructs MedicalFolderController
Args:
root: Folder path to dataset. Defaults to None.
"""
super(MedicalFolderController, self).__init__(root=root)
Functions
load_MedicalFolder
load_MedicalFolder(tabular_file=None, index_col=None)
Load Medical Folder dataset with given tabular_file and index_col
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tabular_file | Union[str, Path] | File path to demographics data set | None |
index_col | Union[str, int] | Column index that represents subject folder names | None |
Returns:
Type | Description |
---|---|
MedicalFolderDataset | MedicalFolderDataset object |
Raises:
Type | Description |
---|---|
FedbiomedDatasetError | If Medical Folder dataset is not successfully loaded |
Source code in fedbiomed/common/data/_medical_datasets.py
def load_MedicalFolder(self,
tabular_file: Union[str, Path] = None,
index_col: Union[str, int] = None) -> MedicalFolderDataset:
""" Load Medical Folder dataset with given tabular_file and index_col
Args:
tabular_file: File path to demographics data set
index_col: Column index that represents subject folder names
Returns:
MedicalFolderDataset object
Raises:
FedbiomedDatasetError: If Medical Folder dataset is not successfully loaded
"""
if self._root is None:
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Can not load Medical Folder dataset without "
f"declaring root directory. Please set root or build MedicalFolderController "
f"with by providing `root` argument use")
modalities, _ = self.modalities()
try:
dataset = MedicalFolderDataset(root=self._root,
tabular_file=tabular_file,
index_col=index_col,
data_modalities=modalities,
target_modalities=modalities)
except FedbiomedError as e:
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Can not create Medical Folder dataset. {e}")
if self._dlp is not None:
dataset.set_dlp(self._dlp)
return dataset
subject_modality_status
subject_modality_status(index=None)
Scans subjects and checks which modalities are existing for each subject
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | Union[List, Series] | Array-like index that comes from reference csv file of Medical Folder dataset. It represents subject folder names. Defaults to None. | None |
Returns: Modality status for each subject that indicates which modalities are available
Source code in fedbiomed/common/data/_medical_datasets.py
def subject_modality_status(self, index: Union[List, pd.Series] = None) -> Dict:
"""Scans subjects and checks which modalities are existing for each subject
Args:
index: Array-like index that comes from reference csv file of Medical Folder dataset. It represents subject
folder names. Defaults to None.
Returns:
Modality status for each subject that indicates which modalities are available
"""
modalities, _ = self.modalities()
subjects = self.subjects_with_imaging_data_folders()
modality_status = {"columns": [*modalities], "data": [], "index": []}
if index is not None:
_, missing_subjects, missing_entries = self.available_subjects(subjects_from_index=index)
modality_status["columns"].extend(["in_folder", "in_index"])
for subject in subjects:
modality_report = self.is_modalities_existing(subject, modalities)
status_list = [status for status in modality_report]
if index is not None:
status_list.append(False if subject in missing_subjects else True)
status_list.append(False if subject in missing_entries else True)
modality_status["data"].append(status_list)
modality_status["index"].append(subject)
return modality_status
MedicalFolderDataset
MedicalFolderDataset(root, data_modalities='T1', transform=None, target_modalities='label', target_transform=None, demographics_transform=None, tabular_file=None, index_col=None)
Bases: Dataset, MedicalFolderBase
Torch dataset following the Medical Folder Structure.
The Medical Folder structure is loosely inspired by the BIDS standard [1]. It should respect the following pattern:
└─ MedicalFolder_root/
└─ demographics.csv
└─ sub-01/
├─ T1/
│ └─ sub-01_xxx.nii.gz
└─ T2/
├─ sub-01_xxx.nii.gz
[1] https://bids.neuroimaging.io/
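A hedged construction sketch following the structure above; the paths, modality names and index column are illustrative, and items are assumed to follow the structure returned by get_nontransformed_item.
```python
from fedbiomed.common.data import MedicalFolderDataset

dataset = MedicalFolderDataset(
    root='/data/MedicalFolder_root',                            # illustrative path
    data_modalities=['T1', 'T2'],
    target_modalities='label',
    tabular_file='/data/MedicalFolder_root/demographics.csv',
    index_col='subject_id',
)

# Items are ((data, demographics), targets), with data/targets keyed by modality.
(data, demographics), targets = dataset[0]
```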
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root | Union[str, PathLike, Path] | Root folder containing all the subject directories. | required |
data_modalities | (str, Iterable) | Modality or modalities to be used as data sources. | 'T1' |
transform | Union[Callable, Dict[str, Callable]] | A function or dict of function transform(s) that preprocess each data source. | None |
target_modalities | Optional[Union[str, Iterable[str]]] | Modality or modalities to be used as target sources. | 'label' |
target_transform | Union[Callable, Dict[str, Callable]] | A function or dict of function transform(s) that preprocess each target source. | None |
demographics_transform | Optional[Callable] | TODO | None |
tabular_file | Union[str, PathLike, Path, None] | Path to a CSV or Excel file containing the demographic information from the patients. | None |
index_col | Union[int, str, None] | Column name in the tabular file containing the subject ids which must match the folder names. | None |
Source code in fedbiomed/common/data/_medical_datasets.py
def __init__(self,
root: Union[str, PathLike, Path],
data_modalities: Optional[Union[str, Iterable[str]]] = 'T1',
transform: Union[Callable, Dict[str, Callable]] = None,
target_modalities: Optional[Union[str, Iterable[str]]] = 'label',
target_transform: Union[Callable, Dict[str, Callable]] = None,
demographics_transform: Optional[Callable] = None,
tabular_file: Union[str, PathLike, Path, None] = None,
index_col: Union[int, str, None] = None,
):
"""Constructor for class `MedicalFolderDataset`.
Args:
root: Root folder containing all the subject directories.
data_modalities (str, Iterable): Modality or modalities to be used as data sources.
transform: A function or dict of function transform(s) that preprocess each data source.
target_modalities: (str, Iterable): Modality or modalities to be used as target sources.
target_transform: A function or dict of function transform(s) that preprocess each target source.
demographics_transform: TODO
tabular_file: Path to a CSV or Excel file containing the demographic information from the patients.
index_col: Column name in the tabular file containing the subject ids which mush match the folder names.
"""
super(MedicalFolderDataset, self).__init__(root=root)
self._tabular_file = tabular_file
self._index_col = index_col
self._data_modalities = [data_modalities] if isinstance(data_modalities, str) else data_modalities
self._target_modalities = [target_modalities] if isinstance(target_modalities, str) else target_modalities
self._transform = self._check_and_reformat_transforms(transform, data_modalities)
self._target_transform = self._check_and_reformat_transforms(target_transform, target_modalities)
self._demographics_transform = demographics_transform if demographics_transform is not None else lambda x: {}
# Image loader
self._reader = Compose([
LoadImage(ITKReader(), image_only=True),
ToTensor()
])
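As an illustrative sketch (not part of the library documentation), a dataset following the structure above might be instantiated as follows. The root path, demographics file and index column are hypothetical and must match your own Medical Folder layout.

from fedbiomed.common.data import MedicalFolderDataset

# Hypothetical paths and column name: adapt them to your own Medical Folder layout.
dataset = MedicalFolderDataset(
    root='/data/MedicalFolder_root',
    data_modalities=['T1', 'T2'],
    target_modalities='label',
    tabular_file='/data/MedicalFolder_root/demographics.csv',
    index_col='subject_id',
)

# Items are served as ((data, demographics), target), with data keyed by modality.
(data, demographics), target = dataset.get_nontransformed_item(0)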
Attributes
ALLOWED_EXTENSIONS class-attribute
instance-attribute
ALLOWED_EXTENSIONS = ['.nii', '.nii.gz']
demographics cached
property
demographics
Loads tabular data file (supports excel, csv, tsv and colon separated value files).
index_col property
writable
index_col
Getter/setter of the column containing folder's name (in the tabular file)
subjects_has_all_modalities property
subjects_has_all_modalities
Gets only the subjects that have all required modalities
subjects_registered_in_demographics cached
property
subjects_registered_in_demographics
Gets only the subjects that are present in the demographics file.
tabular_file property
writable
tabular_file
Functions
get_nontransformed_item
get_nontransformed_item(item)
Source code in fedbiomed/common/data/_medical_datasets.py
def get_nontransformed_item(self, item):
# For the first item retrieve complete subject folders
subjects = self.subject_folders()
if not subjects:
# case where subjects is an empty list (subject folders have not been found)
raise FedbiomedDatasetError(
f"{ErrorNumbers.FB613.value}: Cannot find complete subject folders with all the modalities")
# Get subject folder
subject_folder = subjects[item]
# Load data modalities
data = self.load_images(subject_folder, modalities=self._data_modalities)
# Load target modalities
targets = self.load_images(subject_folder, modalities=self._target_modalities)
# Demographics
demographics = self._get_from_demographics(subject_id=subject_folder.name)
return (data, demographics), targets
load_images
load_images(subject_folder, modalities)
Loads modality images in given subject folder
Parameters:
Name | Type | Description | Default |
---|---|---|---|
subject_folder | Path | Subject folder where modalities are stored | required |
modalities | list | List of available modalities | required |
Returns:
Type | Description |
---|---|
Dict[str, Tensor] | Subject image data as a dictionary where keys represent each modality. |
Source code in fedbiomed/common/data/_medical_datasets.py
def load_images(self, subject_folder: Path, modalities: list) -> Dict[str, torch.Tensor]:
"""Loads modality images in given subject folder
Args:
subject_folder: Subject folder where modalities are stored
modalities: List of available modalities
Returns:
Subject image data as a dictionary where keys represent each modality.
"""
subject_data = {}
for modality in modalities:
modality_folder = self._subject_modality_folder(subject_folder, modality)
image_folder = subject_folder.joinpath(modality_folder)
nii_files = [p.resolve() for p in image_folder.glob("**/*")
if ''.join(p.suffixes) in self.ALLOWED_EXTENSIONS]
# Load the first, we assume there is going to be a single image per modality for now.
img_path = nii_files[0]
img = self._reader(img_path)
subject_data[modality] = img
return subject_data
set_dataset_parameters
set_dataset_parameters(parameters)
Sets dataset parameters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
parameters | dict | Parameters to initialize | required |
Raises:
Type | Description |
---|---|
FedbiomedDatasetError | If given parameters are not of `dict` type |
Source code in fedbiomed/common/data/_medical_datasets.py
def set_dataset_parameters(self, parameters: dict):
"""Sets dataset parameters.
Args:
parameters: Parameters to initialize
Raises:
FedbiomedDatasetError: If given parameters are not of `dict` type
"""
if not isinstance(parameters, dict):
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Expected type for `parameters` is `dict`, "
f"but got {type(parameters)}")
for key, value in parameters.items():
if hasattr(self, key):
setattr(self, key, value)
else:
raise FedbiomedDatasetError(f"{ErrorNumbers.FB613.value}: Trying to set non existing attribute '{key}'")
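For instance, the writable tabular_file and index_col attributes documented above could be set after construction; the values below are hypothetical.

dataset.set_dataset_parameters({
    'tabular_file': '/data/MedicalFolder_root/demographics.csv',  # hypothetical path
    'index_col': 'subject_id',                                    # hypothetical column name
})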
shape
shape()
Retrieves shape information for modalities and demographics csv
Source code in fedbiomed/common/data/_medical_datasets.py
def shape(self) -> dict:
"""Retrieves shape information for modalities and demographics csv"""
# Get all modalities
data_modalities = list(set(self._data_modalities))
target_modalities = list(set(self._target_modalities))
modalities = list(set(self._data_modalities + self._target_modalities))
(image, _), targets = self.get_nontransformed_item(0)
result = {modality: list(image[modality].shape) for modality in data_modalities}
result.update({modality: list(targets[modality].shape) for modality in target_modalities})
num_modalities = len(modalities)
demographics_shape = self.demographics.shape if self.demographics is not None else None
result.update({"demographics": demographics_shape, "num_modalities": num_modalities})
return result
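For illustration only, the returned dictionary might look as follows for a dataset with T1 data and label targets; the actual image shapes and demographics size depend entirely on the data.

shapes = dataset.shape()
# Possible (hypothetical) result:
# {'T1': [256, 256, 150], 'label': [256, 256, 150],
#  'demographics': (10, 4), 'num_modalities': 2}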
subject_folders
subject_folders()
Retrieves the folder names of only those subjects that have all requested modalities
Returns:
Type | Description |
---|---|
List[Path] | List of subject directories that have all requested modalities |
Source code in fedbiomed/common/data/_medical_datasets.py
def subject_folders(self) -> List[Path]:
"""Retrieves the folder names of only those subjects that have all requested modalities
Returns:
List of subject directories that have all requested modalities
"""
# If demographics are present
if self._tabular_file and self._index_col is not None:
complete_subject_folders = self.subjects_registered_in_demographics
else:
complete_subject_folders = self.subjects_has_all_modalities
return [self._root.joinpath(folder) for folder in complete_subject_folders]
MedicalFolderLoadingBlockTypes
MedicalFolderLoadingBlockTypes(*args)
Bases: DataLoadingBlockTypes, Enum
Source code in fedbiomed/common/constants.py
def __init__(self, *args):
cls = self.__class__
if not isinstance(self.value, str):
raise ValueError("all fields of DataLoadingBlockTypes subclasses"
" must be of str type")
if any(self.value == e.value for e in cls):
a = self.name
e = cls(self.value).name
raise ValueError(
f"duplicate values not allowed in DataLoadingBlockTypes and "
f"its subclasses: {a} --> {e}")
Attributes
MODALITIES_TO_FOLDERS class-attribute
instance-attribute
MODALITIES_TO_FOLDERS = 'modalities_to_folders'
NIFTIFolderDataset
NIFTIFolderDataset(root, transform=None, target_transform=None)
Bases: Dataset
A Generic class for loading NIFTI Images using the folder structure as the target classes' labels.
Supported formats: NIFTI and compressed NIFTI files (.nii, .nii.gz).
This is a Dataset useful in classification tasks. Its usage is quite simple, similar to torchvision.datasets.ImageFolder. Images must be contained in first-level sub-folders (level 2+ sub-folders are ignored) that describe the target class they belong to (the target class label is the name of the folder).
nifti_dataset_root_folder
├── control_group
│ ├── subject_1.nii
│ └── subject_2.nii
│ └── ...
└── disease_group
├── subject_3.nii
└── subject_4.nii
└── ...
In this example, there are 4 samples (one from each *.nii file) and 2 target classes, with labels control_group and disease_group. subject_1.nii has class label control_group, subject_3.nii has class label disease_group, etc.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root | Union[str, PathLike, Path] | folder where the data is located. | required |
transform | Union[Callable, None] | transforms to be applied on data. | None |
target_transform | Union[Callable, None] | transforms to be applied on target indexes. | None |
Raises:
Type | Description |
---|---|
FedbiomedDatasetError | bad argument type |
FedbiomedDatasetError | bad root path |
Source code in fedbiomed/common/data/_medical_datasets.py
def __init__(self, root: Union[str, PathLike, Path],
transform: Union[Callable, None] = None,
target_transform: Union[Callable, None] = None
):
"""Constructor of the class
Args:
root: folder where the data is located.
transform: transforms to be applied on data.
target_transform: transforms to be applied on target indexes.
Raises:
FedbiomedDatasetError: bad argument type
FedbiomedDatasetError: bad root path
"""
# check parameters type
for tr, trname in ((transform, 'transform'), (target_transform, 'target_transform')):
if not callable(tr) and tr is not None:
raise FedbiomedDatasetError(f"{ErrorNumbers.FB612.value}: Parameter {trname} has incorrect "
f"type {type(tr)}, cannot create dataset.")
if not isinstance(root, str) and not isinstance(root, PathLike) and not isinstance(root, Path):
raise FedbiomedDatasetError(f"{ErrorNumbers.FB612.value}: Parameter `root` has incorrect type "
f"{type(root)}, cannot create dataset.")
# initialize object variables
self._files = []
self._class_labels = []
self._targets = []
try:
self._root_dir = Path(root).expanduser()
except RuntimeError as e:
raise FedbiomedDatasetError(
f"{ErrorNumbers.FB612.value}: Cannot expand path {root}, error message is: {e}")
self._transform = transform
self._target_transform = target_transform
self._reader = Compose([
LoadImage(ITKReader(), image_only=True),
ToTensor()
])
self._explore_root_folder()
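A minimal usage sketch, assuming a root folder organized as in the example above; the path is hypothetical and only the documented accessors are used.

from fedbiomed.common.data import NIFTIFolderDataset

dataset = NIFTIFolderDataset(root='/data/nifti_dataset_root_folder')  # hypothetical path

print(dataset.labels())      # e.g. ['control_group', 'disease_group']
print(len(dataset.files()))  # number of discovered .nii / .nii.gz samples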
Functions
files
files()
Retrieves the paths to the sample images.
Gives the same order as when retrieving the sample images (e.g. self.files[0] is the path to self.__getitem__[0]).
Returns:
Type | Description |
---|---|
List[Path] | List of the absolute paths to the sample images |
Source code in fedbiomed/common/data/_medical_datasets.py
def files(self) -> List[Path]:
"""Retrieves the paths to the sample images.
Gives the same order as when retrieving the sample images (e.g. `self.files[0]`
is the path to `self.__getitem__[0]`)
Returns:
List of the absolute paths to the sample images
"""
return self._files
labels
labels()
Retrieves the labels of the target classes.
Target label index is the index of the corresponding label in this list.
Returns:
Type | Description |
---|---|
List[str] | List of the labels of the target classes. |
Source code in fedbiomed/common/data/_medical_datasets.py
def labels(self) -> List[str]:
"""Retrieves the labels of the target classes.
Target label index is the index of the corresponding label in this list.
Returns:
List of the labels of the target classes.
"""
return self._class_labels
NPDataLoader
NPDataLoader(dataset, target, batch_size=1, shuffle=False, random_seed=None, drop_last=False)
DataLoader for a Numpy dataset.
This data loader encapsulates a dataset composed of numpy arrays and presents an Iterable interface. One design principle was to try to make the interface as similar as possible to a torch.DataLoader.
Attributes:
Name | Type | Description |
---|---|---|
_dataset | (np.ndarray) a 2d array of features | |
_target | (np.ndarray) an optional array of target values | |
_batch_size | (int) the number of elements in one batch | |
_shuffle | (bool) if True, shuffle the data at the beginning of every epoch | |
_drop_last | (bool) if True, drop the last batch if it does not contain batch_size elements | |
_rng | (np.random.Generator) the random number generator for shuffling |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | ndarray | 2D Numpy array | required |
target | ndarray | Numpy array of target values | required |
batch_size | int | batch size for each iteration | 1 |
shuffle | bool | shuffle before iteration | False |
random_seed | Optional[int] | an optional integer to set the numpy random seed for shuffling. If it equals None, then no attempt will be made to set the random seed. | None |
drop_last | bool | whether to drop the last batch in case it does not fill the whole batch size | False |
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def __init__(self,
dataset: np.ndarray,
target: np.ndarray,
batch_size: int = 1,
shuffle: bool = False,
random_seed: Optional[int] = None,
drop_last: bool = False):
"""Construct numpy data loader
Args:
dataset: 2D Numpy array
target: Numpy array of target values
batch_size: batch size for each iteration
shuffle: shuffle before iteration
random_seed: an optional integer to set the numpy random seed for shuffling. If it equals
None, then no attempt will be made to set the random seed.
drop_last: whether to drop the last batch in case it does not fill the whole batch size
"""
if not isinstance(dataset, np.ndarray) or not isinstance(target, np.ndarray):
msg = f"{ErrorNumbers.FB609.value}. Wrong input type for `dataset` or `target` in NPDataLoader. " \
f"Expected type np.ndarray for both, instead got {type(dataset)} and " \
f"{type(target)} respectively."
logger.error(msg)
raise FedbiomedTypeError(msg)
# If the researcher gave a 1-dimensional dataset, we expand it to 2 dimensions
if dataset.ndim == 1:
dataset = dataset[:, np.newaxis]
# If the researcher gave a 1-dimensional target, we expand it to 2 dimensions
if target.ndim == 1:
target = target[:, np.newaxis]
if dataset.ndim != 2 or target.ndim != 2:
raise FedbiomedValueError(
f"{ErrorNumbers.FB609.value}. Wrong shape for `dataset` or `target` in "
f"NPDataLoader. Expected 2-dimensional arrays, instead got {dataset.ndim}- "
f"dimensional and {target.ndim}-dimensional arrays respectively.")
if len(dataset) != len(target):
raise FedbiomedValueError(
f"{ErrorNumbers.FB609.value}. Inconsistent length for `dataset` and `target` "
f"in NPDataLoader. Expected same length, instead got len(dataset)={len(dataset)}, "
f"len(target)={len(target)}")
if not isinstance(batch_size, int) or batch_size <= 0:
raise FedbiomedValueError(
f"{ErrorNumbers.FB609.value}. Wrong value for `batch_size` parameter of "
f"NPDataLoader. Expected a non-zero positive integer, instead got value {batch_size}.")
if random_seed is not None and not isinstance(random_seed, int):
raise FedbiomedTypeError(
f"{ErrorNumbers.FB609.value}. Wrong type for `random_seed` parameter of "
f"NPDataLoader. Expected int or None, instead got {type(random_seed)}.")
self._dataset = dataset
self._target = target
self._batch_size = batch_size
self._shuffle = shuffle
self._drop_last = drop_last
self._rng = np.random.default_rng(random_seed)
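A minimal sketch of constructing the loader and reading its properties. The arrays below are illustrative, and the (features, target) batch format during iteration is assumed by analogy with torch.DataLoader, as described above.

import numpy as np
from fedbiomed.common.data import NPDataLoader

X = np.random.rand(10, 3)            # 10 samples, 3 features
y = np.arange(10)                    # 1-d target, expanded to 2-d by the loader

loader = NPDataLoader(dataset=X, target=y, batch_size=4,
                      shuffle=True, random_seed=42, drop_last=False)

print(loader.batch_size())           # 4
print(loader.n_remainder_samples())  # 10 % 4 == 2
for features, target in loader:      # iterate mini-batches, torch.DataLoader style
    pass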
Attributes
dataset property
dataset
Returns the encapsulated dataset
This needs to be a property to harmonize the API with torch.DataLoader, enabling us to write generic code for both DataLoaders.
target property
target
Returns the array of target values
This has been made a property to have a homogeneous interface with the dataset property above.
Functions
batch_size
batch_size()
Returns the batch size
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def batch_size(self) -> int:
"""Returns the batch size"""
return self._batch_size
drop_last
drop_last()
Returns the boolean drop_last attribute
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def drop_last(self) -> bool:
"""Returns the boolean drop_last attribute"""
return self._drop_last
n_remainder_samples
n_remainder_samples()
Returns the remainder of the division between dataset length and batch size.
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def n_remainder_samples(self) -> int:
"""Returns the remainder of the division between dataset length and batch size."""
return len(self._dataset) % self._batch_size
rng
rng()
Returns the random number generator
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def rng(self) -> np.random.Generator:
"""Returns the random number generator"""
return self._rng
shuffle
shuffle()
Returns the boolean shuffle attribute
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def shuffle(self) -> bool:
"""Returns the boolean shuffle attribute"""
return self._shuffle
SerializationValidation
SerializationValidation()
Provide Validation capabilities for serializing/deserializing a [DataLoadingBlock] or [DataLoadingPlan].
When a developer inherits from [DataLoadingBlock] to define a custom loading block, they are required to call the _serialization_validator.update_validation_scheme
function with a dictionary argument containing the rules to validate all the additional fields that will be used in the serialization of their loading block.
These rules must follow the syntax explained in the SchemeValidator class.
For example
class MyLoadingBlock(DataLoadingBlock):
    def __init__(self):
        super().__init__()
        self.my_custom_data = {}
        self._serialization_validator.update_validation_scheme({
            'custom_data': {
                'rules': [dict],  # ... any other rules
                'required': True
            }
        })

    def serialize(self):
        serialized = super().serialize()
        serialized.update({'custom_data': self.my_custom_data})
        return serialized

    def deserialize(self, load_from: dict):
        super().deserialize(load_from)
        self.my_custom_data = load_from['custom_data']
        return self
Attributes:
Name | Type | Description |
---|---|---|
_validation_scheme | (dict) an extensible set of rules to validate the DataLoadingBlock metadata. |
Source code in fedbiomed/common/data/_data_loading_plan.py
def __init__(self):
self._validation_scheme = {}
Functions
dlb_default_scheme classmethod
dlb_default_scheme()
The dictionary of default validation rules for a serialized [DataLoadingBlock].
Source code in fedbiomed/common/data/_data_loading_plan.py
@classmethod
def dlb_default_scheme(cls) -> Dict:
"""The dictionary of default validation rules for a serialized [DataLoadingBlock]."""
return {
'loading_block_class': {
'rules': [str, cls._identifier_validation_hook],
'required': True,
},
'loading_block_module': {
'rules': [str, cls._identifier_validation_hook],
'required': True,
},
'dlb_id': {
'rules': [str, cls._serial_id_validation_hook],
'required': True,
},
}
dlp_default_scheme classmethod
dlp_default_scheme()
The dictionary of default validation rules for a serialized [DataLoadingPlan].
Source code in fedbiomed/common/data/_data_loading_plan.py
@classmethod
def dlp_default_scheme(cls) -> Dict:
"""The dictionary of default validation rules for a serialized [DataLoadingPlan]."""
return {
'dlp_id': {
'rules': [str],
'required': True,
},
'dlp_name': {
'rules': [str],
'required': True,
},
'target_dataset_type': {
'rules': [str, cls._target_dataset_type_validator],
'required': True,
},
'loading_blocks': {
'rules': [dict, cls._loading_blocks_types_validator],
'required': True
},
'key_paths': {
'rules': [dict, cls._key_paths_validator],
'required': True
}
}
update_validation_scheme
update_validation_scheme(new_scheme)
Updates the validation scheme.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
new_scheme | dict | New dict of rules | required |
Source code in fedbiomed/common/data/_data_loading_plan.py
def update_validation_scheme(self, new_scheme: dict) -> None:
"""Updates the validation scheme.
Args:
new_scheme: (dict) new dict of rules
"""
self._validation_scheme.update(new_scheme)
validate
validate(dlb_metadata, exception_type, only_required=True)
Validate a dict of dlb_metadata according to the _validation_scheme.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dlb_metadata | dict | the [DataLoadingBlock] metadata, as returned by serialize or as loaded from the node database. | required |
exception_type | Type[FedbiomedError] | the type of the exception to be raised when validation fails. | required |
only_required | bool | see SchemeValidator.populate_with_defaults | True |
Raises: exception_type: if the validation fails.
Source code in fedbiomed/common/data/_data_loading_plan.py
def validate(self,
dlb_metadata: Dict,
exception_type: Type[FedbiomedError],
only_required: bool = True) -> None:
"""Validate a dict of dlb_metadata according to the _validation_scheme.
Args:
dlb_metadata (dict) : the [DataLoadingBlock] metadata, as returned by serialize or as loaded from the
node database.
exception_type (Type[FedbiomedError]): the type of the exception to be raised when validation fails.
only_required (bool) : see SchemeValidator.populate_with_defaults
Raises:
exception_type: if the validation fails.
"""
try:
sc = SchemeValidator(self._validation_scheme)
except RuleError as e:
msg = ErrorNumbers.FB614.value + f": {e}"
logger.critical(msg)
raise exception_type(msg)
try:
dlb_metadata = sc.populate_with_defaults(dlb_metadata,
only_required=only_required)
except ValidatorError as e:
msg = ErrorNumbers.FB614.value + f": {e}"
logger.critical(msg)
raise exception_type(msg)
try:
sc.validate(dlb_metadata)
except ValidateError as e:
msg = ErrorNumbers.FB614.value + f": {e}"
logger.critical(msg)
raise exception_type(msg)
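As a sketch of how validation is typically wired up, the default block scheme can be used to check a metadata dictionary. The metadata values and the exception import path below are assumptions made for illustration, not part of the documented API.

import uuid

from fedbiomed.common.data import SerializationValidation
from fedbiomed.common.exceptions import FedbiomedLoadingBlockValueError  # assumed import path

validator = SerializationValidation()
validator.update_validation_scheme(SerializationValidation.dlb_default_scheme())

dlb_metadata = {  # illustrative metadata, shaped like the output of DataLoadingBlock.serialize()
    'loading_block_class': 'MyLoadingBlock',
    'loading_block_module': 'mypackage.loading_blocks',
    'dlb_id': 'serialized_dlb_' + str(uuid.uuid4()),
}
validator.validate(dlb_metadata, FedbiomedLoadingBlockValueError)  # raises on invalid metadata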
SkLearnDataManager
SkLearnDataManager(inputs, target, **kwargs)
Bases: object
Wrapper for pd.DataFrame, pd.Series and np.ndarray datasets.
Manages datasets for scikit-learn based model training. Responsible for managing the inputs and target variables that have been provided in training_data
of scikit-learn based training plans.
The loader arguments will be passed to the [fedbiomed.common.data.NPDataLoader] classes instantiated when split is called. They may include batch_size, shuffle, drop_last, and others. Please see the [fedbiomed.common.data.NPDataLoader] class for more details.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | Union[ndarray, DataFrame, Series] | Independent variables (inputs, features) for model training | required |
target | Union[ndarray, DataFrame, Series] | Dependent variable/s (target) for model training and validation | required |
**kwargs | dict | Loader arguments | {} |
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def __init__(self,
inputs: Union[np.ndarray, pd.DataFrame, pd.Series],
target: Union[np.ndarray, pd.DataFrame, pd.Series],
**kwargs: dict):
""" Construct a SkLearnDataManager from an array of inputs and an array of targets.
The loader arguments will be passed to the [fedbiomed.common.data.NPDataLoader] classes instantiated
when split is called. They may include batch_size, shuffle, drop_last, and others. Please see the
[fedbiomed.common.data.NPDataLoader] class for more details.
Args:
inputs: Independent variables (inputs, features) for model training
target: Dependent variable/s (target) for model training and validation
**kwargs: Loader arguments
"""
if not isinstance(inputs, (np.ndarray, pd.DataFrame, pd.Series)) or \
not isinstance(target, (np.ndarray, pd.DataFrame, pd.Series)):
msg = f"{ErrorNumbers.FB609.value}. Parameters `inputs` and `target` for " \
f"initialization of {self.__class__.__name__} should be one of np.ndarray, pd.DataFrame, pd.Series"
logger.error(msg)
raise FedbiomedTypeError(msg)
# Convert pd.DataFrame or pd.Series to np.ndarray for `inputs`
if isinstance(inputs, (pd.DataFrame, pd.Series)):
self._inputs = inputs.to_numpy()
else:
self._inputs = inputs
# Convert pd.DataFrame or pd.Series to np.ndarray for `target`
if isinstance(target, (pd.DataFrame, pd.Series)):
self._target = target.to_numpy()
else:
self._target = target
# Additional loader arguments
self._loader_arguments = kwargs
# Subset None means that train/validation split has not been performed
self._subset_test: Union[Tuple[np.ndarray, np.ndarray], None] = None
self._subset_train: Union[Tuple[np.ndarray, np.ndarray], None] = None
Functions
dataset
dataset()
Gets the entire registered dataset.
This method returns the whole dataset as is, without any split.
Returns:
Name | Type | Description |
---|---|---|
inputs | ndarray | Input variables for model training |
targets | ndarray | Target variable for model training |
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def dataset(self) -> Tuple[np.ndarray, np.ndarray]:
"""Gets the entire registered dataset.
This method returns the whole dataset as is, without any split.
Returns:
inputs: Input variables for model training
targets: Target variable for model training
"""
return self._inputs, self._target
split
split(test_ratio, test_batch_size)
Splits the np.ndarray dataset into train and validation partitions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
test_ratio | float | Ratio for validation set partition. Rest of the samples will be used for training | required |
test_batch_size | int | Batch size for the validation data loader. If 0 or None, the whole validation partition is loaded as a single batch. | required |
Raises:
Type | Description |
---|---|
FedbiomedSkLearnDataManagerError | If the `test_ratio` is not a float between 0 and 1 |
Returns:
Name | Type | Description |
---|---|---|
train_loader | NPDataLoader | NPDataLoader for the training partition |
test_loader | NPDataLoader | NPDataLoader for the validation partition |
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def split(self, test_ratio: float, test_batch_size: int) -> Tuple[NPDataLoader, NPDataLoader]:
"""Splits `np.ndarray` dataset into train and validation.
Args:
test_ratio: Ratio for validation set partition. Rest of the samples will be used for training
Raises:
FedbiomedSkLearnDataManagerError: If the `test_ratio` is not between 0 and 1
Returns:
train_loader: NPDataLoader for the training partition
test_loader: NPDataLoader for the validation partition
"""
if not isinstance(test_ratio, float):
msg = f'{ErrorNumbers.FB609.value}: The argument `ratio` should be type `float` not {type(test_ratio)}'
logger.error(msg)
raise FedbiomedTypeError(msg)
if test_ratio < 0. or test_ratio > 1.:
msg = f'{ErrorNumbers.FB609.value}: The argument `ratio` should be equal or between 0 and 1, ' \
f'not {test_ratio}'
logger.error(msg)
raise FedbiomedTypeError(msg)
empty_subset = (np.array([]), np.array([]))
if test_ratio <= 0.:
self._subset_train = (self._inputs, self._target)
self._subset_test = empty_subset
elif test_ratio >= 1.:
self._subset_train = empty_subset
self._subset_test = (self._inputs, self._target)
else:
x_train, x_test, y_train, y_test = train_test_split(self._inputs, self._target, test_size=test_ratio)
self._subset_test = (x_test, y_test)
self._subset_train = (x_train, y_train)
if not test_batch_size:
test_batch_size = len(self._subset_test)
return self._subset_loader(self._subset_train, **self._loader_arguments), \
self._subset_loader(self._subset_test, batch_size=test_batch_size)
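A minimal sketch of the typical flow; the arrays and loader arguments are illustrative.

import numpy as np
from fedbiomed.common.data import SkLearnDataManager

X = np.random.rand(100, 5)             # features
y = np.random.randint(0, 2, size=100)  # targets

# Loader arguments (batch_size, shuffle, ...) are forwarded to NPDataLoader at split time.
manager = SkLearnDataManager(inputs=X, target=y, batch_size=8, shuffle=True)
train_loader, test_loader = manager.split(test_ratio=0.2, test_batch_size=None)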
subset_test
subset_test()
Gets Subset of dataset for validation partition.
Returns:
Name | Type | Description |
---|---|---|
test_inputs | ndarray | Input variables of validation subset for model validation |
test_target | ndarray | Target variable of validation subset for model validation |
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def subset_test(self) -> Tuple[np.ndarray, np.ndarray]:
"""Gets Subset of dataset for validation partition.
Returns:
test_inputs: Input variables of validation subset for model validation
test_target: Target variable of validation subset for model validation
"""
return self._subset_test
subset_train
subset_train()
Gets Subset for train partition.
Returns:
Name | Type | Description |
---|---|---|
train_inputs | ndarray | Input variables of training subset for model training |
train_target | ndarray | Target variable of training subset for model training |
Source code in fedbiomed/common/data/_sklearn_data_manager.py
def subset_train(self) -> Tuple[np.ndarray, np.ndarray]:
"""Gets Subset for train partition.
Returns:
train_inputs: Input variables of training subset for model training
train_target: Target variable of training subset for model training
"""
return self._subset_train
TabularDataset
TabularDataset(inputs, target)
Bases: Dataset
Torch-based Dataset object for creating a torch Dataset from numpy array or pandas DataFrame/Series input and target variables.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | Union[ndarray, DataFrame, Series] | Input variables that will be passed to network | required |
target | Union[ndarray, DataFrame, Series] | Target variable for output layer | required |
Raises:
Type | Description |
---|---|
FedbiomedTorchDatasetError | If input variables and target variable do not have equal length/size |
Source code in fedbiomed/common/data/_tabular_dataset.py
def __init__(self,
inputs: Union[np.ndarray, pd.DataFrame, pd.Series],
target: Union[np.ndarray, pd.DataFrame, pd.Series]):
"""Constructs PyTorch dataset object
Args:
inputs: Input variables that will be passed to network
target: Target variable for output layer
Raises:
FedbiomedTorchDatasetError: If input variables and target variable do not have
equal length/size
"""
# Inputs and target variable should be converted to the torch tensors
# PyTorch provides `from_numpy` function to convert numpy arrays to
# torch tensor. Therefore, if the arguments `inputs` and `target` are
# instance one of `pd.DataFrame` or `pd.Series`, they should be converted to
# numpy arrays
if isinstance(inputs, (pd.DataFrame, pd.Series)):
self.inputs = inputs.to_numpy()
elif isinstance(inputs, np.ndarray):
self.inputs = inputs
else:
raise FedbiomedDatasetError(f"{ErrorNumbers.FB610.value}: The argument `inputs` should be "
f"an instance of one of np.ndarray, pd.DataFrame or pd.Series")
# Configuring self.target attribute
if isinstance(target, (pd.DataFrame, pd.Series)):
self.target = target.to_numpy()
elif isinstance(target, np.ndarray):
self.target = target
else:
raise FedbiomedDatasetError(f"{ErrorNumbers.FB610.value}: The argument `target` should be "
f"an instance of one of np.ndarray, pd.DataFrame or pd.Series")
# The lengths should be equal
if len(self.inputs) != len(self.target):
raise FedbiomedDatasetError(f"{ErrorNumbers.FB610.value}: Length of input variables and target "
f"variable does not match. Please make sure that they have "
f"equal size while creating the method `training_data` of "
f"TrainingPlan")
# Convert `inputs` and `target` to Torch floats
self.inputs = from_numpy(self.inputs).float()
self.target = from_numpy(self.target).float()
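A small sketch of constructing the dataset from numpy arrays; the data is illustrative and only the documented attributes are inspected.

import numpy as np
from fedbiomed.common.data import TabularDataset

X = np.random.rand(50, 4)
y = np.random.rand(50, 1)

dataset = TabularDataset(inputs=X, target=y)
print(dataset.inputs.dtype, dataset.target.dtype)  # both converted to torch float tensors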
Attributes
inputs instance-attribute
inputs = float()
target instance-attribute
target = float()
Functions
get_dataset_type staticmethod
get_dataset_type()
Source code in fedbiomed/common/data/_tabular_dataset.py
@staticmethod
def get_dataset_type() -> DatasetTypes:
return DatasetTypes.TABULAR
TorchDataManager
TorchDataManager(dataset, **kwargs)
Bases: object
Wrapper for PyTorch Dataset to manage loading operations for validation and train.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | Dataset | Dataset object for torch.utils.data.DataLoader | required |
**kwargs | dict | Arguments for PyTorch | {} |
Raises:
Type | Description |
---|---|
FedbiomedTorchDataManagerError | If the argument `dataset` is not an instance of `torch.utils.data.Dataset` |
Source code in fedbiomed/common/data/_torch_data_manager.py
def __init__(self, dataset: Dataset, **kwargs: dict):
"""Construct of class
Args:
dataset: Dataset object for torch.utils.data.DataLoader
**kwargs: Arguments for PyTorch `DataLoader`
Raises:
FedbiomedTorchDataManagerError: If the argument `dataset` is not an instance of `torch.utils.data.Dataset`
"""
# TorchDataManager should get `dataset` argument as an instance of torch.utils.data.Dataset
if not isinstance(dataset, Dataset):
raise FedbiomedTorchDataManagerError(
f"{ErrorNumbers.FB608.value}: The attribute `dataset` should be an instance "
f"of `torch.utils.data.Dataset`, please use `Dataset` as parent class for "
f"your custom torch dataset object")
self._dataset = dataset
self._loader_arguments = kwargs
self._subset_test: Union[Subset, None] = None
self._subset_train: Union[Subset, None] = None
Functions
load_all_samples
load_all_samples()
Loading all samples as PyTorch DataLoader without splitting.
Returns:
Type | Description |
---|---|
DataLoader | Dataloader for entire datasets. |
Source code in fedbiomed/common/data/_torch_data_manager.py
def load_all_samples(self) -> DataLoader:
"""Loading all samples as PyTorch DataLoader without splitting.
Returns:
Dataloader for entire datasets. `DataLoader` arguments will be retrieved from the `**kwargs` which
is defined while initializing the class
"""
return self._create_torch_data_loader(self._dataset, **self._loader_arguments)
split
split(test_ratio, test_batch_size)
Splitting PyTorch Dataset into train and validation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
test_ratio | float | Split ratio for the validation set. Rest of the samples will be used for training | required |
test_batch_size | Union[int, None] | Batch size for the validation data loader. If None or 0, the whole validation subset is loaded as a single batch. | required |
Raises:
Type | Description |
---|---|
FedbiomedTorchDataManagerError | If `test_ratio` is not a float/int between 0 and 1 |
Returns:
Name | Type | Description |
---|---|---|
train_loader | Union[DataLoader, None] | DataLoader for training subset. |
test_loader | Union[DataLoader, None] | DataLoader for validation subset. |
Source code in fedbiomed/common/data/_torch_data_manager.py
def split(self, test_ratio: float, test_batch_size: Union[int, None]) -> Tuple[Union[DataLoader, None], Union[DataLoader, None]]:
""" Splitting PyTorch Dataset into train and validation.
Args:
test_ratio: Split ratio for validation set ratio. Rest of the samples will be used for training
Raises:
FedbiomedTorchDataManagerError: If `test_ratio` is not a float/int between 0 and 1
Returns:
train_loader: DataLoader for training subset. `None` if the `test_ratio` is `1`
test_loader: DataLoader for validation subset. `None` if the `test_ratio` is `0`
"""
# Check the argument `ratio` is of type `float`
if not isinstance(test_ratio, (float, int)):
raise FedbiomedTorchDataManagerError(f'{ErrorNumbers.FB608.value}: The argument `ratio` should be '
f'type `float` or `int` not {type(test_ratio)}')
# Check ratio is valid for splitting
if test_ratio < 0 or test_ratio > 1:
raise FedbiomedTorchDataManagerError(f'{ErrorNumbers.FB608.value}: The argument `ratio` should be '
f'equal or between 0 and 1, not {test_ratio}')
# Check that the dataset implements `__len__` so that the number of samples can be computed
if not hasattr(self._dataset, '__len__'):
raise FedbiomedTorchDataManagerError(f"{ErrorNumbers.FB608.value}: Can not get number of samples from "
f"{str(self._dataset)} without `__len__`. Please make sure "
f"that `__len__` method has been added to custom dataset. "
f"This method should return total number of samples.")
try:
samples = len(self._dataset)
except AttributeError as e:
raise FedbiomedTorchDataManagerError(f"{ErrorNumbers.FB608.value}: Can not get number of samples from "
f"{str(self._dataset)} due to undefined attribute, {str(e)}")
except TypeError as e:
raise FedbiomedTorchDataManagerError(f"{ErrorNumbers.FB608.value}: Can not get number of samples from "
f"{str(self._dataset)}, {str(e)}")
# Calculate number of samples for train and validation subsets
test_samples = math.floor(samples * test_ratio)
train_samples = samples - test_samples
self._subset_train, self._subset_test = random_split(self._dataset, [train_samples, test_samples])
if not test_batch_size:
test_batch_size = len(self._subset_test)
loaders = (self._subset_loader(self._subset_train, **self._loader_arguments),
self._subset_loader(self._subset_test, batch_size = test_batch_size))
return loaders
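A short sketch of wrapping a torch dataset and splitting it; the tensor dataset and loader arguments are illustrative.

import torch
from torch.utils.data import TensorDataset
from fedbiomed.common.data import TorchDataManager

dataset = TensorDataset(torch.rand(100, 5), torch.rand(100, 1))  # any torch Dataset works
# Keyword arguments are forwarded to torch.utils.data.DataLoader for the training loader.
manager = TorchDataManager(dataset, batch_size=16, shuffle=True)

train_loader, test_loader = manager.split(test_ratio=0.25, test_batch_size=None)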
subset_test
subset_test()
Gets validation subset of the dataset.
Returns:
Type | Description |
---|---|
Subset | Validation subset |
Source code in fedbiomed/common/data/_torch_data_manager.py
def subset_test(self) -> Subset:
"""Gets validation subset of the dataset.
Returns:
Validation subset
"""
return self._subset_test
subset_train
subset_train()
Gets train subset of the dataset.
Returns:
Type | Description |
---|---|
Subset | Train subset |
Source code in fedbiomed/common/data/_torch_data_manager.py
def subset_train(self) -> Subset:
"""Gets train subset of the dataset.
Returns:
Train subset
"""
return self._subset_train
to_sklearn
to_sklearn()
Converts a PyTorch Dataset to a Fed-BioMed sklearn data manager.
Returns:
Type | Description |
---|---|
SkLearnDataManager | Data manager to use in SkLearn base training plans |
Source code in fedbiomed/common/data/_torch_data_manager.py
def to_sklearn(self) -> SkLearnDataManager:
"""Converts PyTorch `Dataset` to sklearn data manager of Fed-BioMed.
Returns:
Data manager to use in SkLearn base training plans
"""
loader = self._create_torch_data_loader(self._dataset, batch_size=len(self._dataset))
# Iterate over samples and get input variable and target variable
inputs = next(iter(loader))[0].numpy()
target = next(iter(loader))[1].numpy()
return SkLearnDataManager(inputs=inputs, target=target, **self._loader_arguments)
Functions
discover_flamby_datasets
discover_flamby_datasets()
Automatically discover the available Flamby datasets based on the contents of the flamby.datasets module.
Returns:
Type | Description |
---|---|
Dict[int, str] | a dictionary {index: dataset_name} where index is an int and dataset_name is the name of a flamby module corresponding to a dataset, represented as str. To import said module one must prepend with the correct path: `import flamby.datasets.dataset_name`. |
Source code in fedbiomed/common/data/_flamby_dataset.py
def discover_flamby_datasets() -> Dict[int, str]:
"""Automatically discover the available Flamby datasets based on the contents of the flamby.datasets module.
Returns:
a dictionary {index: dataset_name} where index is an int and dataset_name is the name of a flamby module
corresponding to a dataset, represented as str. To import said module one must prepend with the correct
path: `import flamby.datasets.dataset_name`.
"""
dataset_list = [name for _, name, ispkg in pkgutil.iter_modules(flamby_datasets_module.__path__) if ispkg]
return {i: name for i, name in enumerate(dataset_list)}
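For illustration, the returned mapping can be used to import one of the discovered dataset modules. The dataset names shown in the comment are examples only, and flamby must be installed for the discovery to return anything.

import importlib
from fedbiomed.common.data import discover_flamby_datasets

available = discover_flamby_datasets()  # e.g. {0: 'fed_heart_disease', 1: 'fed_ixi', ...}
name = available[0]
module = importlib.import_module(f"flamby.datasets.{name}")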