Datasets

This module includes the classes that allow the researcher to interact with remote datasets (federated datasets).

Attributes

FederatedDataSet module-attribute

FederatedDataSet = FederatedDataset

Classes

FederatedDataset

FederatedDataset(data=None)

A class that allows the researcher to interact with remote datasets (federated datasets).

It contains details about the remote datasets, such as node IDs and data sizes, that can be useful for aggregation or sampling strategies on the researcher's side.

Parameters:

Name Type Description Default
data Optional[Dict]

Dictionary of datasets. Each key is a str representing a node's ID. Each value is a dict (or a list containing exactly one dict). Each dict contains the description of the dataset associated with this node in the federated dataset.

None
Source code in fedbiomed/researcher/datasets.py
def __init__(self, data: Optional[Dict] = None):
    """Construct FederatedDataset object.

    Args:
        data:  Dictionary of datasets. Each key is a `str` representing a node's ID. Each value is
            a `dict` (or a `list` containing exactly one `dict`). Each `dict` contains the description
            of the dataset associated to this node in the federated dataset.
    """
    # check structure of data

    if data is not None:
        self.set_federated_dataset(data)
    else:
        self._data = {}

Functions

data
data()

Retrieve FederatedDataset as dict.

Returns:

Type Description
Dict

Federated datasets, keyed by node ID

Source code in fedbiomed/researcher/datasets.py
def data(self) -> Dict:
    """Retrieve FederatedDataset as [`dict`][dict].

    Returns:
       Federated datasets, keys as node ids
    """
    return self._data
node_ids
node_ids()

Retrieve Node ids from FederatedDataset.

Returns:

Type Description
List[str]

List of node ids

Source code in fedbiomed/researcher/datasets.py
def node_ids(self) -> List[str]:
    """Retrieve Node ids from `FederatedDataset`.

    Returns:
        List of node ids
    """
    return list(self._data.keys())
sample_sizes
sample_sizes()

Retrieve the list of sample sizes of the nodes' datasets.

Returns:

Type Description
List[int]

List of sample sizes in the federated dataset, in the same order as node_ids

Source code in fedbiomed/researcher/datasets.py
def sample_sizes(self) -> List[int]:
    """Retrieve list of sample sizes of node's dataset.

    Returns:
        List of sample sizes in federated datasets in the same order as
            [node_ids][fedbiomed.researcher.datasets.FederatedDataset.node_ids]
    """
    sample_sizes = []
    for _, val in self._data.items():
        sample_sizes.append(val["shape"][0])

    return sample_sizes
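
The ordering guarantee follows from Python dictionaries preserving insertion order: both `node_ids()` and `sample_sizes()` iterate over the same underlying dict. The standalone sketch below mirrors that logic outside the class; the function name and the sample data are assumptions for illustration, not part of the library.

```python
from typing import Dict, List


def sample_sizes(data: Dict[str, dict]) -> List[int]:
    """Return the first dimension of each node's `shape`, in key order."""
    return [description["shape"][0] for description in data.values()]


# Hypothetical node IDs and shapes.
data = {
    "node-1": {"shape": [1000, 15]},
    "node-2": {"shape": [250, 15]},
}

node_ids = list(data.keys())
sizes = sample_sizes(data)

# The two lists line up index by index.
for node_id, size in zip(node_ids, sizes):
    print(node_id, size)
```

Note that a description without a `shape` key would raise a `KeyError` in this sketch, as it would in the method above.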
set_federated_dataset
set_federated_dataset(datasets)

Set federated dataset.

Parameters:

Name Type Description Default
datasets Dict

Dictionary of datasets. Each key is a str representing a node's ID. Each value is a dict (or a list containing exactly one dict). Each dict contains the description of the dataset associated with this node in the federated dataset.

required

Raises:

Type Description
FedbiomedError

If datasets is not a dict, or if any value is neither a dict nor a list containing exactly one dict

Source code in fedbiomed/researcher/datasets.py
def set_federated_dataset(self, datasets: Dict) -> None:
    """Set federated dataset.

    Args:
        datasets:  Dictionary of datasets. Each key is a `str` representing a node's ID. Each value is
            a `dict` (or a `list` containing exactly one `dict`). Each `dict` contains the description
            of the dataset associated to this node in the federated dataset.

    Raises:
        FedbiomedError: bad `data` format
    """
    # check structure of data
    # DEPRECATED: to be removed in future versions
    if isinstance(datasets, FederatedDataset):
        logger.warning(
            "DEPRECATED: Passing a `FederatedDataset` instance"
            " to the `data` parameter of `FederatedDataset` is deprecated and "
            "will not be supported in future versions. Please pass a `dict` "
            "representing the federated dataset instead."
        )
        datasets = copy.deepcopy(datasets.data())

    if isinstance(datasets, dict) is False:
        raise FedbiomedError(
            f"{ErrorNumbers.FB416.value}: bad parameter `data` must be a `dict` of "
            f"(`list` of one) `dict`."
        )

    for node_id, node_data in datasets.items():
        if not (isinstance(node_data, dict) or isinstance(node_data, list)):
            raise FedbiomedError(
                f"{ErrorNumbers.FB416.value}: bad parameter `data` for node {node_id}. "
                f"Must be a `dict` or a `list` containing exactly one `dict`."
            )
        if isinstance(node_data, list):
            if len(node_data) != 1 or not isinstance(node_data[0], dict):
                raise FedbiomedError(
                    f"{ErrorNumbers.FB416.value}: bad parameter `data` for node {node_id}. "
                    f"Must be a `dict` or a `list` containing exactly one `dict`."
                )
            else:
                datasets[node_id] = node_data[0]

    self._data = datasets
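
The validation rules above can be sketched as a standalone function. This is a simplified stand-in, not the library's implementation: it raises `ValueError` where the library raises `FedbiomedError`, it skips the deprecated `FederatedDataset` input path, and the name `normalize_datasets` is hypothetical.

```python
from typing import Any, Dict


def normalize_datasets(datasets: Any) -> Dict[str, dict]:
    """Validate `datasets` and unwrap single-element lists into dicts."""
    if not isinstance(datasets, dict):
        raise ValueError("`data` must be a dict of (list of one) dict")

    normalized = {}
    for node_id, node_data in datasets.items():
        if isinstance(node_data, list):
            # A list is accepted only if it holds exactly one dict.
            if len(node_data) != 1 or not isinstance(node_data[0], dict):
                raise ValueError(f"bad `data` for node {node_id}")
            node_data = node_data[0]
        elif not isinstance(node_data, dict):
            raise ValueError(f"bad `data` for node {node_id}")
        normalized[node_id] = node_data
    return normalized


# Accepted: a dict value and a single-dict list value.
result = normalize_datasets({"node-1": {"shape": [10, 2]}, "node-2": [{"shape": [5, 2]}]})

# Rejected: a list with two dicts.
try:
    normalize_datasets({"node-1": [{"shape": [1, 1]}, {"shape": [2, 1]}]})
    rejected = False
except ValueError:
    rejected = True
```

One difference worth noting: the sketch builds a new dictionary, whereas the method above replaces list values in the dictionary it receives and then stores that same object in `self._data`, so the caller's dictionary is modified and shared.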
shapes
shapes()

Get the sample size of each node's dataset, keyed by node ID.

Returns:

Type Description
Dict[str, int]

Dictionary mapping each node ID to its sample size (see sample_sizes).

Source code in fedbiomed/researcher/datasets.py
def shapes(self) -> Dict[str, int]:
    """Get shape of FederatedDatasets by node ids.

    Returns:
        Includes [`sample_sizes`][fedbiomed.researcher.datasets.FederatedDataset.sample_sizes] by node_ids.
    """
    shapes_dict = {}
    for node_id, node_data_size in zip(
        self.node_ids(), self.sample_sizes(), strict=False
    ):
        shapes_dict[node_id] = node_data_size

    return shapes_dict
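
As one illustration of how these sizes might feed an aggregation strategy on the researcher's side, the sketch below turns a `shapes()`-style mapping into weights proportional to each node's sample count. This is an assumed use case, not the library's aggregator; the function name and node IDs are hypothetical.

```python
from typing import Dict


def aggregation_weights(shapes: Dict[str, int]) -> Dict[str, float]:
    """Weight each node by its share of the total number of samples."""
    total = sum(shapes.values())
    if total == 0:
        raise ValueError("cannot compute weights: total sample size is zero")
    return {node_id: size / total for node_id, size in shapes.items()}


# Same structure as the dictionary returned by shapes(): node ID -> sample size.
shapes = {"node-1": 1000, "node-2": 250, "node-3": 750}
weights = aggregation_weights(shapes)
print(weights)
```

With these inputs, `node-1` holds half of the samples and therefore receives a weight of 0.5.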