Datasets

This module contains the classes that allow the researcher to interact with remote datasets (federated datasets).

Classes

FederatedDataSet

FederatedDataSet(data)

A class that allows the researcher to interact with remote datasets (federated datasets).

It holds details about the remote datasets, such as node IDs and data sizes, which can be useful for aggregation or sampling strategies on the researcher's side.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | Dict | Dictionary of datasets. Each key is a str representing a node's ID. Each value is a dict (or a list containing exactly one dict). Each dict contains the description of the dataset associated to this node in the federated dataset. | required |

Raises:

| Type | Description |
| --- | --- |
| FedbiomedFederatedDataSetError | bad data format |
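The accepted shapes of `data` can be illustrated without the library. The sketch below is a minimal stand-in that mirrors the checks described above, not the class itself: `normalize_datasets` is a hypothetical helper, `ValueError` stands in for `FedbiomedFederatedDataSetError`, and the node IDs and the `name` field are invented for illustration.

```python
from typing import Any, Dict


def normalize_datasets(data: Dict[str, Any]) -> Dict[str, dict]:
    """Hypothetical helper mirroring the documented checks on `data`.

    Raises ValueError where the real class raises FedbiomedFederatedDataSetError.
    """
    if not isinstance(data, dict):
        raise ValueError("`data` must be a dict")
    normalized = {}
    for node_id, ds in data.items():
        if not isinstance(node_id, str):
            raise ValueError("node IDs must be str")
        if isinstance(ds, list):
            # A list is accepted only if it wraps exactly one dict.
            if len(ds) != 1 or not isinstance(ds[0], dict):
                raise ValueError(f"{node_id} must have exactly one dataset")
            ds = ds[0]
        elif not isinstance(ds, dict):
            raise ValueError(f"dataset of {node_id} must be a dict or a list of one dict")
        normalized[node_id] = ds
    return normalized


# Illustrative input: node IDs and fields are invented for this example.
raw = {
    "node_a": {"name": "train", "shape": [1000, 15]},
    "node_b": [{"name": "train", "shape": [250, 15]}],  # list of one dict
}
clean = normalize_datasets(raw)
```

Unlike this sketch, which builds a new dictionary, the constructor in the source replaces each one-element list inside the dictionary it was given.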

Source code in fedbiomed/researcher/datasets.py
def __init__(self, data: Dict):
    """Construct FederatedDataSet object.

    Args:
        data: Dictionary of datasets. Each key is a `str` representing a node's ID. Each value is
            a `dict` (or a `list` containing exactly one `dict`). Each `dict` contains the description
            of the dataset associated to this node in the federated dataset. 

    Raises:
        FedbiomedFederatedDataSetError: bad `data` format
    """
    # check structure of data
    self._v = Validator()
    self._v.register("list_or_dict", self._dataset_type, override=True)
    try:
        self._v.validate(data, dict)
        for node, ds in data.items():
            self._v.validate(node, str)
            self._v.validate(ds, "list_or_dict")
            if isinstance(ds, list):
                if len(ds) == 1:
                    self._v.validate(ds[0], dict)
                    # convert list of one dict to dict
                    data[node] = ds[0]
                else:
                    errmess = f'{ErrorNumbers.FB416.value}: {node} must have one unique dataset ' \
                        f'but has {len(ds)} datasets.'
                    logger.error(errmess)
                    raise FedbiomedFederatedDataSetError(errmess)
    except ValidatorError as e:
        errmess = f'{ErrorNumbers.FB416.value}: bad parameter `data` must be a `dict` of ' \
            f'(`list` of one) `dict`: {e}'
        logger.error(errmess)
        raise FedbiomedFederatedDataSetError(errmess)

    self._data = data

Functions

data
data()

Retrieve the FederatedDataSet as a dict.

Returns:

| Type | Description |
| --- | --- |
| Dict | Federated datasets, keyed by node ID |
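Because the constructor stores `data` without copying it and `data()` returns that stored object, the returned dictionary is not a copy. The sketch below shows the consequence using a tiny stand-in class; `FederatedDataSetSketch` is hypothetical and omits all validation, and the dataset values are invented.

```python
from copy import deepcopy


class FederatedDataSetSketch:
    """Hypothetical stand-in that stores and returns the dict without copying."""

    def __init__(self, data: dict):
        self._data = data  # no copy is made

    def data(self) -> dict:
        return self._data


original = {"node_a": {"shape": [1000, 15]}}
fds = FederatedDataSetSketch(original)

view = fds.data()
snapshot = deepcopy(fds.data())  # independent copy, safe to modify

# Editing the returned dict also changes `original` and the stored data.
view["node_a"]["shape"] = [10, 15]
```

Taking a `deepcopy` first is one way to experiment with the description without touching the stored data.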

Source code in fedbiomed/researcher/datasets.py
def data(self) -> Dict:
    """Retrieve FederatedDataset as [`dict`][dict].

    Returns:
       Federated datasets, keys as node ids
    """
    return self._data

node_ids
node_ids()

Retrieve Node ids from FederatedDataSet.

Returns:

| Type | Description |
| --- | --- |
| List[str] | List of node ids |

Source code in fedbiomed/researcher/datasets.py
def node_ids(self) -> List[str]:
    """Retrieve Node ids from `FederatedDataSet`.

    Returns:
        List of node ids
    """
    return list(self._data.keys())

sample_sizes
sample_sizes()

Retrieve the list of sample sizes of the nodes' datasets.

Returns:

| Type | Description |
| --- | --- |
| List[int] | List of sample sizes in federated datasets, in the same order as node_ids |
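The sample size of a node is taken from the first entry of its dataset's `shape` field. The following sketch reproduces that with a plain dictionary; `sample_sizes_of` is a hypothetical free function, not part of the library, and the values are invented.

```python
from typing import Dict, List


def sample_sizes_of(data: Dict[str, dict]) -> List[int]:
    """Hypothetical free-function version of the method: first entry of each shape."""
    return [entry["shape"][0] for entry in data.values()]


data = {
    "node_a": {"shape": [1000, 15]},
    "node_b": {"shape": [250, 15]},
}
node_ids = list(data.keys())
sizes = sample_sizes_of(data)

# Both lists follow the dict's insertion order, so they can be paired by index.
pairs = list(zip(node_ids, sizes))
```

A dataset description without a `shape` key would raise a `KeyError` in this sketch.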

Source code in fedbiomed/researcher/datasets.py
def sample_sizes(self) -> List[int]:
    """Retrieve list of sample sizes of node's dataset.

    Returns:
        List of sample sizes in federated datasets in the same order with
            [node_ids][fedbiomed.researcher.datasets.FederatedDataSet.node_ids]
    """
    sample_sizes = []
    for (key, val) in self._data.items():
        sample_sizes.append(val["shape"][0])

    return sample_sizes

shapes
shapes()

Get the sample size of each node's dataset, keyed by node ID.

Returns:

| Type | Description |
| --- | --- |
| Dict[str, int] | Dictionary mapping each node ID to its sample size (see sample_sizes) |

Source code in fedbiomed/researcher/datasets.py
def shapes(self) -> Dict[str, int]:
    """Get shape of FederatedDatasets by node ids.

    Returns:
        Includes [`sample_sizes`][fedbiomed.researcher.datasets.FederatedDataSet.sample_sizes] by node_ids.
    """
    shapes_dict = {}
    for node_id, node_data_size in zip(self.node_ids(),
                                       self.sample_sizes()):
        shapes_dict[node_id] = node_data_size

    return shapes_dict
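
A mapping of this form is a natural input for the aggregation strategies mentioned in the class description. As a minimal sketch, assuming weights proportional to sample counts are wanted, the hypothetical helper `aggregation_weights` below turns such a mapping into per-node weights; it is not part of the library and the numbers are invented.

```python
from typing import Dict


def aggregation_weights(shapes: Dict[str, int]) -> Dict[str, float]:
    """Hypothetical helper: weight each node by its share of the total samples."""
    total = sum(shapes.values())
    if total == 0:
        raise ValueError("no samples across nodes")
    return {node_id: n / total for node_id, n in shapes.items()}


# `shapes` has the form returned by FederatedDataSet.shapes(); values are invented.
shapes = {"node_a": 1000, "node_b": 250, "node_c": 750}
weights = aggregation_weights(shapes)
```

The weights sum to 1, so they can be used directly in a weighted average of per-node results.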