Federated Analytics — Datasets
Overview
Federated Analytics (FA) lets researchers compute statistics — such as means and variances — across datasets that live on multiple remote nodes, all without the raw data ever leaving those nodes.
In FedBioMed, a node is a machine controlled by a data owner (e.g. a hospital) that holds a local dataset. Instead of centralising data, each node computes statistics locally and sends only the aggregated summaries back to the researcher.
This page covers the dataset side of Federated Analytics: which datasets support it and how to make a custom dataset FA-compatible.
- For how to run analytics as a researcher, see Federated Analytics — Researcher.
- For how to enable FA on a node, see Federated Analytics — Nodes.
What FA can compute — and on which datasets
Dataset element types
FA treats every dataset as a collection of samples, and each sample as one or more elements. Each element has a type that determines which statistics can be computed on it:
| Element type | What it represents |
|---|---|
| ROW | A single row of named columns (tabular data) |
| IMAGE | An N-dimensional array without named columns |
- Multi-modal datasets can contain both types simultaneously — FA handles them independently.
- Built-in dataset classes (TabularDataset, MedicalFolderDataset) declare their element type automatically and are FA-compatible out of the box. If you are writing a custom dataset, you must declare it yourself (see below).
Available statistics
Current implementation status
Only tabular (ROW) data is currently supported. Image datasets are not yet covered by FA. The enabled statistics are count, mean, and variance. histogram is partially implemented but is under validation and not yet available for use.
ROW elements (tabular data)
| Statistic | What it computes | Extra arguments required | Status |
|---|---|---|---|
| count | Number of non-missing values per column | none | Enabled |
| mean | Weighted mean per column across all nodes | none | Enabled |
| variance | Population variance per column across all nodes | none | Enabled |
| histogram | Per-column frequency counts in fixed bins | bin_edges per column (must be identical on all nodes) | Under validation |
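To make the table concrete, here is a minimal NumPy sketch of how per-node summaries could be combined into the global statistics. It is an illustration of the general idea, not the FA implementation: the function names (local_summary, combine, local_histogram) are hypothetical, missing values are assumed to be encoded as NaN, and the sum-of-squares formula is chosen for brevity rather than numerical robustness.

```python
import numpy as np


def local_summary(x):
    """Summarise one node's 2-D array (rows = samples, columns = features)."""
    valid = ~np.isnan(x)                 # True where a value is present
    filled = np.where(valid, x, 0.0)     # missing values contribute nothing
    return {
        "count": valid.sum(axis=0),
        "sum": filled.sum(axis=0),
        "sum_sq": (filled ** 2).sum(axis=0),
    }


def combine(summaries):
    """Merge per-node summaries into global count, mean and population variance."""
    count = sum(s["count"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    total_sq = sum(s["sum_sq"] for s in summaries)
    mean = total / count                       # each node is weighted by its count
    variance = total_sq / count - mean ** 2    # E[x^2] - E[x]^2
    return count, mean, variance


def local_histogram(column, bin_edges):
    """Count one column's non-missing values in bins shared by all nodes."""
    counts, _ = np.histogram(column[~np.isnan(column)], bins=bin_edges)
    return counts


# Two hypothetical nodes, each holding two columns
node_a = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
node_b = np.array([[4.0, 20.0], [5.0, 40.0]])

count, mean, variance = combine([local_summary(node_a), local_summary(node_b)])

# Identical edges on every node let the per-bin counts be added element-wise
edges = [0.0, 20.0, 40.0, 60.0]
second_col_hist = (
    local_histogram(node_a[:, 1], edges) + local_histogram(node_b[:, 1], edges)
)
```

Only the counts, sums and per-bin counts would need to leave a node in this sketch. It also shows why bin_edges must match across nodes: counts from different binnings cannot be added.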
IMAGE elements (not yet supported)
Image datasets are not covered by the current FA implementation. The ImageSpec and _image.py accumulator exist in the codebase but are not active. Support will be added in a future release.
Making a Custom Dataset FA-Compatible
If your dataset inherits from fedbiomed.common.dataset.Dataset, two things are needed: the NumPy data return format and an analytics_schema() method.
Data return format
The analytics engine requires your dataset to return data as NumPy arrays (DataReturnFormat.SKLEARN). Set self.to_format = DataReturnFormat.SKLEARN during your dataset's initialisation before using FA.
Implement analytics_schema()
analytics_schema() returns a description of what __getitem__ produces, so the analytics engine knows how to interpret each sample.
__getitem__ returns a (data, target) tuple. analytics_schema() mirrors that structure: it returns a (data_spec, target_spec) tuple where each spec is either a RowSpec, an ImageSpec, or None (meaning "skip this part").
- Use RowSpec(columns=[...]) when the element is a 2-D NumPy array whose columns have names (tabular data). The column list must match the column order that __getitem__ actually returns in that position.
- Use ImageSpec() when the element is an N-D NumPy array.
- Use None for parts that analytics should ignore (typically the target).
from fedbiomed.common.dataset import Dataset
from fedbiomed.common.dataset_types import RowSpec, ImageSpec

class MyDataset(Dataset):
    def __getitem__(self, idx):
        # returns (array with columns ["age", "weight"], label)
        ...

    def analytics_schema(self):
        # Mirrors __getitem__: describe `data` with RowSpec, skip `target` with None
        return RowSpec(columns=["age", "weight"]), None
For a multi-modal dataset whose __getitem__ returns a dict as its first element, the schema's first element must be a matching dict — same keys, each mapped to the appropriate spec:
def __getitem__(self, idx):
    # data is a dict; keys must match the schema below
    data = {
        "demographics": array_of_shape(n, 2),  # tabular
        "T1": array_of_shape(h, w, d),         # 3-D image
    }
    return data, None

def analytics_schema(self):
    return {
        "demographics": RowSpec(columns=["age", "weight"]),
        "T1": ImageSpec(),
    }, None
Image support not yet available
ImageSpec is defined in the codebase but image datasets are not yet supported by FA. Including an ImageSpec modality will have no effect until image support is complete.
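Because the schema has to mirror what __getitem__ returns, it can help to check one sample against it before running analytics. The sketch below is a hypothetical helper, not part of the library: RowSpec and ImageSpec are redefined as local stand-ins so the example runs on its own, and part_problems and sample_problems are names invented for this illustration. The row check assumes the last array axis holds the named columns.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class RowSpec:
    """Stand-in for illustration only, not the library class."""
    columns: list = field(default_factory=list)


@dataclass
class ImageSpec:
    """Stand-in for illustration only, not the library class."""


def part_problems(part, spec, path):
    """List mismatches between one part of a sample and its spec."""
    if spec is None:                      # None means "skip this part"
        return []
    if isinstance(spec, dict):            # multi-modal: keys must match exactly
        if not isinstance(part, dict):
            return [f"{path}: schema is a dict, sample is {type(part).__name__}"]
        problems = []
        if set(part) != set(spec):
            problems.append(f"{path}: keys differ: {sorted(set(part) ^ set(spec))}")
        for key in sorted(set(part) & set(spec)):
            problems += part_problems(part[key], spec[key], f"{path}[{key!r}]")
        return problems
    if not isinstance(part, np.ndarray):
        return [f"{path}: expected a NumPy array, got {type(part).__name__}"]
    if isinstance(spec, RowSpec):
        width = part.shape[-1] if part.ndim else 0
        if width != len(spec.columns):
            return [f"{path}: {width} columns returned, {len(spec.columns)} declared"]
    return []


def sample_problems(sample, schema):
    """Check a (data, target) sample against a (data_spec, target_spec) schema."""
    problems = []
    for name, part, spec in zip(("data", "target"), sample, schema):
        problems += part_problems(part, spec, name)
    return problems


schema = (
    {"demographics": RowSpec(columns=["age", "weight"]), "T1": ImageSpec()},
    None,
)
good = ({"demographics": np.zeros((1, 2)), "T1": np.zeros((4, 4, 4))}, None)
bad_width = ({"demographics": np.zeros((1, 3)), "T1": np.zeros((4, 4, 4))}, None)
missing_key = ({"demographics": np.zeros((1, 2))}, None)

good_report = sample_problems(good, schema)
width_report = sample_problems(bad_width, schema)
key_report = sample_problems(missing_key, schema)
```

An empty list means the sample and the schema agree in structure. The helper does not verify that the column names are in the right order, which only the dataset author can confirm.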
Common Errors & Troubleshooting
| Error message | Cause | Fix |
|---|---|---|
| Dataset does not implement 'analytics_schema' | Custom dataset is missing the analytics_schema() method | Add analytics_schema() (see above) |
| Dataset format … is not supported for analytics | The dataset is not configured to return NumPy arrays | Ensure self.to_format = DataReturnFormat.SKLEARN is set during dataset initialisation |
| Dataset does not support analytics method 'compute_stats' | The dataset class does not inherit from fedbiomed.common.dataset.Dataset | Make sure your class inherits from fedbiomed.common.dataset.Dataset |
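The first two rows of the table can be caught early with a small pre-flight check. This is a sketch under assumptions, not a library feature: analytics_preflight is a hypothetical helper, the DataReturnFormat enum below is a local stand-in (its OTHER member is invented for the demo), and the only attribute names it relies on are to_format and analytics_schema as they appear in the table. The inheritance requirement in the third row is not checked here, since that needs the real base class.

```python
from enum import Enum


class DataReturnFormat(Enum):
    """Stand-in enum for illustration; use the library's own enum in practice."""
    SKLEARN = "sklearn"
    OTHER = "other"


def analytics_preflight(dataset, expected_format):
    """Return a list of likely FA configuration problems for a dataset object."""
    issues = []
    if not callable(getattr(dataset, "analytics_schema", None)):
        issues.append("analytics_schema() is missing")
    if getattr(dataset, "to_format", None) != expected_format:
        issues.append("to_format is not set to the NumPy return format")
    return issues


class ReadyDataset:
    """Hypothetical dataset that satisfies both checks."""

    def __init__(self):
        self.to_format = DataReturnFormat.SKLEARN

    def analytics_schema(self):
        return None, None


class UnreadyDataset:
    """Hypothetical dataset with the wrong format and no schema method."""

    def __init__(self):
        self.to_format = DataReturnFormat.OTHER


ready_issues = analytics_preflight(ReadyDataset(), DataReturnFormat.SKLEARN)
unready_issues = analytics_preflight(UnreadyDataset(), DataReturnFormat.SKLEARN)
```

Running a check like this locally gives faster feedback than waiting for a node to reject the analytics request.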