Federated Analytics — Researcher
Overview
Federated Analytics (FA) lets you compute statistics — such as means and variances — across datasets held by multiple remote nodes, without the raw data ever leaving those nodes.
This page covers the researcher side: how to issue analytics requests and interpret the results.
- To see which datasets support FA and how to make a custom dataset FA-compatible, see Federated Analytics — Datasets.
- For enabling FA on a node, see Federated Analytics — Nodes.
How it works
Request / compute / aggregate cycle
Every analytics call follows the same three-phase cycle:
- Request — you send a statistics request to all participating nodes.
- Local compute — each node runs the computation on its own data and sends back only the summary (not the raw data).
- Aggregate — FedBioMed combines the per-node summaries into a single global result on the researcher side.
Because each node only ever returns aggregated summaries, the raw data never leaves the node.
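The three phases above can be sketched for a single column's mean: each node ships back only a small summary, and the researcher combines the summaries with a count-weighted average. The helper below is a minimal illustration of the aggregate phase, not Fed-BioMed's internal code; node IDs and values are made up.

```python
# A minimal sketch of the aggregate phase, assuming each node returns only a
# {"count": n, "mean": m} summary for one column. Node IDs and values are
# made up for illustration.
node_summaries = {
    "NODE_abc": {"count": 100, "mean": 47.2},
    "NODE_xyz": {"count": 98, "mean": 44.8},
}

def aggregate_mean(summaries):
    """Combine per-node means into a global mean via a count-weighted average."""
    total = sum(s["count"] for s in summaries.values())
    return sum(s["count"] * s["mean"] for s in summaries.values()) / total

global_mean = aggregate_mean(node_summaries)  # lies between the two node means
```

Note that the raw samples never appear in this exchange: two numbers per node are enough to recover the exact global mean.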
The dataset schema
Every FA-compatible dataset declares a schema — a description of what its samples look like: column names for tabular data, or nested structures for multi-modal datasets (see Federated Analytics — Datasets for details on how datasets declare their schema).
When you call fetch_stats(), the analytics engine reads this schema from each node's dataset to know how to interpret each sample. You don't normally need to read or manipulate it directly — dataset_schema lets you select a subset of it (see below).
The Experiment object and analytics
Experiment is the single entry point for both federated training and federated analytics (see Experiment for more). When you instantiate it with a set of tags, it contacts all connected nodes, discovers which ones hold a matching dataset, and collects the responses. As part of that setup it also creates a FederatedAnalytics object and exposes it as exp.analytics.
FederatedAnalytics access
You never construct FederatedAnalytics directly; it is always accessed through the analytics attribute of an experiment. All network communication, result caching, and aggregation are handled internally — from your perspective, you call a method and get back a result.
Caching
To avoid redundant network requests, results are cached per (node_ids, dataset_schema, stats_args) combination. When you repeat a call with the same parameters, the cached result is returned immediately — no messages are sent to the nodes.
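The caching rule can be illustrated with a small memoization sketch keyed by that triple. The names `_cache`, `_query_nodes`, and `fetch_cached` are hypothetical, chosen for illustration — this is not Fed-BioMed's actual caching code.

```python
import json

# Minimal sketch of per-parameter result caching, assuming the cache key is
# the (node_ids, dataset_schema, stats_args) triple described above.
_cache = {}
network_calls = []  # records simulated node round-trips, for demonstration

def _query_nodes(node_ids, dataset_schema, stats_args):
    network_calls.append(tuple(node_ids))  # pretend we contacted the nodes
    return {"mean": 46.0}                  # pretend aggregated answer

def fetch_cached(node_ids, dataset_schema, stats_args):
    # Serialize the unhashable list/dict parts so the triple can key a dict.
    key = (tuple(node_ids),
           json.dumps(dataset_schema, sort_keys=True),
           json.dumps(stats_args, sort_keys=True))
    if key not in _cache:
        _cache[key] = _query_nodes(node_ids, dataset_schema, stats_args)
    return _cache[key]

r1 = fetch_cached(["n1", "n2"], ["age"], {"stats": "mean"})
r2 = fetch_cached(["n1", "n2"], ["age"], {"stats": "mean"})  # cache hit: no messages sent
```

Changing any element of the triple — a different node set, a different schema selection, or different statistics — produces a new key and triggers a fresh round of requests.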
Hands on
Convenience Methods
A convenience method is a thin wrapper around fetch_stats that covers the most common use-cases without requiring you to name the statistic explicitly. Each one sends the request, waits for all nodes to respond, aggregates the partial results, and returns the global value in a single call:
count = exp.analytics.count()
mean = exp.analytics.mean()
variance = exp.analytics.variance()
The return value is a dict keyed by column name (for tabular data) or by modality name (for multi-modal data). For example, on a two-node tabular dataset with columns year, price, mileage, mpg, …:
>>> exp.analytics.mean()
{'year': 2016.96, 'price': 16235.20, 'transmission': 1.02,
'mileage': 23908.94, 'tax': 118.06, 'mpg': 55.25, 'engineSize': 1.57}
>>> exp.analytics.variance()
{'year': 4.40, 'price': 91583707.11, 'transmission': 0.31,
'mileage': 444227213.03, 'tax': 4131.06, 'mpg': 138.72, 'engineSize': 0.33}
>>> exp.analytics.count()
{'year': 28632, 'price': 28633, 'transmission': 28633,
'mileage': 28633, 'tax': 28633, 'mpg': 28633, 'engineSize': 28633}
Convenience method limitations
count can differ per column when a column has missing values. histogram is not yet available — it is under validation and its implementation is not yet complete.
Filtering with dataset_schema
By default, statistics are computed over everything the schema describes. Pass dataset_schema to restrict computation to a subset of columns or modalities.
Tabular dataset — select columns by name:
# Only compute the mean over 'age' and 'bmi' columns
mean = exp.analytics.mean(dataset_schema=["age", "bmi"])
Column names
The list must contain column names that exist in the dataset's schema.
Multi-modal dataset — select modalities, and optionally columns within them:
For datasets that expose multiple modalities, dataset_schema is either a list of modality names to include in full, or a dict mapping each modality name to a column selector for that modality:
# Include 'age'/'bmi' from 'demographics'
mean = exp.analytics.mean(
    dataset_schema={"demographics": ["age", "bmi"]}
)
Modality filtering
Any modality not mentioned in dataset_schema is excluded from the computation.
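To make the two selector forms concrete, here is a hypothetical sketch of how such a filter narrows a multi-modal schema. Both `apply_filter` and the schema contents are illustrative, not Fed-BioMed internals.

```python
# Hypothetical multi-modal schema: two modalities, each with named columns.
full_schema = {
    "demographics": ["age", "bmi", "sex"],
    "imaging": ["T1", "T2"],
}

def apply_filter(schema, selector):
    """Sketch of dataset_schema filtering: list form vs dict form."""
    if isinstance(selector, list):
        # List form: include the named modalities in full.
        return {m: schema[m] for m in selector}
    # Dict form: keep only the selected columns of each named modality.
    return {m: [c for c in schema[m] if c in cols] for m, cols in selector.items()}

whole = apply_filter(full_schema, ["demographics"])                    # whole modality
subset = apply_filter(full_schema, {"demographics": ["age", "bmi"]})  # columns within it
# 'imaging' is excluded in both cases, since it is not mentioned.
```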
Using fetch_stats and fetch_stats_with_args
The convenience methods internally call fetch_stats. You can call it directly when you need full control over what is requested or want to inspect per-node results. It returns an FAResult object:
result = exp.analytics.fetch_stats(
    stats="variance",              # statistic to compute — 'mean', 'variance', 'count'
    dataset_schema=["age", "bmi"], # optional column filter
)
# Globally aggregated value for a single named statistic
variance_result = result.global_stats("variance")
# All available stats merged into one nested dict: {column: {stat_name: value}}
all_stats = result.global_stats()
# Raw output from a single node — useful for debugging or per-site analysis
node_output = result.node_stats("node-id-1")
For statistics that require computation parameters (e.g. histogram), fetch_stats_with_args will be the entry point once those statistics are available. Schema selection and parameters are both encoded inside stats_args — there is no separate dataset_schema.
Statistics Requiring Arguments
histogram is not yet available
histogram is partially implemented but is currently under validation. The API and example below describe the intended interface for when it becomes available — do not rely on it in production yet.
histogram needs to know the bin edges for each column upfront. The bins must be identical across all nodes so that the per-node counts can be summed into a global histogram. Use fetch_stats_with_args and supply both schema selection and parameters inside stats_args:
result = exp.analytics.fetch_stats_with_args(
    stats_args={
        "age": {"histogram": {"bin_edges": [0, 20, 40, 60, 80, 100]}},
        "income": {"histogram": {"bin_edges": [0, 25000, 50000, 75000, 100000]}},
    }
)
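Identical bin edges are what make the aggregation trivial: with the same bins everywhere, the per-node counts simply sum element-wise into the global histogram. The sketch below illustrates that step; node IDs and counts are made up.

```python
# Sketch of histogram aggregation under shared bin edges. Five edges pairs
# define five bins; each node reports one count per bin.
bin_edges = [0, 20, 40, 60, 80, 100]  # shared by every node
node_counts = {
    "NODE_abc": [5, 12, 20, 8, 3],
    "NODE_xyz": [4, 10, 18, 9, 2],
}

# Element-wise sum across nodes yields the global per-bin counts.
global_counts = [sum(c) for c in zip(*node_counts.values())]
```

If nodes used different edges, the per-bin counts would not be comparable and could not be summed — hence the requirement to fix the edges upfront.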
fetch_stats_with_args vs fetch_stats
fetch_stats_with_args does not accept a dataset_schema argument — schema selection is embedded in stats_args. Use fetch_stats for all statistics that do not require extra parameters.
FAResult Reference
fetch_stats returns an FAResult object. Rather than returning a plain dict immediately, FedBioMed gives you this object for two reasons:
- Lazy aggregation. Per-node outputs are stored as-is. Global aggregation only happens when you call a method like global_stats(), so you pay the cost only when you need the result.
- Incremental merging. If you make a second fetch_stats call for a different statistic (but the same nodes and schema), the new results are merged into the same FAResult — no data is lost and no nodes are re-contacted for stats already in the cache.
- Per-node access. You can inspect what each individual node returned before aggregation, which is useful for detecting data quality issues or comparing site-specific distributions.
What it stores
Internally, FAResult holds a dict mapping each node ID to that node's raw output tree. For a tabular dataset the output tree looks like:
{
    "NODE_abc": {
        "age": {"mean": 47.2, "count": 100, "variance": 120.5},
        "income": {"mean": 55000, "count": 100, "variance": 8.2e8},
    },
    "NODE_xyz": {
        "age": {"mean": 44.8, "count": 98, "variance": 110.3},
        "income": {"mean": 52000, "count": 98, "variance": 7.9e8},
    },
}
FAResult leaf structure
Each leaf is a dict of raw stat primitives. Aggregating the mean, for example, requires both mean and count (to compute a weighted average), which is why each leaf stores multiple keys even when you only asked for one statistic.
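The same reasoning extends to variance: the global (population) variance over all nodes is derivable from exactly those three per-node primitives via the law of total variance. The helper below is an illustrative sketch of that derivation, not Fed-BioMed's aggregation code.

```python
# Sketch of aggregating leaves that each hold {"count", "mean", "variance"}.
def aggregate_leaves(leaves):
    n = sum(l["count"] for l in leaves)
    mean = sum(l["count"] * l["mean"] for l in leaves) / n
    # Law of total variance: within-node spread plus between-node spread.
    variance = sum(
        l["count"] * (l["variance"] + (l["mean"] - mean) ** 2) for l in leaves
    ) / n
    return {"count": n, "mean": mean, "variance": variance}
```

This is why dropping any one of the three primitives would make the leaf insufficient: without per-node counts, neither the weighted mean nor the pooled variance can be reconstructed.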
Computable vs available stats
- available_stats() — the raw stat keys present at the leaves (e.g. ["count", "mean", "variance"]). These are what was actually computed on the nodes.
- computable_stats() — the higher-level statistics that can be derived from those raw keys (e.g. ["count", "mean", "std", "sum", "variance"]). std is computable even if it was never sent by the nodes, because it can be derived from mean, variance, and count.
print(result.available_stats()) # ['count', 'mean', 'variance']
print(result.computable_stats()) # ['count', 'mean', 'std', 'sum', 'variance']
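The derivation behind those extra computable entries is simple arithmetic over the stored primitives. A hypothetical illustration (the leaf values are made up; the nodes only ever transmitted count, mean, and variance):

```python
import math

# One stored leaf, exactly as the nodes returned it.
leaf = {"count": 100, "mean": 47.2, "variance": 120.5}

std = math.sqrt(leaf["variance"])      # 'std' derived, never transmitted
total = leaf["count"] * leaf["mean"]   # 'sum' reconstructed the same way
```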
Accessing results
# Globally aggregated value for one statistic — same structure as per-node output
result.global_stats("mean")
# → {'age': 45.8, 'income': 53200, ...}
# All computable stats merged into {column: {stat_name: value}}
result.global_stats()
# → {'age': {'mean': 45.8, 'std': 10.9, 'variance': 118.7, ...}, ...}
# Raw output from one node — useful for site-level analysis or debugging
result.node_stats("NODE_abc")
# Outputs from all nodes
result.node_stats()
# Inspect the output tree structure without raw values
result.schema
# → {'age': {}, 'income': {}, ...}
# Which nodes replied?
result.node_ids
Methods summary
| Method / property | Description |
|---|---|
| result.global_stats(stat_name) | The globally aggregated value for one statistic (e.g. "mean") |
| result.global_stats() | All computable stats aggregated and merged into {column: {stat_name: value}} |
| result.node_stats(node_id) | The raw statistics returned by a single node, before aggregation |
| result.node_stats() | Dict of all per-node raw outputs keyed by node ID |
| result.computable_stats() | List of statistic names that can be derived from the stored data |
| result.available_stats() | List of raw stat keys present in the stored output |
| result.node_ids | List of node IDs whose results are stored |
| result.schema | The output tree structure with leaf values replaced by {} — useful for programmatic inspection |
Common Errors & Troubleshooting
| Error message | Cause | Fix |
|---|---|---|
| 'stats' must be a non-empty string or list | fetch_stats([]) was called with an explicit empty list | Pass at least one statistic name, e.g. fetch_stats("mean"), or omit stats to use the default ["count", "mean", "variance"] |
| Federated Analytics are not allowed on this node | A node has allow_federated_analytics = False in its config | Ask the node operator to enable the flag — see Federated Analytics — Nodes |