FA Tutorial 1 — Tabular Dataset¶
Federated Analytics (FA) lets researchers compute statistics across distributed datasets without any raw data leaving the nodes. Each node computes partial statistics locally; the researcher aggregates them globally.
This notebook covers FA on tabular (CSV) datasets. Companion notebooks cover image and medical folder datasets.
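The split between local computation and global aggregation can be sketched in plain Python. This is only an illustration of the idea, not Fed-BioMed code: the `Partial` type and the `local_partial` and `global_mean` helpers are hypothetical names, and the two lists stand in for datasets held on separate nodes.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """Partial statistics a node might report for one column (hypothetical type)."""
    count: int
    total: float


def local_partial(values):
    # Runs on the node: only the count and the sum leave the site, not the rows.
    return Partial(count=len(values), total=float(sum(values)))


def global_mean(partials):
    # Runs on the researcher side: combine the partials into one global mean.
    n = sum(p.count for p in partials)
    return sum(p.total for p in partials) / n


# Toy data standing in for two nodes' private columns.
node_a = [10.0, 12.0, 14.0]
node_b = [20.0]

partials = [local_partial(node_a), local_partial(node_b)]
print(global_mean(partials))  # 14.0
```

The result equals the mean of the pooled rows, even though the researcher side only ever sees two small summaries.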
What you will learn¶
- How to run one-liner FA convenience methods (mean, variance, etc.)
- How to request multiple statistics in a single round-trip
- How to filter columns with dataset_schema
- How to inspect and aggregate results using the FAResult API
- How caching avoids redundant network round-trips
Before You Start¶
Configure nodes¶
# Node 1
fedbiomed node -p my-node-1 dataset add # add a CSV dataset with tag 'tabular'
fedbiomed node -p my-node-1 start
# Node 2 (optional)
fedbiomed node -p my-node-2 dataset add
fedbiomed node -p my-node-2 start
Wait until you see Starting task manager in each node terminal.
Start the researcher¶
fedbiomed researcher start
Note on FA permissions: FA is enabled by default. A node administrator can disable it by setting
allow_federated_analytics = False in the node config under [security].
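To check what such a setting evaluates to, a minimal sketch using Python's standard `configparser` is shown below. It assumes the node config is an INI-style file; the `NODE_CONFIG` string is a hypothetical excerpt, and only the `[security]` section name and the `allow_federated_analytics` key come from the note above.

```python
import configparser

# Hypothetical excerpt of a node configuration file (assumed INI format).
NODE_CONFIG = """
[security]
allow_federated_analytics = False
"""

parser = configparser.ConfigParser()
parser.read_string(NODE_CONFIG)

# Fall back to True when the key is absent, mirroring the stated default.
fa_enabled = parser.getboolean(
    "security", "allow_federated_analytics", fallback=True
)
print("FA enabled:", fa_enabled)
```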
Create an Experiment¶
Experiment discovers all nodes that have a dataset with the given tag and sets up the analytics property automatically.
from fedbiomed.researcher.federated_workflows import Experiment
tags = ['tabular']
exp = Experiment(tags=tags)
2026-03-11 11:37:16,262 fedbiomed DEBUG - adding handler for: SECURITY_FILE
2026-03-11 11:37:16,349 fedbiomed INFO - Starting researcher service...
2026-03-11 11:37:16,350 fedbiomed INFO - Waiting 3s for nodes to connect...
2026-03-11 11:37:19,354 fedbiomed INFO - Updating training data. This action will update FederatedDataset, and the nodes that will participate to the experiment.
2026-03-11 11:37:19,369 fedbiomed INFO - Node selected for training -> Default Node Alias Node ID is -> NODE_e1d11980-12dd-4fde-a356-6554d68c593d
2026-03-11 11:37:19,369 fedbiomed INFO - Node selected for training -> Default Node Alias Node ID is -> NODE_d0e82145-2311-46cb-93f8-922f36f4b71d
Inspect the datasets discovered on nodes. Each entry shows the dataset metadata (columns, types, tags) registered by the node administrator.
exp.training_data().data()
{'NODE_e1d11980-12dd-4fde-a356-6554d68c593d': {'name': 'csv',
'data_type': 'csv',
'tags': ['tabular'],
'description': 'tabular dataset',
'shape': {'csv': [10668, 7]},
'dtypes': {'year': 'Int64',
'price': 'Int64',
'transmission': 'Int64',
'mileage': 'Int64',
'tax': 'Int64',
'mpg': 'Float64',
'engineSize': 'Float64'},
'dataset_id': 'dataset_0ca1dd3b-6bb5-41a4-86ef-2f6b3a905ff3',
'dataset_parameters': {}},
'NODE_d0e82145-2311-46cb-93f8-922f36f4b71d': {'name': 'cars',
'data_type': 'csv',
'tags': ['tabular'],
'description': 'toy-dataaset',
'shape': {'csv': [17965, 7]},
'dtypes': {'year': 'Int64',
'price': 'Int64',
'transmission': 'Int64',
'mileage': 'Int64',
'tax': 'Int64',
'mpg': 'Float64',
'engineSize': 'Float64'},
'dataset_id': 'dataset_15f67e62-e534-49e0-a080-d42849878e3d',
'dataset_parameters': {}}}
Convenience Methods¶
The analytics property exposes one-liner methods for the five most common statistics. Each method:
- Sends the request to all nodes (only on the first call; subsequent calls use the cache).
- Receives per-node partial results.
- Returns the globally aggregated value directly.
# Compute the global mean across all columns and all nodes
exp.analytics.mean()
{'year': 2016.9620209933992,
'price': 16235.200740670456,
'transmission': 1.0212691821941424,
'mileage': 23908.94412840134,
'tax': 118.05766210348528,
'mpg': 55.24739984969232,
'engineSize': 1.5668631812010942}
# Other convenience methods
print('Variance:', exp.analytics.variance())
# results can come from cache (no new network round-trip)
print('Count: ', exp.analytics.count())
Variance: {'year': 4.400819618015132, 'price': 91583707.107012, 'transmission': 0.305044955392466, 'mileage': 444227213.02946347, 'tax': 4131.056130965234, 'mpg': 138.7155247721354, 'engineSize': 0.33134486956551856}
Count: {'year': 28632, 'price': 28633, 'transmission': 28633, 'mileage': 28633, 'tax': 28633, 'mpg': 28633, 'engineSize': 28633}
Exploring the FAResult Object¶
fetch_stats gives full control and returns a raw FAResult — useful when you want to inspect per-node data, check what statistics are available, or retrieve multiple stats together.
result = exp.analytics.fetch_stats('mean')
# Which nodes replied?
print('Node IDs:', result.node_ids)
Node IDs: ['NODE_e1d11980-12dd-4fde-a356-6554d68c593d', 'NODE_d0e82145-2311-46cb-93f8-922f36f4b71d']
# Schema mirrors the structure of node outputs, with stat-leaf positions shown as {}
result.schema
{'year': {},
'price': {},
'transmission': {},
'mileage': {},
'tax': {},
'mpg': {},
'engineSize': {}}
# Which stat keys are stored at leaves (across all nodes)?
print('Available stats:', result.available_stats())
# Which can be aggregated globally?
print('Computable stats:', result.computable_stats())
Available stats: ['count', 'mean', 'variance']
Computable stats: ['count', 'mean', 'std', 'sum', 'variance']
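One plausible reading of why `std` and `sum` are computable although they are not stored at the leaves: both follow from `count`, `mean`, and `variance`. The sketch below pools the per-node `year` leaves printed by `result.node_stats()` in this notebook. The `pooled` helper is hypothetical, and it assumes the nodes report sample variance (ddof=1); it is not taken from the Fed-BioMed source.

```python
import math

# Per-node leaves for the 'year' column, copied from this notebook's
# node_stats() output.
year_leaves = [
    {"count": 10667, "mean": 2017.100684353619, "variance": 4.6984687},
    {"count": 17965, "mean": 2016.8665738936904, "variance": 4.2039175},
]


def pooled(leaves):
    """Combine count/mean/variance leaves, assuming sample variance (ddof=1)."""
    n = sum(leaf["count"] for leaf in leaves)
    total = sum(leaf["count"] * leaf["mean"] for leaf in leaves)  # sum = count * mean
    mean = total / n
    # Within-node plus between-node sums of squared deviations.
    ss = sum(
        (leaf["count"] - 1) * leaf["variance"]
        + leaf["count"] * (leaf["mean"] - mean) ** 2
        for leaf in leaves
    )
    variance = ss / (n - 1)
    return {
        "count": n,
        "sum": total,
        "mean": mean,
        "variance": variance,
        "std": math.sqrt(variance),
    }


year = pooled(year_leaves)
print(year)
```

Under these assumptions the pooled mean and variance agree closely with the `year` values shown by `result.global_stats('mean')` and `exp.analytics.variance()` in this notebook, which suggests, but does not prove, that the library aggregates in a similar way.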
# Raw per-node outputs — useful for debugging or site-level comparisons
result.node_stats()
{'NODE_e1d11980-12dd-4fde-a356-6554d68c593d': {'year': {'mean': 2017.100684353619,
'count': 10667,
'variance': 4.6984687},
'price': {'mean': 22896.685039370048,
'count': 10668,
'variance': 137237520.0},
'transmission': {'mean': 1.0827709036370428,
'count': 10668,
'variance': 0.58366114},
'mileage': {'mean': 24827.244000749928,
'count': 10668,
'variance': 552497100.0},
'tax': {'mean': 126.0114360704911, 'count': 10668, 'variance': 4511.848},
'mpg': {'mean': 50.770022497330956, 'count': 10668, 'variance': 167.69684},
'engineSize': {'mean': 1.930708661415088,
'count': 10668,
'variance': 0.36355674}},
'NODE_d0e82145-2311-46cb-93f8-922f36f4b71d': {'year': {'mean': 2016.8665738936904,
'count': 17965,
'variance': 4.2039175},
'price': {'mean': 12279.756415251926,
'count': 17965,
'variance': 22480710.0},
'transmission': {'mean': 0.9847481213470743,
'count': 17965,
'variance': 0.13603991},
'mileage': {'mean': 23363.630503757344,
'count': 17965,
'variance': 379163260.0},
'tax': {'mean': 113.33453938213202, 'count': 17965, 'variance': 3845.2944},
'mpg': {'mean': 57.90699137215473, 'count': 17965, 'variance': 102.535416},
'engineSize': {'mean': 1.3508266072919597,
'count': 17965,
'variance': 0.186945}}}
# Globally aggregated value for a single stat; same structure as node output
result.global_stats('mean')
{'year': 2016.9537929589344,
'price': 16235.38085425909,
'transmission': 1.0212691649495393,
'mileage': 23908.93937065627,
'tax': 118.05766074110295,
'mpg': 55.2479202319801,
'engineSize': 1.5668773792468906}
Filtering Columns with dataset_schema¶
By default FA runs over all columns. Pass a list of column names as dataset_schema to restrict the computation.
# Replace 'year' and 'price' with actual column names from your dataset
result = exp.analytics.fetch_stats(stats='mean', dataset_schema=['year', 'price'])
result.global_stats('mean')
{'year': 2016.9620209933992, 'price': 16235.200740670456}
Caching¶
Fed-BioMed caches the last FAResult per experiment. Calling fetch_stats again with the same arguments returns the cached result instantly — no network round-trip occurs.
The cache is invalidated when a different statistic, dataset_schema, or stats_args is requested, or after a new training round.
# First call — contacts the nodes
result1 = exp.analytics.fetch_stats('mean')
# Second call — served from cache (instant)
result2 = exp.analytics.fetch_stats('mean')
print('Cached:', result1 is result2) # True
Cached: True
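The caching behaviour described above can be modelled as a single-entry cache keyed on the request arguments. The sketch below is illustrative only: `LastResultCache` and `fake_fetch` are hypothetical names and do not reflect how Fed-BioMed implements its cache internally.

```python
class LastResultCache:
    """Hypothetical single-entry cache keyed on the request arguments."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable that performs the (expensive) request
        self._key = None
        self._value = None
        self.calls = 0       # number of real fetches, kept for illustration

    def get(self, stats, dataset_schema=None, stats_args=None):
        key = (
            stats,
            None if dataset_schema is None else tuple(dataset_schema),
            tuple(sorted((stats_args or {}).items())),
        )
        if key != self._key:
            # Arguments changed (or nothing cached yet): fetch and overwrite.
            self.calls += 1
            self._value = self._fetch(stats, dataset_schema, stats_args)
            self._key = key
        return self._value

    def invalidate(self):
        # For example, after a new training round.
        self._key = None
        self._value = None


def fake_fetch(stats, dataset_schema, stats_args):
    # Stand-in for a network round-trip; returns a fresh object each time.
    return {"stats": stats, "dataset_schema": dataset_schema}


cache = LastResultCache(fake_fetch)
r1 = cache.get("mean")
r2 = cache.get("mean")                                    # same arguments: cached
r3 = cache.get("mean", dataset_schema=["year", "price"])  # new arguments: refetch
print(r1 is r2, r1 is r3, cache.calls)  # True False 2
```

Because only the last result is kept, alternating between two different requests would trigger a fetch every time in this model.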