Deploying Datasets in Nodes

Deploying datasets in nodes makes them available for federated training in Fed-BioMed experiments. In Fed-BioMed, deploying a dataset essentially means providing metadata for it. One node can deploy multiple datasets. Once a dataset is deployed, its metadata is saved in the node's database, so it persists even after the node is stopped and restarted.

Each dataset should have at least the following attributes:

  • Database Name: A user-readable short name of the dataset for display purposes.
  • Description: A longer description giving more details about the specifics of each dataset.
  • Tags: A list of tags that uniquely identifies the dataset on the node and is used by the federated training process to find it.
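
As an illustration only, the three attributes can be pictured as a small record. The sketch below is not a Fed-BioMed class: `DatasetMetadata` and `validate` are hypothetical names, and the rule that every attribute must be non-empty is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetMetadata:
    """Hypothetical container for the attributes listed above (not part of Fed-BioMed)."""

    name: str                                       # short, user-readable name
    description: str                                # longer free-text description
    tags: List[str] = field(default_factory=list)   # tags used to find the dataset

    def validate(self) -> None:
        """Raise ValueError if an attribute is empty (assumed rule for this sketch)."""
        if not self.name.strip():
            raise ValueError("name must not be empty")
        if not self.description.strip():
            raise ValueError("description must not be empty")
        if not self.tags:
            raise ValueError("at least one tag is required")


meta = DatasetMetadata(
    name="My Dataset",
    description="Dummy CSV data",
    tags=["#my-csv-data", "#csv-dummy-data"],
)
meta.validate()  # passes silently for a complete record
```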

Requirements

Fed-BioMed does not support downloading datasets from remote sources, except for the default MNIST and MedNIST datasets. Therefore, before adding a dataset to a node, make sure that you have already prepared your dataset and saved it on the node's file system.

Adding a dataset using the Fed-BioMed CLI

Use the following command to add a dataset into the node identified by the config-n1.ini file.

$ ${FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config-n1.ini dataset add

Afterward, you will see the following screen in your terminal.

# Using configuration file: config-n1.ini #
Welcome to the Fed-BioMed CLI data manager
Please select the data type that you're configuring:
    1) csv
    2) default
    3) mednist
    4) images
    5) medical-folder
    6) flamby
select:

The prompt asks which kind of dataset you would like to add. The default and mednist options automatically download and add the MNIST and MedNIST datasets, respectively. To deploy your own data, select the csv, images, medical-folder, or flamby option according to your needs. After you select an option, you will be asked additional questions that cover both common and option-specific details.

For example, suppose that you want to add a CSV dataset. Type 1 and press Enter. The interface will then ask for the common elements: name, tags, and description.

Name of the database: My Dataset
Tags (separate them by comma and no spaces): #my-csv-data,#csv-dummy-data
Description: Dummy CSV data
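
The comma-separated tag input maps naturally to a list of strings, which is the form shown later in the dataset summary. The following is a minimal sketch of that conversion, not the CLI's actual code: `parse_tags` is a hypothetical helper, and stripping stray whitespace and dropping empty entries are choices of this sketch (the prompt itself asks for no spaces).

```python
from typing import List


def parse_tags(raw: str) -> List[str]:
    """Split a comma-separated tag string into a list of tags.

    Hypothetical helper: trimming whitespace and skipping empty
    entries are assumptions of this sketch, not documented behavior.
    """
    tags = [tag.strip() for tag in raw.split(",")]
    return [tag for tag in tags if tag]


tags = parse_tags("#my-csv-data,#csv-dummy-data")
print(tags)  # ['#my-csv-data', '#csv-dummy-data']
```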

If a graphical interface is available, the next step opens a file browser window and asks you to select your CSV file. If no graphical interface is available, you will be prompted to enter the full path to the file.

After selecting the file, you will be shown the details of your dataset.

Great! Take a look at your data:
name        data_type    tags                                 description     shape      path               dataset_id                                    dataset_parameters
----------  -----------  -----------------------------------  --------------  ---------  -----------------  --------------------------------------------  --------------------
My Dataset  csv          ['#my-csv-data', '#csv-dummy-data']  Dummy CSV data  [300, 20]  /path/to/your.csv  dataset_<id>
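
To cross-check the shape column against your own file, you can count its rows and columns independently. The sketch below uses only the Python standard library. `csv_shape` is a hypothetical helper, and it assumes that the first line is a header that is not counted as a row; Fed-BioMed's own shape computation may differ.

```python
import csv
from typing import List


def csv_shape(path: str, has_header: bool = True) -> List[int]:
    """Return [n_rows, n_columns] for a CSV file.

    Assumption of this sketch: when has_header is True, the first
    line holds column names and is excluded from the row count.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        first = next(reader, None)
        if first is None:            # empty file
            return [0, 0]
        n_rows = sum(1 for _ in reader)
        if not has_header:
            n_rows += 1              # the first line was a data row
        return [n_rows, len(first)]


# Example call (the path is a placeholder):
# print(csv_shape("/path/to/your.csv"))
```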

You can also check the list of deployed datasets by using the following command:

$ ${FEDBIOMED_DIR}/scripts/fedbiomed_run node --config config-n1.ini dataset list

It will return the datasets saved into the node identified by the config-n1.ini file.

How to Add Another Dataset to the Same Node

Nodes can store multiple datasets. You can follow the previous steps as many times as needed to add other datasets.

Adding datasets with the same path

Using the same files or the same path for multiple datasets is allowed, provided that the tags are unique.

Conflicting tags between datasets

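The tags of one dataset cannot be equal to, or a subset of, the tags of another dataset on the same node. The check applies in both directions: a new dataset is also refused if the tags of an existing dataset are a subset of its tags.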

For example, the CLI on a node:

  • accepts registering dataset1 with tags [ 'tag1', 'tag3' ], dataset2 with tags [ 'tag1', 'tag2' ], and dataset3 with tags [ 'tag2', 'tag3' ], because no tag list is equal to or a subset of another
  • refuses to register dataset1 with tags [ 'tag1', 'tag2' ] if dataset2 with tags [ 'tag1' ] already exists
  • refuses to register dataset1 with tags [ 'tag1', 'tag2' ] if dataset2 with tags [ 'tag1', 'tag2', 'tag3' ] already exists
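
The three cases above can be reproduced with plain set comparisons. This is a sketch of the rule as it reads from these examples, not Fed-BioMed's implementation; `tags_conflict` and `can_register` are hypothetical names.

```python
from typing import Iterable, List


def tags_conflict(new_tags: Iterable[str], existing_tags: Iterable[str]) -> bool:
    """True if either tag list is equal to, or a subset of, the other."""
    new, old = set(new_tags), set(existing_tags)
    return new <= old or old <= new


def can_register(new_tags: Iterable[str], registered: List[List[str]]) -> bool:
    """True if new_tags conflicts with none of the already registered tag lists."""
    new = list(new_tags)
    return not any(tags_conflict(new, tags) for tags in registered)


# Case 1: three pairwise overlapping, but non-nested, tag lists are all accepted.
registered: List[List[str]] = []
for tags in (["tag1", "tag3"], ["tag1", "tag2"], ["tag2", "tag3"]):
    assert can_register(tags, registered)
    registered.append(tags)

# Case 2: the existing tags are a subset of the new ones, so registration is refused.
print(can_register(["tag1", "tag2"], [["tag1"]]))                  # False

# Case 3: the new tags are a subset of the existing ones, so registration is refused.
print(can_register(["tag1", "tag2"], [["tag1", "tag2", "tag3"]]))  # False
```

Comparing in both directions is what makes the second case fail: a one-way subset test on the new tags alone would let it through.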