Supercomputer Federation for AI

The “FL across Supercomputers” (FLaS) experiment

EuroHPC user day


The Federated Learning across Supercomputers (FLaS) experiment has been designed to explore three research directions:

  1. The usage of supercomputers to train and fine-tune foundational AI models.
  2. The possibility of using federated supercomputers to go beyond the computing capability of a single site by aggregating the computing power from different sites to solve more significant problems (e.g., training larger models).
  3. The use of Federated Learning (FL) to virtually pool different datasets while keeping each one private to its data owner.

To our knowledge, FLaS is the first experiment of federated training of foundational models across supercomputers.

Visualizing Federated Learning (FL)

Recent years have seen crucial advances in Artificial Intelligence (AI) systems, supported by the ubiquitous availability of datasets and processing elements. The consequent deployment of ML methods throughout many industries has been a welcome innovation, albeit one that has raised new concerns across multiple dimensions, such as performance, energy efficiency, privacy, criticality, and security. Concerns about data access and movement are particularly felt in sectors such as healthcare, defense, finance, and any other sector handling sensitive data of one kind or another.

Federated Learning (FL) is a learning paradigm where multiple parties (clients) collaborate in solving a learning task using their private data. Importantly, each client’s local data never leaves the client’s systems since, in its most common configuration, clients collaborate by exchanging local models instead of moving the data. An aggregator collects and aggregates the local models to produce a global model. The global model is then sent back to the clients, which use it to update their local models and then train them further on their private data. This process is repeated until the global model converges to a satisfactory solution or a maximum number of rounds is reached.
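
The round structure described above can be sketched in a few lines of Python. This is only a toy illustration, not the FLaS implementation: it assumes the aggregator computes a sample-weighted average of the local models (FedAvg-style), which the text does not specify, and the names `local_update`, `aggregate`, and `federated_training` are hypothetical.

```python
from typing import Dict, List

Model = Dict[str, List[float]]  # parameter name -> flat list of weights


def local_update(model: Model, private_data: List[float], lr: float = 0.1) -> Model:
    """Toy local training: nudge every weight toward the mean of the
    client's private data. Stands in for real gradient-based training."""
    target = sum(private_data) / len(private_data)
    return {
        name: [w + lr * (target - w) for w in weights]
        for name, weights in model.items()
    }


def aggregate(local_models: List[Model], num_samples: List[int]) -> Model:
    """Assumed aggregation rule: average of the local models, weighted by
    the number of samples held by each client."""
    total = sum(num_samples)
    first = local_models[0]
    return {
        name: [
            sum(m[name][i] * n for m, n in zip(local_models, num_samples)) / total
            for i in range(len(first[name]))
        ]
        for name in first
    }


def federated_training(
    global_model: Model, client_datasets: List[List[float]], rounds: int
) -> Model:
    """Repeat: send the global model out, train locally, aggregate."""
    for _ in range(rounds):
        # Only models are exchanged; each dataset is read solely by its owner.
        local_models = [local_update(global_model, data) for data in client_datasets]
        sizes = [len(data) for data in client_datasets]
        global_model = aggregate(local_models, sizes)
    return global_model
```

With two clients holding equally sized datasets, repeated rounds pull the single toy weight toward a value that reflects both datasets, even though neither dataset is ever copied to the other client or to the aggregator.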

The FLaS experiment explained

The FLaS experiment aims to train the LLaMA-2 model on two EuroHPC supercomputers using the FL approach: Leonardo at CINECA (Italy) and Karolina at IT4Innovations (Czech Republic). LLaMA-2 (Large Language Model Meta AI) is a state-of-the-art foundational large language model designed to help researchers advance their work in this subfield of AI. The experiment trains a single bilingual model using two distinct language corpora, each kept private at its own site:

  • The Italian Clean Common Crawl’s web crawl corpus (clean_mc4_it – 102GB), which is kept private in Italy within Leonardo storage;
  • The Czech Common Crawl’s web crawl corpus (mc4_cs – 169GB), which is kept private in the Czech Republic within Karolina storage.

The FL protocol is described as a cyclic workflow in the Common Workflow Language (CWL) and executed by StreamFlow [2], which supports distributed cloud-HPC workflow deployments. The FLaS workflow runs under the control of a StreamFlow orchestrator hosted on a cloud VM. At each round, the orchestrator produces SLURM jobs that take the aggregated LLaMA model as input and submits them to the different supercomputers. It then gathers the job results (a set of LLaMA models, each further trained on one supercomputer) and moves them to the cloud for the aggregation step. Training and validation datasets remain stored on the owner’s supercomputer; they are never moved.
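
The control flow of this cyclic workflow (scatter the aggregated model, run one job per site, gather the results, aggregate, repeat) can be mimicked with a small Python sketch. It does not use StreamFlow, CWL, or SLURM: `Site`, `run_training_job`, and `orchestrate` are hypothetical names, threads stand in for remote batch jobs, and the aggregation is assumed to be a plain element-wise mean.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List


@dataclass
class Site:
    name: str
    private_corpus: List[float]  # stays at the site; never sent to the orchestrator


def run_training_job(site: Site, model: List[float]) -> List[float]:
    """Hypothetical stand-in for a batch job: receives the aggregated model,
    trains on the site's private data, and returns only the updated model."""
    target = sum(site.private_corpus) / len(site.private_corpus)
    return [w + 0.5 * (target - w) for w in model]


def aggregate(models: List[List[float]]) -> List[float]:
    """Assumed aggregation step: element-wise mean, run by the orchestrator."""
    return [sum(ws) / len(ws) for ws in zip(*models)]


def orchestrate(sites: List[Site], model: List[float], rounds: int) -> List[float]:
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        for _ in range(rounds):
            # Scatter: one job per site, all starting from the same model.
            futures = [pool.submit(run_training_job, s, model) for s in sites]
            # Gather: wait for every site before aggregating (synchronous round).
            trained = [f.result() for f in futures]
            model = aggregate(trained)
    return model


sites = [Site("leonardo", [0.0]), Site("karolina", [4.0])]
model_after_one_round = orchestrate(sites, [0.0, 0.0], rounds=1)
```

Note that only model weights cross the boundary of `run_training_job`; the `private_corpus` field is read exclusively inside the job, mirroring the fact that datasets are never moved.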

Methods and Tools

The FLaS experiment is a medium-scale experiment built on a previous small-scale experiment presented in the “Federated Learning meets HPC and cloud” paper [1], which described the federated training of VGG16 across the Marconi100 supercomputer (with the MNIST dataset) and the HPC4AI@UNITO OpenStack cloud (with the SVHN dataset). The approach is general enough to be reused in FLaS: thanks to the flexibility offered by StreamFlow, the same workflow used in [1] works with two supercomputers and much larger networks and datasets. This does not mean that the two experiments exhibit the same deployment complexity. Scalability does not come for free, and the FLaS experiment itself is designed to study the scalability problems of AI on supercomputers and to distill a methodology that makes the training of foundational models scalable across one or more supercomputers.

The training of LLaMA-2 (described below) exploits both model and data parallelism. Model parallelism is mostly used to fit the network in memory, since the network easily exceeds the memory capacity of any modern GPU. Model parallelism is exploited across the GPUs of the same node (at least 4 A100 GPUs), whereas data parallelism (in a distributed learning fashion) is exploited (via MPI) across different nodes to reduce training time.
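
To make the two levels of parallelism concrete, the sketch below computes which processes would cooperate in each kind of group. It assumes one process per GPU and a rank numbering in which ranks on the same node are consecutive; both are illustrative assumptions, and `build_groups` is a hypothetical helper, not part of the FLaS code.

```python
from typing import Dict, List


def build_groups(num_nodes: int, gpus_per_node: int) -> Dict[str, List[List[int]]]:
    """Return the rank groups for a hybrid layout:
    - model parallelism inside each node (the model is split over its GPUs);
    - data parallelism across nodes (ranks holding the same model shard
      on different nodes synchronize with each other)."""
    model_parallel = [
        [node * gpus_per_node + gpu for gpu in range(gpus_per_node)]
        for node in range(num_nodes)
    ]
    data_parallel = [
        [node * gpus_per_node + gpu for node in range(num_nodes)]
        for gpu in range(gpus_per_node)
    ]
    return {"model_parallel": model_parallel, "data_parallel": data_parallel}


groups = build_groups(num_nodes=2, gpus_per_node=4)
```

With two nodes of four GPUs each, every node forms one model-parallel group of four ranks, and each of the four model shards has a data-parallel group of two ranks, one per node.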


The StreamFlow framework is a container-native Workflow Management System written in Python3 and based on the Common Workflow Language (CWL) open standard. Developed and maintained by the Parallel Computing research group at Università di Torino (UniTO), it has been designed around the following principles:

  • Allowing the execution of a workflow in multi-container environments to support the concurrent execution of multiple communicating steps across distributed heterogeneous infrastructures: supercomputers, cloud, private clusters, and even laptops.
  • Realizing full separation of concerns between workflow business logic (steps and their data dependencies) and deployment plan. The same workflow can be deployed on a different platform or set of platforms by changing only the deployment plan of workflow steps while keeping the business logic unchanged. Deployment modules can be plugged in to extend the default methods (SLURM, PBS, LSF, K8S, AWS, OpenStack, SSH, etc.).
  • Collective operators (scatter, gather) and cyclic workflows are first-class concepts in CWL and StreamFlow. Thanks to CAPIO, we are introducing streaming across successive workflow steps.
  • Relaxing the requirement of a single shared data space to allow for hybrid workflow executions on top of multi-cloud or hybrid cloud/HPC infrastructures.

StreamFlow source code is available on GitHub under the LGPLv3 license. Moreover, a Python package is downloadable from PyPI and Docker containers can be found on Docker Hub.

The Large Language Model Meta AI (LLaMA) model

Over the last year, large language models — natural language processing (NLP) systems with billions of parameters — have shown new capabilities to generate creative text, solve mathematical theorems, predict protein structures, answer reading comprehension questions, and more. They are one of the clearest cases of the substantial potential benefits AI can offer at scale to billions of people. Even with all the recent advancements in large language models, full research access to them remains limited because of the required resources to train and run such large models. This restricted access has limited researchers’ ability to understand how and why these large language models work, hindering efforts to improve their robustness and mitigate known issues, such as bias, toxicity, and the potential for generating misinformation [7].

Meta publicly released LLaMA (Large Language Model Meta AI) in February 2023 as a state-of-the-art foundational large language model designed to help researchers advance their work in this subfield of AI. LLaMA is available in several sizes (7B, 13B, 33B, and 65B parameters). LLaMA-2, released in July 2023, is an updated version of LLaMA that is pre-trained on a new mix of publicly available data: its pre-training corpus is 40% larger, its context length is doubled, and it adopts grouped-query attention. LLaMA-2 is released with 7B, 13B, and 70B parameters [8].


Notice that the two corpora are public and are not even massive datasets. In principle, they could be moved to a single supercomputer to train the LLaMA model in the traditional way. Nevertheless, the case is a paradigmatic example of the privacy-preserving federated training or fine-tuning of a foundational model starting from a distributed pool of private datasets, each resident at a different site.

The LLaMA-2 model is available in several sizes (7B, 13B, and 70B parameters). For experimentation speed, we used the 7B version; the larger versions require more experimentation time and/or more GPUs (which, in turn, increases the job waiting time in the submission queue).
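
A back-of-the-envelope estimate shows why the larger versions need more GPUs. The sketch below assumes mixed-precision training with Adam at roughly 16 bytes of training state per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance), ignores activations, and uses a hypothetical 64 GB of memory per GPU. None of these figures are measurements from the experiment.

```python
import math

# Assumption: 2 (fp16 weights) + 2 (fp16 gradients) + 12 (fp32 master weights,
# Adam momentum, Adam variance) bytes per parameter; activations are ignored.
BYTES_PER_PARAM = 16
# Assumption: hypothetical memory available on one GPU, in GB (10**9 bytes).
GPU_MEMORY_GB = 64

# Parameter counts of the LLaMA-2 sizes mentioned in the text.
MODEL_SIZES = {"7B": 7_000_000_000, "13B": 13_000_000_000, "70B": 70_000_000_000}


def training_state_gb(num_params: int) -> float:
    """Approximate memory for weights, gradients, and optimizer state."""
    return num_params * BYTES_PER_PARAM / 1e9


def min_gpus(num_params: int) -> int:
    """Lower bound on the GPUs needed just to hold the training state."""
    return math.ceil(training_state_gb(num_params) / GPU_MEMORY_GB)


estimates = {name: (training_state_gb(n), min_gpus(n)) for name, n in MODEL_SIZES.items()}
```

Under these assumptions even the 7B version does not fit on a single GPU, which is consistent with splitting the model across the GPUs of a node, and the lower bound grows roughly linearly with the parameter count.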


Single supercomputer (Leonardo) distributed training benchmark (all figures refer to a single epoch of training).


Scalability in a federated execution is much harder to achieve. In addition to the overheads of a distributed execution on a single supercomputer, FL exhibits several other sources of potential overheads. At each round:

  • Model exchange time across a geographic network
  • Aggregation time (either centralized or distributed)
  • Load imbalance between different sites due to different dataset sizes and/or different computing power.
  • Job queue waiting time (typically the big problem)
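
These overheads can be combined into a rough per-round cost model. The sketch below assumes synchronous rounds, where the aggregator waits for the slowest site before aggregating; the class `SiteRound`, the functions, and all the timings are hypothetical and serve only to show how queue waiting time and load imbalance translate into idle time.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SiteRound:
    queue_wait_s: float  # time the job spends in the batch queue
    training_s: float    # local training time on the site's dataset
    transfer_s: float    # model download plus upload over the geographic network

    def total_s(self) -> float:
        return self.queue_wait_s + self.training_s + self.transfer_s


def round_time(sites: List[SiteRound], aggregation_s: float) -> float:
    """A synchronous round lasts as long as its slowest site, plus aggregation."""
    return max(s.total_s() for s in sites) + aggregation_s


def idle_times(sites: List[SiteRound]) -> List[float]:
    """Time each site spends waiting for the slowest one (load imbalance)."""
    slowest = max(s.total_s() for s in sites)
    return [slowest - s.total_s() for s in sites]


# Hypothetical timings in seconds, chosen only for illustration.
site_a = SiteRound(queue_wait_s=600.0, training_s=3600.0, transfer_s=120.0)
site_b = SiteRound(queue_wait_s=1800.0, training_s=5400.0, transfer_s=120.0)
total = round_time([site_a, site_b], aggregation_s=60.0)
idle = idle_times([site_a, site_b])
```

In this toy setting the faster site sits idle for most of an hour per round, which illustrates why queue waiting time and dataset imbalance can dominate the cost of a federated round.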

Lessons Learned

Modern AI depends on pre-trained models, which evolve very rapidly. The latest LLaMA-2 model can be trained only using the nightly build of PyTorch. Supercomputer users will likely be required to rebuild their tools and libraries frequently.

HPC federation that passes through standard workload managers (batch job queues) requires “some” synchronization between different supercomputers, e.g., using the BookedSLURM extension (which leverages advance reservations).

Two-factor authentication to access supercomputers needs to be automated for distributed workflow execution. The synchronization across different infrastructures should not depend on mobile phone authentication.


Gianluca Mittone, Alberto Mulone, Giulio Malenza, Robert Birke, Iacopo Colonnelli, Marco Aldinucci – Parallel Computing research group, University of Torino


Data resources: Valerio Basile, Marco Antonio Stranisci, Viviana Patti – University of Torino, Italy
Access to Leonardo supercomputer: Sanzio Bassini, Gabriella Scipione, Massimiliano Guarrasi – CINECA, Italy
Access to Karolina supercomputer: Jan Martinovic, Vit Vondrak – IT4Innovations, Czech Republic


  1. I. Colonnelli, B. Casella, G. Mittone, Y. Arfat, B. Cantalupo, R. Esposito, A. R. Martinelli, D. Medić, and M. Aldinucci, “Federated Learning meets HPC and cloud,” in Astrophysics and Space Science Proceedings, Catania, Italy, 2023, p. 193–199. doi:10.1007/978-3-031-34167-0_39
  2. I. Colonnelli, B. Cantalupo, I. Merelli, and M. Aldinucci, “StreamFlow: cross-breeding cloud with HPC,” IEEE Transactions on Emerging Topics in Computing, vol. 9, iss. 4, p. 1723–1737, 2021. doi:10.1109/TETC.2020.3019202
  3. I. Colonnelli, B. Cantalupo, R. Esposito, M. Pennisi, C. Spampinato, and M. Aldinucci, “HPC Application Cloudification: The StreamFlow Toolkit,” in 12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2021), Dagstuhl, Germany, 2021, p. 5:1–5:13. doi:10.4230/OASIcs.PARMA-DITAM.2021.5
  4. M. Aldinucci, E. M. Baralis, V. Cardellini, I. Colonnelli, M. Danelutto, S. Decherchi, G. D. Modica, L. Ferrucci, M. Gribaudo, F. Iannone, M. Lapegna, D. Medic, G. Muscianisi, F. Righetti, E. Sciacca, N. Tonellotto, M. Tortonesi, P. Trunfio, and T. Vardanega, “A Systematic Mapping Study of Italian Research on Workflows,” in Proceedings of the SC ’23 Workshops of The International Conference on High-Performance Computing, Network, Storage, and Analysis, SC-W 2023, Denver, CO, USA, 2023, p. 2065–2076. doi:10.1145/3624062.3624285
  5. G. Mittone, W. Riviera, I. Colonnelli, R. Birke, and M. Aldinucci, “Model-Agnostic Federated Learning,” in Euro-Par 2023: Parallel Processing, Limassol, Cyprus, 2023. doi:10.1007/978-3-031-39698-4_26
  6. M. Pennisi, F. Proietto Salanitri, G. Bellitto, B. Casella, M. Aldinucci, S. Palazzo, and C. Spampinato, “FedER: Federated Learning through Experience Replay and Privacy-Preserving Data Synthesis,” Computer Vision and Image Understanding, 2023. doi:10.1016/j.cviu.2023.103882
  7. Meta AI, “Introducing LLaMA: A foundational, 65-billion-parameter large language model,” last accessed Dec 2023,
  8. H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom, “LLaMA 2: Open foundation and fine-tuned chat models,” arXiv:2307.09288, 2023.