Massive growth in the cloud computing industry over the last decade has been accompanied by the rise of many new applications. Increasingly, such applications require that data not be shared publicly, or they rely on specialized hardware that must be located close to the user. In domains such as smart cities, industrial automation and data analytics, for example, this means that user requirements like ultra-low latency, security and location awareness are becoming ever more common.
Modern cloud applications have also become more complex, as they usually run on distributed computer systems and are divided into software components that must run at high availability. DECICE is addressing two major challenges that have resulted from this approach: (1) unifying such diverse systems into centrally controlled compute clusters, and (2) providing sophisticated scheduling decisions across those compute clusters.
Scheduling decisions for a cluster consisting of cloud and edge nodes must consider unique characteristics such as variability in node and network capacity. The common solution for orchestrating large clusters is Kubernetes, an open-source framework for deploying, scaling, and managing containerized applications. However, Kubernetes is designed for reliable homogeneous clusters. Moreover, although many applications and extensions are available for Kubernetes, none is designed to optimize both performance and energy usage, or to address data and job locality.
In DECICE, we are developing an open and portable cloud management framework that automatically and adaptively optimizes applications by mapping jobs to the most suitable compute resources in a heterogeneous system landscape, so that user-imposed requirements such as time-to-solution are met. Utilizing holistic monitoring, we construct a digital twin that continually analyses the original system. An AI scheduler makes decisions to optimize the distribution of jobs and data across the system. In addition, a virtual training environment generates test data for the training of machine learning models and the exploration of what-if scenarios. The portable framework is integrated into the Kubernetes ecosystem and validated using relevant use cases on real-world heterogeneous systems.
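To make the idea of mapping a job to the most suitable resource more concrete, the following is a minimal sketch of a cost-based placement decision. It is not the DECICE scheduler, which is AI-based. A hand-written heuristic stands in for it here. All names (`Node`, `Job`, `pick_node`), the cost formula, the weights and the example values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpus: int
    latency_ms: float     # assumed network latency between user and node
    watts_per_cpu: float  # assumed energy cost of one busy CPU
    holds_data: bool      # whether the job's input data is already on this node


@dataclass
class Job:
    cpus: int
    max_latency_ms: float  # user-imposed latency requirement


def pick_node(job, nodes, energy_weight=1.0, transfer_penalty=50.0):
    """Return the feasible node with the lowest illustrative cost, or None."""
    # Hard constraints: enough free CPUs and latency within the user's limit.
    feasible = [
        n for n in nodes
        if n.free_cpus >= job.cpus and n.latency_ms <= job.max_latency_ms
    ]
    if not feasible:
        return None

    def cost(n):
        # Hypothetical cost: latency plus weighted energy use plus a
        # fixed penalty when the data would first have to be moved.
        energy = energy_weight * n.watts_per_cpu * job.cpus
        locality = 0.0 if n.holds_data else transfer_penalty
        return n.latency_ms + energy + locality

    return min(feasible, key=cost)


# Made-up cluster with one cloud, one edge and one HPC node.
nodes = [
    Node("cloud-a", free_cpus=64, latency_ms=40.0, watts_per_cpu=5.0, holds_data=False),
    Node("edge-b", free_cpus=4, latency_ms=5.0, watts_per_cpu=8.0, holds_data=True),
    Node("hpc-c", free_cpus=128, latency_ms=60.0, watts_per_cpu=3.0, holds_data=False),
]
job = Job(cpus=2, max_latency_ms=50.0)
chosen = pick_node(job, nodes)
print(chosen.name if chosen else "no feasible node")
```

In this toy setup the HPC node is excluded by the latency limit, and the edge node is preferred over the cloud node because it is closer to the user and already holds the data, even though it uses more energy per CPU. A learned scheduler could replace the fixed cost function with a model trained on monitoring data.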
HLRS, working alongside its partners GWDG and KTH, is providing infrastructure for DECICE, as well as its expertise in cloud computing, high-performance computing, and HPC system operation. Specifically, HLRS is leading a work package focused on cloud management framework integration, which includes the coordination of monitoring, HPC services, AI training, and inference workflows.
1 December 2022 - 30 November 2025
decice.eu
Artificial Intelligence & Data Analytics
Cloud Computing
Optimization & Scalability
Converged Computing
EC Horizon Europe
Head, Converged Computing