The challenge
Without a guiding, policy-compliant system, oversized resources, “zombie” and idle instances, and hidden bottlenecks can push utilization toward 100 percent. The Fraunhofer Edge Cloud (FEC) is used institution-wide on a self-service basis as a demonstrator in production use. It serves as an example system that companies can use to test and understand how edge and cloud platforms are set up and operated in practice. Because users bring varying levels of prior knowledge, virtual machines end up oversized, limits and quotas are set incorrectly, and resources either keep running unused or remain available without doing anything. We refer to resources that keep running unused as “zombie” instances and to resources that are available but currently inactive as idle instances. In addition, there are bottlenecks that are difficult to identify in day-to-day operation. At Fraunhofer IPT, utilization is sometimes close to 100 percent, which creates capacity bottlenecks, deadline risks, and high operating costs. What is needed, therefore, is a transparent and policy-compliant system that guides users, supports governance, FinOps, and sustainability, puts decisions on a sound methodical footing, and makes the complexity of OpenStack manageable. By a sound methodical footing, we mean operations research, i.e., the targeted use of mathematical optimization to prepare robust decisions across multiple objectives and constraints.
Our contribution
ARRC provides prioritized rightsizing and shutdown recommendations based on explainable AI, embedded in guidelines and rolled out automatically with GitOps. We adapt ARRC to the OpenStack environment of the Fraunhofer Edge Cloud and integrate historical and current monitoring data. From this data, ARRC generates easy-to-understand, prioritized recommendations for sizing resources correctly and for shutting down unused workloads. We mirror the recommendations as issues in GitLab or Jira so that teams can review them directly. Approved changes are then rolled out automatically via GitOps. Explainable artificial intelligence shows which characteristics influenced each recommendation. At the same time, clear guidelines ensure that limits are adhered to. These limits include service level agreements, budget requirements, and security requirements. Operations research uses this information to create action plans that comply with capacity and policy requirements. It applies integer and multi-objective optimization subject to fixed constraints. The result is a plan that makes efficient use of capacity and enables a sensibly balanced level of oversubscription.
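The planning step described above can be illustrated with a small sketch. It is not ARRC's actual model: the `Action` fields, the `risk` score, the weights, and the candidate names are all hypothetical, and the two objectives (freed vCPUs and freed RAM) are combined with a simple weighted sum. The binary decision per action is solved by exhaustive search, which only works for a handful of candidates; a real setup would presumably hand the same formulation to an integer programming solver.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Action:
    """One candidate recommendation (all fields are illustrative assumptions)."""
    name: str
    freed_vcpus: int     # vCPUs released if the action is applied
    freed_ram_gb: int    # RAM in GB released if the action is applied
    risk: int            # hypothetical disruption score, higher is riskier
    sla_protected: bool  # policy flag: protected workloads must not be touched


def plan(actions, risk_budget, w_cpu=1.0, w_ram=0.25):
    """Select the subset of actions with the highest weighted gain.

    Fixed constraints: no SLA-protected workload is touched, and the summed
    risk of the selected actions must not exceed risk_budget.
    """
    allowed = [a for a in actions if not a.sla_protected]
    best, best_score = (), 0.0
    # Enumerate every non-empty subset of the allowed actions.
    for k in range(1, len(allowed) + 1):
        for subset in combinations(allowed, k):
            if sum(a.risk for a in subset) > risk_budget:
                continue  # violates the risk constraint
            score = sum(
                w_cpu * a.freed_vcpus + w_ram * a.freed_ram_gb for a in subset
            )
            if score > best_score:
                best, best_score = subset, score
    return list(best), best_score


# Made-up candidates, for illustration only.
candidates = [
    Action("shutdown zombie-vm-1", 8, 32, 1, False),
    Action("downsize build-runner", 4, 8, 2, False),
    Action("shutdown db-primary", 16, 64, 1, True),
    Action("downsize notebook-vm", 6, 16, 3, False),
]

selected, score = plan(candidates, risk_budget=4)
for action in selected:
    print(action.name)
```

With these made-up values, the plan keeps the SLA-protected database untouched and selects the zombie shutdown together with the notebook downsizing, because that pair frees the most capacity within the risk budget. Changing the weights `w_cpu` and `w_ram` shifts the balance between the two objectives.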
The result
ARRC demonstrably increases available resources, reduces costs and energy consumption, and creates capacity for new projects. In the proof of concept at Fraunhofer IPT, i.e., a practical feasibility study, we significantly increased available resources by freeing up unused capacity and through targeted rightsizing. Up to 363 percent additional CPUs and up to 336 percent additional RAM became available. ARRC has reached Technology Readiness Level 6, which means that the technology has been successfully demonstrated in a relevant environment. The solution simplifies operational management, reduces zombie and idle resources, and, through its production use as a reference system, strengthens the transfer to standard enterprise edge and cloud platforms.
The partners
- Fraunhofer IPT
Contact: Dr.-Ing. Mario Pothen, Business Unit Digitalization and Networking