Multi-cluster resource usage optimization by dynamic compute node borrowing
Resource management frameworks such as Kubernetes provide an abstracted view of resources and can manage workload execution on these resources. Workloads can be web applications as well as data processing jobs. A scheduler takes care of assigning workloads to compute nodes. The resource management framework also takes care of decommissioning nodes that fail (e.g. due to defective hardware), integrating new compute nodes, and restarting failed workloads.
A resource management framework is a critical piece of infrastructure. Incidents such as a failure of the framework or a security breach of its administration interface can cause severe problems for the workloads running on it. One approach to lowering this risk and limiting the impact is to set up multiple clusters for different organizational units and environments, e.g. production and testing. However, the downside of this approach is that resource usage may no longer be optimal. For example, a cluster A may be fully utilized and in need of additional compute nodes while a cluster B has free compute nodes (see figure).
In such a multi-cluster setup, resource usage can be optimized while maintaining cluster separation by allowing currently unused compute nodes to be borrowed by other clusters. This requires a controlling entity that
- recognizes whether a cluster would benefit from additional compute nodes;
- can determine whether there are free compute nodes in other clusters;
- can move compute nodes between clusters; and
- can return borrowed compute nodes to their home cluster when they are needed there.
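The four responsibilities listed above can be illustrated with a minimal, in-memory Python sketch of one controller pass. It does not talk to Kubernetes at all, and the names `Node`, `Cluster`, `pending`, `shortfall`, `surplus` and `rebalance` are hypothetical. The policy is likewise an assumption chosen for illustration, not the concept the thesis is meant to develop: a cluster is taken to benefit from extra nodes when its waiting workloads outnumber its free nodes, only idle nodes are moved, and home clusters get priority when reclaiming their nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    home: str           # cluster that owns the node
    busy: bool = False  # True while workloads are running on the node


@dataclass
class Cluster:
    name: str
    nodes: list[Node] = field(default_factory=list)
    pending: int = 0    # hypothetical metric: workloads waiting for a free node

    def free_nodes(self) -> list[Node]:
        return [n for n in self.nodes if not n.busy]

    def shortfall(self) -> int:
        # Responsibility 1: how many extra free nodes the cluster would benefit from.
        return max(0, self.pending - len(self.free_nodes()))

    def surplus(self) -> int:
        # Responsibility 2: how many free nodes the cluster could spare right now.
        return max(0, len(self.free_nodes()) - self.pending)


def rebalance(clusters: dict[str, Cluster]) -> list[tuple[str, str, str]]:
    """One pass of a hypothetical controller; returns (node, from, to) moves."""
    moves = []

    # Responsibility 4: return idle borrowed nodes whose home cluster needs them.
    # Busy borrowed nodes are skipped; a real controller would drain them first.
    for cluster in clusters.values():
        for node in list(cluster.nodes):
            home = clusters[node.home]
            if node.home != cluster.name and not node.busy and home.shortfall() > 0:
                cluster.nodes.remove(node)
                home.nodes.append(node)
                moves.append((node.name, cluster.name, home.name))

    # Responsibility 3: lend spare free nodes to clusters that need capacity.
    # Only nodes currently in their home cluster are lent (no re-lending).
    for borrower in clusters.values():
        for lender in clusters.values():
            if lender is borrower:
                continue
            lendable = [n for n in lender.free_nodes() if n.home == lender.name]
            count = min(borrower.shortfall(), lender.surplus(), len(lendable))
            for node in lendable[:count]:
                lender.nodes.remove(node)
                borrower.nodes.append(node)
                moves.append((node.name, lender.name, borrower.name))

    return moves


# Usage: cluster A is fully utilized, cluster B has three free nodes.
a = Cluster("A", [Node("a1", "A", busy=True), Node("a2", "A", busy=True)], pending=2)
b = Cluster("B", [Node("b1", "B"), Node("b2", "B"), Node("b3", "B")])
clusters = {"A": a, "B": b}
print(rebalance(clusters))  # [('b1', 'B', 'A'), ('b2', 'B', 'A')]
b.pending = 2               # B now needs capacity again
print(rebalance(clusters))  # [('b1', 'A', 'B')]
```

Giving the home cluster priority in the first step reflects the requirement that borrowed nodes be returned when they are needed there; skipping busy nodes avoids interrupting running workloads at the cost of slower reclamation.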
In this bachelor thesis, a concept for such a controlling entity is to be developed and prototypically implemented. The prototype is to be implemented on the basis of Kubernetes clusters.