AI4IO: a suite of AI-based tools for IO-aware HPC resource management

Date
2020
Publisher
University of Delaware
Abstract
Users submit their simulations to High Performance Computing (HPC) clusters through batch systems, which allocate cluster resources to user jobs. While some resource managers and job schedulers, such as Slurm, have a generalized resource model, they end up monitoring and managing only computing resources (i.e., nodes) in nearly all modern HPC systems. Other resources, such as parallel file systems, are also important to job execution, but resource managers and job schedulers remain blind to their impact on overall cluster utilization and job performance. For example, contention for IO resources increases job runtime and delays execution. Furthermore, we observe a growing gap between compute power and IO bandwidth, meaning that the rate of data production for IO-intensive applications outpaces the bandwidth to file systems. These problems can be addressed with IO-aware schedulers. Unfortunately, schedulers lack automatic, scalable, and general tools that enable IO-awareness by generating knowledge that the schedulers can leverage to prevent and mitigate IO contention while dealing with IO bandwidth constraints.

To address these problems, in this thesis we propose AI4IO, a suite of Artificial Intelligence (AI) based tools that enable resource awareness on HPC systems. AI4IO consists of two tools: PRIONN and CanarIO. PRIONN automates predictions about user-submitted job resource usage; CanarIO detects, in real time, the presence of IO contention on HPC systems and predicts which jobs are affected by that contention. By working in concert, the AI4IO tools predict the a priori knowledge necessary to prevent and mitigate IO contention with IO-aware scheduling. We leverage the Flux simulator to implement a realistic simulation of an HPC environment and integrate AI4IO into the Flux simulation. We first evaluate PRIONN and CanarIO separately and show that they improve performance through the prevention and mitigation of IO contention. We then use the two AI4IO tools in concert to produce greater improvements in performance: we observe up to 6.2% improvement in makespan of real HPC job workloads, which amounts to more than 18,000 node-hours saved per week on a production-size cluster.
Keywords
High Performance Computing clusters, Job scheduling, Resource management, HPC systems, Job productivity, IO-awareness