AI4IO: a suite of AI-based tools for IO-aware HPC resource management

Author(s): Wyatt, Michael R., II
Date Accessioned: 2021-01-20T13:19:28Z
Date Available: 2021-01-20T13:19:28Z
Publication Date: 2020
SWORD Update: 2020-09-06T16:04:41Z
Abstract: Users submit their simulations to High Performance Computing (HPC) clusters through batch systems, which allocate cluster resources to user jobs. While some resource managers and job schedulers, such as Slurm, have a generalized resource model, they end up monitoring and managing only computing resources (i.e., nodes) in nearly all modern HPC systems. Other resources, such as parallel file systems, are also important to job execution, but resource managers and job schedulers remain blind to their impact on overall cluster utilization and job performance. For example, contention for IO resources increases job runtime and delays execution. Furthermore, we observe a growing gap between compute power and IO bandwidth, meaning that the bandwidth to file systems is outpaced by the rate of data production for IO-intensive applications. These problems can be addressed with IO-aware schedulers. Unfortunately, schedulers lack automatic, scalable, and general tools that support and enable IO-awareness by generating knowledge that the schedulers can leverage to prevent and mitigate IO contention while dealing with IO bandwidth constraints.
To address these problems, in this thesis we propose AI4IO, a suite of Artificial Intelligence (AI) based tools that enable resource awareness on HPC systems. AI4IO consists of two tools: PRIONN and CanarIO. PRIONN automates predictions about user-submitted job resource usage; CanarIO detects, in real time, the presence of IO contention on HPC systems and predicts which jobs are affected by that contention. By working in concert, the AI4IO tools predict the a priori knowledge necessary to prevent and mitigate IO contention with IO-aware scheduling. We leverage the Flux simulator to implement a realistic simulation of an HPC environment and integrate AI4IO in the Flux simulation. We first evaluate PRIONN and CanarIO separately and show that they improve performance with the prevention and mitigation of IO contention. We then use the two AI4IO tools in concert to produce greater improvements in performance: we observe up to 6.2% improvement in makespan of real HPC job workloads, which amounts to more than 18,000 node-hours saved per week on a production-size cluster.
Advisor: Taufer, Michela
Degree: Ph.D.
Department: University of Delaware, Department of Computer and Information Sciences
DOI: https://doi.org/10.58088/dmz2-jj63
Unique Identifier: 1232078449
URL: https://udspace.udel.edu/handle/19716/28507
Language: en
Publisher: University of Delaware
URI: https://login.udel.idm.oclc.org/login?url=https://www.proquest.com/dissertations-theses/ai4io-suite-ai-based-tools-io-aware-hpc-resource/docview/2454399318/se-2?accountid=10457
Keywords: High Performance Computing clusters; Job scheduling; Resource management; HPC systems; Job productivity; IO-awareness
Title: AI4IO: a suite of AI-based tools for IO-aware HPC resource management
Type: Thesis
Files
Original bundle
Name: Wyatt_udel_0060D_14235.pdf
Size: 2.07 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 2.22 KB
Format: Item-specific license agreed upon to submission