Optimizing and scaling machine learning models for scientific applications on exascale supercomputers
Date
2025
Authors
Publisher
University of Delaware
Abstract
As software and hardware advance concurrently, it is important to design and build ML frameworks that are hardware-architecture agnostic. More specifically, as accelerators for ML workflows become more prevalent, high-level code that can run across such accelerators would be highly beneficial, reducing the need to rewrite code and libraries for each hardware platform. At the same time, advances in machine learning (ML) methods have enabled the extraction of meaningful information from large and complex datasets, assisting in better understanding, diagnosing, and treating illnesses such as cancer. This applies to applications beyond oncological drug response and drug discovery, including understanding complex plasma physics phenomena.

This thesis focuses on designing and building scalable and portable machine learning-based workflows while adapting them to new hardware architectures. The thesis also scales and improves the performance of surrogate models that reduce the number of scientific simulations needed to extract insights, a capability that becomes increasingly necessary as scientific challenges grow in computational complexity. We demonstrate these ideas using two case studies.

An improved drug discovery pipeline is designed for shorter development timelines through model enhancement and scaling on new hardware capabilities. The thesis demonstrates the limitations of existing neural network-based drug response models and investigates gradient-boosted tree-based methods as viable alternatives to convolutional neural networks (CNNs). These gaps are addressed by designing and building software that assesses the variation in performance across each class of models and improves their accessibility for domain experts. Current approaches rely on RNA sequence-based gene expression values of cell lines, 2D molecular drug descriptors, and drug response data to predict cell growth. To overcome the challenges of existing 2D molecular datasets, the next part of the thesis focuses on improving the performance of ML techniques that act as surrogates for molecular docking of 3D molecular drug descriptors, estimating protein-ligand binding poses to reduce the subsequent molecular dynamics simulations needed in drug discovery workflows. In addition to hyperparameter optimization (HPO) and model tuning, scaling the training of such models will greatly improve the throughput of lead compound discovery.

Scaling such ML workflows on new hardware architectures such as AMD GPUs is challenging. The thesis further explores this scaling aspect through another case study involving in-transit ML on plasma physics simulations to uncover correlations between emitted radiation and particle dynamics within the simulation. The ML surrogate model employs online learning on data streamed from the simulation and scales up to 400 GPUs.

To summarize, this thesis introduces novel software frameworks and workflows that advance the state of the art for case studies involving drug discovery models for cancer research as well as plasma physics simulations, through model enhancement and distributed scaling on large supercomputers.
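
The following is a minimal, illustrative sketch of the kind of gradient-boosted tree-based drug response model the abstract contrasts with CNN-based approaches. The feature counts, shapes, and the use of scikit-learn are assumptions for illustration only and are not details taken from the thesis.

```python
# Hypothetical sketch: gradient-boosted trees on tabular (cell line, drug) features.
# All dimensions and data are placeholders, not the thesis datasets.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_pairs = 2000          # assumed number of (cell line, drug) pairs
n_genes = 900           # assumed RNA-seq gene expression features per cell line
n_descriptors = 200     # assumed 2D molecular descriptors per drug

# Each sample concatenates a cell line's expression profile with a drug's descriptors.
gene_expression = rng.normal(size=(n_pairs, n_genes))
drug_descriptors = rng.normal(size=(n_pairs, n_descriptors))
X = np.hstack([gene_expression, drug_descriptors])
y = rng.normal(size=n_pairs)   # placeholder cell-growth / drug response values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Tree ensembles consume wide tabular inputs directly, without the image-like
# feature reshaping that CNN-based response models typically require.
model = HistGradientBoostingRegressor(max_iter=300, learning_rate=0.05, max_depth=6)
model.fit(X_train, y_train)
print("R^2 on held-out pairs:", r2_score(y_test, model.predict(X_test)))
```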
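The next sketch illustrates online (incremental) learning on batches streamed from a running simulation, as a single-process stand-in for the in-transit surrogate training mentioned in the abstract. The stream contents, feature meanings, and the choice of SGDRegressor are assumptions; the actual thesis workflow trains a distributed surrogate across hundreds of GPUs.

```python
# Hypothetical sketch: a surrogate learning incrementally from simulation batches
# as they stream past, rather than from data written to disk.
import numpy as np
from sklearn.linear_model import SGDRegressor

def simulation_stream(n_steps, batch_size, n_features, seed=0):
    """Yield (features, targets) batches, mimicking particle data arriving in transit."""
    rng = np.random.default_rng(seed)
    true_w = rng.normal(size=n_features)
    for _ in range(n_steps):
        particle_state = rng.normal(size=(batch_size, n_features))        # stand-in for particle dynamics
        emitted_radiation = particle_state @ true_w + 0.1 * rng.normal(size=batch_size)
        yield particle_state, emitted_radiation

surrogate = SGDRegressor(learning_rate="constant", eta0=1e-3)

# Each batch is seen once as it arrives, so the surrogate learns online without
# storing the full simulation output.
for step, (X_batch, y_batch) in enumerate(simulation_stream(200, 256, 32)):
    surrogate.partial_fit(X_batch, y_batch)
    if step % 50 == 0:
        mse = np.mean((surrogate.predict(X_batch) - y_batch) ** 2)
        print(f"step {step}: in-batch MSE = {mse:.4f}")
```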
Description
Keywords
Machine learning, Hardware architectures, Supercomputers, Hyperparameter optimization