Modeling non-determinism of scientific applications
Publisher
University of Delaware
Abstract
As the scientific community prepares to deploy an increasingly complex and diverse set of applications on upcoming exascale platforms, the need for methods to assess reproducibility of simulations and identify the root causes of reproducibility failures increases correspondingly. One of the greatest challenges facing reproducibility efforts at exascale is unavoidable application-level non-determinism at the level of inter-process communication. While often necessary to boost performance, use of non-deterministic communication constructs can hamper reproducibility due to the interaction between communication non-determinism and floating-point non-associativity.
In this thesis we address the challenge of non-determinism in scientific applications along three strategic directions. First, we assess the landscape of existing tooling and infrastructure for managing non-determinism via record-and-replay, and in doing so produce evidence suggesting the need for record-and-replay to adapt to communication patterns of non-deterministic applications at exascale. Second, we assess the landscape of techniques for alleviating non-determinism’s detrimental effects on numerical reproducibility, and in doing so provide an experimental framework for efficiently compensating for non-determinism based on characteristics of an application’s floating-point data. Third, we propose and develop a methodology for modeling communication non-determinism. Our methodology models parallel executions as directed graphs and leverages graph kernels to quantify and characterize run-to-run variations in inter-process communication. To validate our methodology, we present empirical studies showing the utility of graph kernel similarity for quantifying the degree of non-determinism present in representative communication patterns. To test the effectiveness of our approach, we present a study on a representative adaptive mesh refinement application demonstrating that our methodology can link runtime manifestations of communication non-determinism to their root causes in source code, and thus alleviate the burden on computational scientists of tracking down potential sources of reproducibility failures in complex code bases.
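The interaction between communication non-determinism and floating-point non-associativity mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the thesis: the function name `reduce_in_arrival_order`, the list `contributions`, and the idea of modeling message arrival as a permutation of sender ranks are all assumptions made here for illustration. The sketch accumulates the same four values in every possible arrival order and collects the distinct sums that result.

```python
import itertools
import math

def reduce_in_arrival_order(contributions, arrival_order):
    """Sum per-process contributions in the order their messages arrive.

    Hypothetical stand-in for a receiver that accumulates values as
    messages come in from any sender, in whatever order they arrive.
    """
    total = 0.0
    for rank in arrival_order:
        total += contributions[rank]
    return total

# Illustrative values only: large magnitudes that cancel, plus small terms
# that can be lost to rounding depending on when they are added.
contributions = [1e16, 1.0, -1e16, 1.0]

# Every permutation of sender ranks models one possible arrival order.
results = {
    reduce_in_arrival_order(contributions, order)
    for order in itertools.permutations(range(len(contributions)))
}

# math.fsum returns a correctly rounded sum, so it does not depend on order.
reference = math.fsum(contributions)

print("distinct sums across arrival orders:", sorted(results))
print("order-independent reference sum:", reference)
```

More than one distinct sum appears even though the inputs never change, which is the sense in which a non-deterministic message order can turn into a numerical reproducibility failure. The `math.fsum` call is included only as an order-independent point of comparison, not as the compensation technique studied in the thesis.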
