Modeling non-determinism of scientific applications

Chapp, Dylan

Files

Chapp_udel_0060D_14232.pdf (13.11 MB)

Date

2020

Authors

Chapp, Dylan

Publisher

University of Delaware

Abstract

As the scientific community prepares to deploy an increasingly complex and diverse set of applications on upcoming exascale platforms, the need for methods to assess reproducibility of simulations and identify the root causes of reproducibility failures increases correspondingly. One of the greatest challenges facing reproducibility efforts at exascale is unavoidable application-level non-determinism at the level of inter-process communication. While often necessary to boost performance, use of non-deterministic communication constructs can hamper reproducibility due to the interaction between communication non-determinism and floating-point non-associativity.

In this thesis we address the challenge of non-determinism in scientific applications along three strategic directions. First, we assess the landscape of existing tooling and infrastructure for managing non-determinism via record-and-replay, and in doing so produce evidence suggesting the need for record-and-replay to adapt to communication patterns of non-deterministic applications at exascale. Second, we assess the landscape of techniques for alleviating non-determinism’s detrimental effects on numerical reproducibility, and in so doing provide an experimental framework for efficiently compensating for non-determinism based on characteristics of an application’s floating-point data. Third, we propose and develop a methodology for modeling communication non-determinism. Our methodology models parallel executions as directed graphs and leverages graph kernels to quantify and characterize run-to-run variations in inter-process communication. To validate our methodology, we present empirical studies showing the utility of graph kernel similarity for quantifying the degree of non-determinism present in representative communication patterns. To test the effectiveness of our approach, we present a study on a representative adaptive mesh refinement application demonstrating that our methodology can link runtime manifestations of communication non-determinism to their root causes in source code, and thus alleviate the burden on computational scientists of tracking down potential sources of reproducibility failures in complex code bases.
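The interaction between communication non-determinism and floating-point non-associativity mentioned in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function `reduce_in_arrival_order`, the `contributions` values, and the two arrival orders are hypothetical, chosen only to show that accumulating the same partial sums in a different message-arrival order can change the result.

```python
from itertools import permutations


def reduce_in_arrival_order(contributions, arrival_order):
    """Accumulate per-process partial sums in the order their messages arrive."""
    total = 0.0
    for rank in arrival_order:
        total += contributions[rank]
    return total


# Hypothetical partial sums sent by four processes to a root process.
contributions = [1e16, 1.0, -1e16, 1.0]

# Two runs of the "same" reduction that differ only in message arrival order.
run_a = reduce_in_arrival_order(contributions, [0, 1, 2, 3])
run_b = reduce_in_arrival_order(contributions, [1, 3, 0, 2])

# Every possible arrival order, collapsed to the set of distinct sums produced.
distinct_results = {
    reduce_in_arrival_order(contributions, order)
    for order in permutations(range(len(contributions)))
}

print(run_a, run_b, sorted(distinct_results))
```

In exact arithmetic every arrival order would yield the same sum. In double precision the small terms can be absorbed when they are added to a large intermediate value, so the two runs disagree even though the inputs are identical.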

Keywords

Graph kernels, Graph similarity, High performance computing, Non-determinism

URI

https://udspace.udel.edu/handle/19716/27969

Collections

Doctoral Dissertations (Winter 2014 to Present)
