Addressing challenges in computational catalysis using interpretable machine learning and software development

Date
2023
Publisher
University of Delaware
Abstract
Heterogeneous catalysis has made significant contributions to industrial processes and is poised to play an important role in the transition to a circular and greener economy that relies on chemicals and processes based on sustainable feedstock alternatives to petroleum. However, significant challenges and opportunities related to uncertainty, automation and discovery remain. A recent and exciting development has been the adoption of machine learning (ML) models in different domains of science. ML models learn from data and have the potential to unlock new insights in catalysis and to be pivotal in overcoming some of its challenges. However, the use of ML models has important limitations related to interpretability, data availability and accessibility.
In this thesis, we develop ML models to address some of the challenges in computational catalysis, while avoiding some of the pitfalls of such approaches. We introduce the thesis with a broad discussion of the current state of ML applied to core challenges in catalysis and the untapped potential in these areas. In Chapter 2, we develop a linear and interpretable model to fuse thermochemical quantities of interest (QoI) from different fidelities of density functional theory (DFT) with chemical accuracy. We show that subgraph frequencies of molecular graphs, more commonly referred to as group additivity, provide a natural framework for such a task. In Chapter 3, we use the framework of subgraph frequencies to predict the error in the enthalpies of formation of 2000 molecules calculated using two different functionals commonly used in heterogeneous catalysis. We do this by building a database of enthalpies from the NIST database and comparing them against the calculated values. Our model reduces the error in these values by an order of magnitude. Having a linear model with interpretable features enables us to reason about the limitations of the model and gives an intuitive indication of what the model parameters mean.
In Chapter 4, we develop AIMSim, an open-source software package for carrying out similarity analyses. Complex tasks such as clustering, outlier analysis, similarity quantification, and database querying are performed through highly accessible interfaces. We also develop and implement automatic selection of the similarity metric and fingerprint, based on optimizing a statistical measure of association between molecules. AIMSim also features an intuitive Graphical User Interface (GUI) for code-free analysis, making it accessible to a wide base of researchers.
In Chapter 5, we propose Iris, a deep learning model for detecting peaks in vibrational spectra. The model addresses the need for a generalizable, high-throughput way to automate spectral analysis workflows. Although highly complex, the model is trained entirely on synthetic data, overcoming the challenge of collecting and annotating large volumes of experimental data. The model operates on spectra in real time, enabling its use for automated analysis of the large amounts of data typical in contexts such as time-resolved operando studies. To gain an intuitive understanding of the model, we also investigate the effect of training data on its performance.
The focus of this thesis has been on developing and investigating interpretable ML and statistical methods, as well as highly accessible software specifically tuned to challenges in the computational catalysis community.
Keywords
Catalysis, Cheminformatics, Data science, Machine learning, Modeling, Software