Mining Structural Information From Chemical Literature
Publisher
University of Delaware
Abstract
Artificial intelligence (AI) is increasingly being adopted in the chemical research community for materials discovery, process optimization, and laboratory automation. However, most valuable chemical information remains locked in unstructured scientific literature, and converting this knowledge into machine-readable formats has become a critical bottleneck for accelerating scientific progress. This thesis presents novel approaches for mining structural information from chemical literature.

Our first work focuses on the catalysis domain, particularly plastic upcycling research. We develop a comprehensive named entity recognition (NER) system that extracts six types of catalysis-related entities: catalysts, reactions, reactants, products, characterization methods, and treatments. Our span-based model achieves approximately 90% extraction accuracy and supports downstream applications, including an entity-aware search engine and an entity correlation analysis system.

To address the challenge of rapidly evolving chemical subdomains, our second work introduces the Simple Span-based Prototypical (SSP) model for few-shot NER. This model combines metric learning with chemical-specific adaptations, enabling effective entity extraction with as few as 5-10 examples per entity type while maintaining computational efficiency. Furthermore, we demonstrate that by leveraging Large Language Models (LLMs) as knowledge sources and denoising their annotations, our model stays within 5% of the performance of systems trained on human annotations while significantly outperforming standard few-shot LLM baselines. This provides a practical pathway for rapid domain adaptation without extensive manual annotation.

Finally, we extend beyond traditional extraction methods to address the complex challenge of transforming literature-reported synthesis protocols into robot instructions for laboratory automation. Our two-stage LLM-based generation pipeline first produces Standard Operating Procedures (SOPs) that resolve cross-references and expand implicit instructions, then transforms these into Machine-Actionable Protocols (MAPs). Trained and evaluated on a human-annotated corpus of 250 catalyst synthesis protocols, our approach achieves 46% end-to-end accuracy. We further conduct preliminary experiments suggesting that reasoning-capable LLMs could serve as flexible post-processing adapters, correcting extraction errors and enhancing protocols with platform-specific requirements, all without pipeline retraining.

This thesis presents a series of studies aimed at mining structural data from chemical literature. We hope our work will accelerate AI-powered chemical discovery by creating the essential bridge between decades of accumulated scientific knowledge and the machine-readable data required for downstream AI applications.
Description
"At the request of the author or degree granting institution, this graduate work is not available to view or purchase until July 04 2026."--ProQuest abstract/details page.
