Author: Ren, Jia
Citable URI: https://udspace.udel.edu/handle/19716/28158
Advisor: Wu, Cathy H.
Department: University of Delaware, Center for Bioinformatics and Computational Biology
Publisher: University of Delaware
Date Issued: 2020
Abstract: Numerous text-mining tools have been developed to extract information from biomedical text automatically, and they have assisted in many biological tasks, such as database curation and hypothesis generation. However, three main challenges remain in using text-mining tools for large-scale automatic information extraction for knowledge discovery and curation. First, text-mining tools usually differ from one another in programming language, system dependencies, and input/output format, so substantial engineering effort is required to run them in a single large-scale data processing framework and consolidate their results. Second, text-mining results unavoidably contain errors, which hinders their use for knowledge discovery and fast curation. Third, text-mining results are usually disseminated in a venue (e.g., Europe PMC) other than the one where the documents were originally published, making it difficult for users to quickly obtain the information while reading the papers.

In this dissertation, we describe our efforts to address these three challenges. First, we develop the iTextMine system, an automated workflow that runs multiple text-mining tools on large-scale text for knowledge extraction. We run dockerized text-mining tools in parallel with a standardized JSON output format and implement a text alignment algorithm to resolve text discrepancies during result integration. Currently, iTextMine consists of four relation extraction tools and has processed all Medline abstracts and PMC open access full-length articles.

To remove errors and improve result quality, we further develop several post-processing modules to filter, evaluate, and aggregate the extracted relations. We integrate several tools to label negation, hedging, and citation in a sentence, and mark the relations affected by these phenomena. A confidence module based on state-of-the-art deep learning methods assigns confidence scores to relations extracted by rule-based text-mining tools, and we compare several models on how well calibrated their confidence scores are. These add-on steps produce higher-quality annotations and allow them to be ranked by confidence to facilitate curation.

Finally, we explore a popular web annotation system to disseminate iTextMine results and broaden the curator community. We demonstrate how to submit the annotations to a publisher's website at the post-publication stage, and we showcase how our existing pipeline can be modified to annotate pre-publication biomedical text.

The iTextMine website and its data APIs are available at http://research.bioinformatics.udel.edu/itextmine
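
The abstract mentions a text alignment algorithm for reconciling text discrepancies between tools but does not describe it. As a rough illustration of the general idea only, the sketch below remaps an annotation's character offsets from a tool's normalized copy of a passage back to the original text. It uses Python's standard `difflib.SequenceMatcher` as a stand-in technique; this is an assumption, not the dissertation's method, and the function names `build_offset_map` and `remap_span` are hypothetical.

```python
from difflib import SequenceMatcher


def build_offset_map(tool_text, original_text):
    """Map each character index in tool_text to an index in original_text.

    Characters with no aligned counterpart map to None.
    Illustrative stand-in only; not the iTextMine alignment algorithm.
    """
    # autojunk=False disables the popularity heuristic that would otherwise
    # ignore frequent characters in sequences of 200 or more items.
    matcher = SequenceMatcher(None, tool_text, original_text, autojunk=False)
    mapping = [None] * len(tool_text)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            mapping[block.a + k] = block.b + k
    return mapping


def remap_span(start, end, mapping):
    """Translate a half-open [start, end) span from tool_text to original_text.

    Returns None when no character inside the span could be aligned.
    """
    aligned = [mapping[i] for i in range(start, end) if mapping[i] is not None]
    if not aligned:
        return None
    return aligned[0], aligned[-1] + 1
```

For example, if a tool collapsed a double space so that it saw "IL-2 activates STAT5." while the source reads "IL-2  activates STAT5.", the tool's span for "STAT5" can be passed through `remap_span` to recover where that entity sits in the source string. Character-level matching of this kind scales poorly to very long documents, so a large-scale pipeline would likely need a different or chunked approach.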
File: Ren_udel_0060D_14048.pdf | Size: 2.390Mb | Format: PDF