Building a predictive modeling system for sentence classification: a case study using tardive dyskinesia

Bi, Xia

Files

XiaBi_Thesis.pdf (895.39 KB)

Date

2012

Authors

Bi, Xia

Publisher

University of Delaware

Abstract

Advances in computational and biological methods have greatly accelerated the pace of scientific discovery and produced a tremendous amount of experimental and computational data in the biomedical domain. Given the wealth of information that is available in both scientific papers and electronic databases, one particular challenge in biomedicine is to detect disease-drug associations and to organize them in a meaningful way that will accelerate pharmacogenetic research. Several text mining tools have been developed for this purpose. They perform adequately in identifying facts and entities using on-the-fly searches of scientific articles from many different databases; however, they cannot analyze the type of relationship that exists between the objects identified. In this thesis, we propose a novel method to analyze drug-disease relationships using a combination of in-house and open-source tools that exploit the Multinomial Naïve Bayes (MNB) modeling technique. The main motivation behind this thesis work is to help researchers quickly identify disease-drug relationships in the biomedical literature, using tardive dyskinesia (TD) as a case study, and to classify those relationships into specific categories to enable a better understanding of various drug effects. We manually developed and annotated a biomedical training corpus for TD via sentence classification. Using the MNB modeling technique, we generated a learning model and built a predictive classifier system using data preprocessing and filtering algorithms. To assess whether the model would generalize to an independent dataset, we applied 10-fold cross-validation and evaluated the model using precision, recall, F-measure, and ROC area. The precision, recall, and F-measure were approximately 88%, and the ROC area was over 97%.
One particular challenge in sentence classification is the coexistence of contrasting biological observations, which confuses the classification model. To address this ambiguity, we passed the output data to MetaMap to identify and separate distinct biological observations in biomedical text. By further discerning the semantic meaning of biological observations, we classified biomedical sentences into more refined categories, which helped to elucidate various drug effects and served as an initial step toward the sophisticated task of disease-drug relationship extraction.
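
The evaluation described in the abstract (a Multinomial Naïve Bayes sentence classifier scored by 10-fold cross-validation) can be illustrated with a minimal sketch. This is not the thesis's implementation. The whitespace tokenizer, the add-one smoothing, the unstratified interleaved folds, the placeholder drug names, and the two category labels `induces` and `reduces` are all assumptions made for illustration, and the sentences are invented toy data rather than the annotated TD corpus. ROC area is not computed here.

```python
import math
from collections import Counter, defaultdict


def tokenize(sentence):
    """Lowercase whitespace tokenizer (an assumption, not the thesis's preprocessing)."""
    return sentence.lower().split()


def train_mnb(sentences, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with add-alpha smoothing."""
    word_counts = defaultdict(Counter)
    vocab = set()
    for sentence, label in zip(sentences, labels):
        tokens = tokenize(sentence)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    model = {}
    for label, n_docs in Counter(labels).items():
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        log_like = {w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab}
        model[label] = (math.log(n_docs / len(labels)), log_like)
    return model


def predict(model, sentence):
    """Return the label with the highest log posterior; unseen tokens are skipped."""
    tokens = tokenize(sentence)

    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(log_like[t] for t in tokens if t in log_like)

    return max(model, key=score)


def cross_validate(sentences, labels, k=10):
    """Pool gold and predicted labels over k interleaved (unstratified) folds."""
    gold, predicted = [], []
    for fold in range(k):
        test_idx = set(range(fold, len(sentences), k))
        train_x = [s for i, s in enumerate(sentences) if i not in test_idx]
        train_y = [y for i, y in enumerate(labels) if i not in test_idx]
        model = train_mnb(train_x, train_y)
        for i in sorted(test_idx):
            gold.append(labels[i])
            predicted.append(predict(model, sentences[i]))
    return gold, predicted


def precision_recall_f1(gold, predicted, positive):
    """Precision, recall, and F-measure for one class treated as positive."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented toy data with placeholder drug names (hypothetical, not from the thesis corpus).
toy_sentences = (
    [f"drug{i} induced tardive dyskinesia in patients" for i in range(10)]
    + [f"drug{i} reduced tardive dyskinesia symptoms" for i in range(10, 20)]
)
toy_labels = ["induces"] * 10 + ["reduces"] * 10

gold, predicted = cross_validate(toy_sentences, toy_labels, k=10)
print(precision_recall_f1(gold, predicted, "induces"))
```

The toy sentences are trivially separable by design, so the scores this sketch prints say nothing about the figures reported in the abstract. They only show how the pooled cross-validation predictions feed the precision, recall, and F-measure calculations.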

URI

http://udspace.udel.edu/handle/19716/12034

Collections

Master's Theses (Fall 2009 to Present)
