Toward effective biomedical document classification for supporting the biocuration workflow

Date
2020
Publisher
University of Delaware
Abstract
Scientific literature is an important source of knowledge supporting biomedical research. The large and rapidly increasing number of publications makes automated biomedical document classification both useful and essential. Effective biomedical document classifiers are especially needed for biodatabases, such as the Mouse Genome Informatics (MGI) database, FlyBase, and UniProt, because much of the information in these databases is manually collected from publications. This is a slow, labor-intensive process that can benefit from automation.

We propose machine learning methods for biomedical document classification to support the biodatabase curation workflow. We present our work in the context of the Gene Expression Database (GXD) in MGI, which is the largest comprehensive dataset concerning expression information in the mouse. We first develop a simple yet effective classifier that employs statistical feature selection to identify publications relevant to GXD in a large balanced dataset. However, biodatabases are typically highly imbalanced. To address class imbalance, we then present a modified meta-classification framework that employs clustering-based under-sampling along with our feature selection strategies. Notably, the majority of previously proposed biomedical document classifiers use only text extracted from the title and abstract of the publication. However, as our group and several others have noted, images provide substantial information for determining the topics discussed in publications. Accordingly, improving on the method for imbalanced biomedical document classification described above, we introduce a classification scheme that incorporates features gathered from image captions in addition to those obtained from titles and abstracts. Experimental results demonstrate that our proposed classification frameworks effectively address biomedical document classification in support of the biodatabase curation workflow.
Keywords
Biomedical document classification, Biocuration workflow