Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD)
Date
2017-03-24
Publisher
Oxford University Press
Abstract
The Gene Expression Database (GXD) is a comprehensive online database within the
Mouse Genome Informatics resource, aiming to provide available information about endogenous
gene expression during mouse development. The information stems primarily
from many thousands of biomedical publications that database curators must read
through. Given the very large number of biomedical papers published each
year, automatic document classification plays an important role in biomedical research.
Specifically, an effective and efficient document classifier is needed for supporting the
GXD annotation workflow. We present here an effective yet relatively simple classification
scheme, which uses readily available tools while employing feature selection,
aiming to assist curators in identifying publications relevant to GXD. We examine the
performance of our method over a large manually curated dataset, consisting of more
than 25 000 PubMed abstracts, about half of which are curated as relevant to GXD and
the other half as irrelevant. In addition to text from the title and abstract, we also consider
image captions, an important information source that we integrate into our method.
We apply a captions-based classifier to a subset of about 3300 documents, for which the
full text of the curated articles is available. The results demonstrate that our proposed approach
is robust and effectively addresses the GXD document classification task. Moreover,
using information obtained from image captions clearly improves performance, compared
to title and abstract alone, affirming the utility of image captions as a substantial
evidence source for automatically determining the relevance of biomedical publications
to a specific subject area.
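
The general shape of such a pipeline (text features, feature selection, then a binary relevance classifier over title-and-abstract text combined with captions) can be illustrated with a minimal sketch. The abstract does not name the feature-selection criterion or the classifier, so the chi-square ranking and the multinomial Naive Bayes model below are assumptions chosen for illustration, as are all function names and the toy documents; this is not the authors' implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_text(title_abstract, captions=""):
    """Combine title/abstract text with image captions into one token list."""
    return tokenize(title_abstract + " " + captions)


def chi2_scores(docs, labels):
    """Chi-square score of each term's presence against the 0/1 label."""
    n, n_pos = len(docs), sum(labels)
    n_neg = n - n_pos
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (df_pos if y else df_neg).update(set(tokens))
    scores = {}
    for term in set(df_pos) | set(df_neg):
        a, b = df_pos[term], df_neg[term]   # docs containing the term
        c, d = n_pos - a, n_neg - b         # docs lacking the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_features(docs, labels, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    scores = chi2_scores(docs, labels)
    return set(sorted(scores, key=lambda t: (-scores[t], t))[:k])


def train_nb(docs, labels, vocab):
    """Multinomial Naive Bayes with add-one smoothing over the selected vocab."""
    counts = {0: Counter(), 1: Counter()}
    class_docs = Counter(labels)
    for tokens, y in zip(docs, labels):
        counts[y].update(t for t in tokens if t in vocab)
    model = {"vocab": vocab, "prior": {}, "loglik": {}}
    for y in (0, 1):
        total = sum(counts[y].values()) + len(vocab)
        model["prior"][y] = math.log(class_docs[y] / len(labels))
        model["loglik"][y] = {
            t: math.log((counts[y][t] + 1) / total) for t in vocab
        }
    return model


def predict(model, tokens):
    """Return 1 (relevant) or 0 (irrelevant) for a tokenized document."""
    score = {}
    for y in (0, 1):
        score[y] = model["prior"][y] + sum(
            model["loglik"][y][t] for t in tokens if t in model["vocab"]
        )
    return 1 if score[1] > score[0] else 0


# Hypothetical toy corpus: (title/abstract text, caption text, label).
train = [
    ("In situ hybridization shows Pax6 expression in the embryonic mouse eye",
     "Figure 1. Expression of Pax6 at E10.5", 1),
    ("Shh expression pattern during mouse limb development",
     "Figure 2. Whole-mount in situ hybridization of Shh", 1),
    ("A clinical trial of statin therapy in adult patients",
     "Figure 1. Kaplan-Meier survival curves", 0),
    ("Crystal structure of a bacterial membrane transporter",
     "Figure 3. Ribbon diagram of the transporter", 0),
]
docs = [document_text(text, caps) for text, caps, _ in train]
labels = [y for _, _, y in train]

selected = select_features(docs, labels, k=5)
model = train_nb(docs, labels, selected)
```

A new document would be classified with `predict(model, document_text(title_and_abstract, captions))`. Concatenating captions with the title and abstract is only one way to integrate the two sources; training a separate captions-based classifier and combining its output with a text-based one is another, and the abstract does not say which design was used.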
Description
Publisher's PDF
Citation
Jiang, Xiangying, et al. "Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD)." Database 2017.1 (2017).