Browsing by Author "Li, Pengyuan"
Utilizing image and caption information for biomedical document classification
(Bioinformatics, 2021-07-12)
Li, Pengyuan; Jiang, Xiangying; Zhang, Gongbo; Trabucco, Juan Trelles; Raciti, Daniela; Smith, Cynthia; Ringwald, Martin; Marai, G. Elisabeta; Arighi, Cecilia; Shatkay, Hagit

Motivation: Biomedical research findings are typically disseminated through publications. To simplify access to domain-specific knowledge while supporting the research community, several biomedical databases devote significant effort to manual curation of the literature, a labor intensive process. The first step toward biocuration requires identifying articles relevant to the specific area on which the database focuses. Thus, automatically identifying publications relevant to a specific topic within a large volume of publications is an important task toward expediting the biocuration process and, in turn, biomedical research. Current methods focus on textual contents, typically extracted from the title-and-abstract. Notably, images and captions are often used in publications to convey pivotal evidence about processes, experiments and results.

Results: We present a new document classification scheme, using both image and caption information, in addition to titles-and-abstracts. To use the image information, we introduce a new image representation, namely Figure-word, based on class labels of subfigures. We use word embeddings for representing captions and titles-and-abstracts. To utilize all three types of information, we introduce two information integration methods. The first combines Figure-words and textual features obtained from captions and titles-and-abstracts into a single larger vector for document representation; the second employs a meta-classification scheme. Our experiments and results demonstrate the usefulness of the newly proposed Figure-words for representing images.
Moreover, the results showcase the value of Figure-words, captions and titles-and-abstracts in providing complementary information for document classification; these three sources of information, when combined, lead to an overall improved classification performance.

Availability and implementation: Source code and the list of PMIDs of the publications in our datasets are available upon request.

Utilizing image information for biomedical document classification
(University of Delaware, 2021)
Li, Pengyuan

Biomedical research findings are typically disseminated through publications. To simplify access to domain-specific knowledge while supporting the research community, several biomedical databases devote significant effort to manual curation of the literature. The first step in the biocuration process is to identify, within a large volume of publications, the articles relevant to the specific area on which the database is focused, which is a labor-intensive and slow process. Thus, automatically identifying publications that are relevant to a specific topic is one of the fundamental tasks toward expediting the biocuration process and, in turn, biomedical research.

Current methods for categorization of biomedical documents focus on textual contents, typically extracted from the title and the abstract. Notably, images and captions are often used in publications to convey pivotal information about research processes, experiments and results. In this thesis, we explore means for utilizing and integrating image information into biomedical document classification. To do that, we first develop a new and effective system for extracting figures and their captions from biomedical publications. The vast majority of extracted figures are compound images consisting of multiple panels, where each individual panel potentially conveys a different type of information.
To use the image information from each individual panel, we propose an efficient and effective method to separate those compound images into their constituent panels. Last, we introduce a new biomedical document classification scheme that uses information derived from images and captions, in addition to titles-and-abstracts.
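
The two integration methods named in the first abstract (concatenation into a single larger vector, and meta-classification) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the panel class names, the normalised histogram used as a stand-in for Figure-words (the abstracts only say the representation is based on class labels of subfigures), the function names, and the fixed logistic weights, which a real meta-classifier would learn from training data. It is not the authors' implementation.

```python
import math
from collections import Counter

# Hypothetical panel class labels; the abstracts do not list the real label set.
PANEL_CLASSES = ["gel", "graph", "microscopy", "other"]


def figure_word_histogram(panel_labels):
    """Stand-in image representation: share of each known panel class in a document."""
    counts = Counter(panel_labels)
    total = sum(counts[c] for c in PANEL_CLASSES)  # labels outside PANEL_CLASSES are ignored
    if total == 0:
        return [0.0] * len(PANEL_CLASSES)
    return [counts[c] / total for c in PANEL_CLASSES]


def concatenate_features(figure_vec, caption_vec, abstract_vec):
    """Integration by concatenation: one larger vector per document."""
    return list(figure_vec) + list(caption_vec) + list(abstract_vec)


def meta_classify(base_scores, weights, bias=0.0):
    """Integration by meta-classification: combine the relevance scores of
    per-source base classifiers with a logistic function.
    The weights are placeholders; in practice they would be learned."""
    z = bias + sum(w * s for w, s in zip(weights, base_scores))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    fig_vec = figure_word_histogram(["gel", "gel", "graph"])
    doc_vec = concatenate_features(fig_vec, [0.1, 0.2], [0.3, 0.4])
    print(len(doc_vec))  # 4 image dimensions + 2 caption + 2 title-and-abstract
    print(meta_classify([0.9, 0.7, 0.6], [1.0, 1.0, 1.0], bias=-1.0))
```

In the concatenation route a single classifier sees the combined vector, whereas in the meta-classification route each source keeps its own classifier and only their output scores are combined.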