Improving verbose queries performance in bio-medical domain
Date
2018
Publisher
University of Delaware
Abstract
Medical information has drawn increasing attention from both physicians and patients. To help users navigate this ocean of digital information, Information Retrieval (IR) techniques have advanced greatly alongside the development of the Web. Google, for instance, as one of the most successful commercial general web search engines, has greatly reduced the effort required to access information. By analyzing the queries submitted by users, a general web search engine tries to retrieve diverse results that match all query terms, since each query term is considered equally important by the search engine. However, this assumption may not hold for verbose queries, especially those from the bio-medical domain. Queries in the bio-medical domain are more verbose than general web queries for two reasons. On the one hand, the concepts in the bio-medical domain are complex and require more terms to describe. Matching only individual terms instead of the whole concept, i.e., the term partial matching problem, hurts retrieval performance. On the other hand, a query usually specifies more than one aspect. A relevant document should match more of the aspects specified in the query, and the more important ones. Weighting every term equally violates this intuition. Thus, simply treating verbose bio-medical queries in the same way as general web queries leads to suboptimal performance.

In this work, we propose to process verbose queries from two different angles in order to improve retrieval performance, namely concept-based representation and key term identification.

The concept-based representation handles documents and queries as a "bag of concepts", which differs from the traditional "bag of terms" assumption used in general web search. Terms describing the same concept are grouped and converted to a unique concept ID with the help of the metathesaurus in this domain.
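As a rough illustration of the "bag of concepts" idea, the sketch below groups multi-word phrases into concept IDs using greedy longest-match lookup. The lexicon, the concept IDs, and the function name `bag_of_concepts` are hypothetical placeholders chosen for this example; they are not the metathesaurus or the mapping procedure used in the thesis.

```python
from collections import Counter

# Hypothetical toy lexicon: phrase (tuple of tokens) -> concept ID.
# The IDs are placeholders, not real metathesaurus identifiers.
TOY_LEXICON = {
    ("heart", "attack"): "C_0001",
    ("myocardial", "infarction"): "C_0001",
    ("high", "blood", "pressure"): "C_0002",
    ("hypertension",): "C_0002",
}


def bag_of_concepts(text, lexicon=TOY_LEXICON):
    """Map text to a multiset of concept IDs via greedy longest match."""
    tokens = text.lower().split()
    max_len = max(len(phrase) for phrase in lexicon)
    concepts = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in lexicon:
                concepts[lexicon[phrase]] += 1
                i += n
                break
        else:
            # No known concept starts at this token; skip it.
            i += 1
    return concepts
```

Under this toy lexicon, a query containing "heart attack" and a document containing "myocardial infarction" share a concept ID even though they share no terms, which is the effect the concept-based representation is meant to achieve.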
We started the research by directly applying the domain knowledge base to fulfill users' information needs. The results outperform the baseline methods, which indicates that the concept-based representation can successfully overcome the term partial matching problem of the term-based representation. We further investigated how to conduct query expansion in the concept-based representation. Lastly, since the domain knowledge base is maintained by humans, it can suffer from limitations introduced by humans. We therefore also explored how to correct those limitations when applying the concept-based representation with a domain knowledge base. The experimental results significantly outperform the strong baseline in the term-based representation.

Rather than solving the verbose query processing problem only by converting queries and documents into a concept-based representation, we also studied how to select the important terms from verbose queries automatically. Guided by the observation that not every query term is equally important in a verbose query, we formulated this as a classification problem: for each term in the verbose query, we try to identify whether it is helpful for retrieval purposes. We trained a logistic regression classifier with 16 features designed for this domain. The results show that retrieval performance using the selected key terms is comparable to that using human-created simplified queries. In addition, we studied possible ways to select expansion terms for the selected key terms. Since expansion terms selected from external resources suffer from correctness issues, we conducted the expansion within the original document collections using locality information. Experimental results over 5 data collections show that the improvement over the baseline methods is significant.
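The key term identification step can be sketched as a per-term binary classifier. The example below is a minimal pure-Python logistic regression trained by stochastic gradient descent. The two features (a normalized IDF score and a flag for whether the term belongs to a known concept) and the toy training data are assumptions made for illustration only; the abstract does not list the 16 features actually used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, features):
    """Predicted probability that a term is a key term."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(X, y):
            err = score(weights, bias, features) - label
            weights = [w - lr * err * x for w, x in zip(weights, features)]
            bias -= lr * err
    return weights, bias


def select_key_terms(terms, term_features, weights, bias, threshold=0.5):
    """Keep the query terms whose predicted probability meets the threshold."""
    return [t for t, f in zip(terms, term_features)
            if score(weights, bias, f) >= threshold]


# Hypothetical toy training set: [normalized_idf, is_concept_member].
X_TRAIN = [[0.9, 1], [0.8, 1], [0.7, 1], [0.1, 0], [0.2, 0], [0.15, 0]]
Y_TRAIN = [1, 1, 1, 0, 0, 0]  # 1 = key term, 0 = not a key term
```

A usage sketch: after `weights, bias = train_logistic(X_TRAIN, Y_TRAIN)`, calling `select_key_terms` on the terms of a verbose query together with their feature vectors returns the reduced query that would be passed to the retrieval function.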