Beyond lexical towards to better document representations
Date
2021
Publisher
University of Delaware
Abstract
How documents are represented is essential for downstream text applications. In this dissertation, we move beyond purely lexical document representations and seek new directions toward a more comprehensive representation. We explored this problem at different scopes, from a single sentence up to an entire corpus, and from different aspects, from syntax to semantics.
First, we proposed a sentence-level unsupervised framework, called Mani, that computes the semantic similarity between two documents using topological persistence. The framework first represents the semantic relationships between the constituents (phrases and clauses) of the two documents as a graph, then uses topological persistence to extract structural features of this graph, and finally computes the semantic similarity from these features. Mani was tested on predicting human judgments of semantic similarity between pairs of documents. The experimental results show that Mani produces semantic comparison results that are highly consistent with those of human judges.
Next, we proposed an ad hoc retrieval system, called Atlas. In Atlas, the lexical and semantic content of documents are represented separately. Atlas uses a traditional inverted index for lexical matching and a distributed semantic vector space for semantic matching. It produces the final document ranking with a simple but effective linear combination of the two matches, without compromising real-time search. Atlas is typically used as a filter model that retrieves as many query-related documents as possible for downstream reranking models, but it can also be used in an end-to-end retrieval scenario.
Atlas was evaluated on a benchmark ad hoc collection with human-judged relevance assessments. The results show that Atlas improves the recall and MAP of top-k relevant document retrieval at a low cost.
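The linear combination of lexical and semantic matches described for Atlas can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names (`cosine`, `min_max`, `fuse_rankings`), the weight `alpha`, the use of cosine similarity, and the min-max normalization step are all assumptions made for illustration. The lexical scores are assumed to come from some inverted-index ranker.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def min_max(scores):
    """Rescale a {doc_id: score} mapping to [0, 1] so the two signals are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_rankings(lexical_scores, query_vec, doc_vecs, alpha=0.5):
    """Rank documents by alpha * lexical + (1 - alpha) * semantic.

    lexical_scores: hypothetical {doc_id: score} output of an inverted-index ranker.
    doc_vecs: hypothetical {doc_id: vector} from a distributed semantic space.
    A document missing from one signal contributes 0.0 for that signal.
    """
    semantic_scores = {d: cosine(query_vec, v) for d, v in doc_vecs.items()}
    lex = min_max(lexical_scores)
    sem = min_max(semantic_scores)
    candidates = set(lex) | set(sem)
    combined = {
        d: alpha * lex.get(d, 0.0) + (1 - alpha) * sem.get(d, 0.0)
        for d in candidates
    }
    # Sort by descending score; break ties by document id for determinism.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    lexical = {"d1": 2.0, "d2": 1.0}
    vectors = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [1.0, 1.0]}
    for doc_id, score in fuse_rankings(lexical, [1.0, 1.0], vectors, alpha=0.7):
        print(doc_id, round(score, 3))
```

In this sketch, a document such as `d3` that has no lexical match can still enter the candidate set through the semantic signal, which is one way a combined filter could raise recall over lexical matching alone.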
Finally, to explore the potential of semantic-based document representation, we enhanced the distributed semantic vector space by adding spatial structures. This enables us to perform document clustering and topic modeling directly in the vector space. Our approach was evaluated with multiple measures against competitive methods on a popular dataset for text-application experiments. It was also tested on the same ad hoc collection to demonstrate its adaptability and effectiveness on large-scale datasets.
Keywords
Document representations, Document semantic comparison, Information retrieval, Natural language processing, Topic modeling, Topological persistence