Document semantic representation: an algebraic topological approach

Date
2019
Publisher
University of Delaware
Abstract
Document semantics representation, which bridges human-written texts and machine text understanding, is one of the most cutting-edge areas in Natural Language Processing, Information Retrieval, and Computational Linguistics. Geoffrey Hinton envisions the future of machine text understanding as human-level machine reasoning rooted in document semantics representations.

Over several decades, research on document semantics representation has formed two branches: distributional semantics and syntactic structures. Distributional semantics obtains document semantics representations by representing lexical semantics according to the idea that "you shall know a word by the company it keeps", implemented using statistical information on word (co-)occurrences. To date, word embedding techniques, mostly based on neural network models, have dominated this branch. Document semantics representations are then built on word embedding vectors and are therefore also vectorized, which is convenient for downstream applications. On the other hand, document semantics representations rooted in syntactic structures encode syntactic relations between the words of a sentence, such as syntactic dependencies and constituent structures. Such representations are critical in reflecting document semantics, since they capture more sophisticated relations between words than statistics of word (co-)occurrences. Representations in this branch are typically designed as trees or as sets of relations over sentences. Unfortunately, both branches have critical drawbacks: most word embedding methods must confront subtleties in training, while most syntactic-structure-based methods must bridge the gap between sentence-level and document-level semantic representation.

Considering the pros and cons of these two branches, the objective of this dissertation is to design document semantics representations that have vector forms, that utilize syntactic structures, and that do not require training. To address this problem, we introduce a new element, named the generalized phrase, extracted from constituency-based parse trees, as the major ingredient of our document semantics representations. Toward establishing the effectiveness of generalized phrases in reflecting document semantics, we propose a new document semantics comparison method that extracts and utilizes generalized phrases. This method, named DSCTP, is developed from algebraic topology, focusing in particular on persistence. We test DSCTP on both document semantics comparison tasks and hard document clustering tasks. The experimental results show that, on the document semantics comparison tasks, DSCTP provides performance competitive with, and in some cases better than, state-of-the-art methods, and that, on the hard document clustering tasks, DSCTP significantly outperforms state-of-the-art methods. Based on the generalized phrases extracted by DSCTP, we propose two new document semantics representations in vector form that do not require training. One is constructed from a clustering of generalized phrases, and the other is constructed using graph signal processing techniques. We name them the abstract phrase vector and the generalized phrase graph signal, respectively.
The experimental results on the hard document clustering tasks show that both the abstract phrase vector and the generalized phrase graph signal perform as well as, and in some cases outperform, state-of-the-art document semantics representations.
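To make the distributional idea concrete ("you shall know a word by the company it keeps"), the following is a minimal sketch of counting word (co-)occurrences in a symmetric context window. The toy corpus and window size are illustrative assumptions, not the dissertation's setup.

```python
from collections import Counter

# Toy corpus; real distributional models use large corpora. (Illustrative only.)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

window = 2  # symmetric context window; an assumed hyperparameter
cooc = Counter()
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                cooc[(w, words[j])] += 1

# Words that keep similar "company" acquire similar co-occurrence profiles:
# here "cat" and "dog" both co-occur with "sat" and "the".
print(cooc[("cat", "sat")], cooc[("dog", "sat")])
```

Word embedding models can be viewed as compressing such co-occurrence statistics into dense vectors, which is the training step the dissertation seeks to avoid.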
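The abstract does not spell out how generalized phrases are defined, so the sketch below only shows the generic step they build on: reading phrase-level constituents off a constituency-based parse tree, here with NLTK's Tree class. The hand-written parse and the choice to keep every non-pre-terminal constituent are assumptions for illustration.

```python
from nltk import Tree

# A bracketed constituency parse (hand-written here; in practice it would
# come from a constituency parser). The tree and labels are illustrative.
parse = Tree.fromstring(
    "(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
)

# Collect the word span of every non-terminal constituent.
phrases = [
    (subtree.label(), " ".join(subtree.leaves()))
    for subtree in parse.subtrees()
    if subtree.height() > 2  # skip pre-terminal (POS) nodes
]
print(phrases)
# e.g. [('S', 'the cat sat on the mat'), ('NP', 'the cat'),
#       ('VP', 'sat on the mat'), ('PP', 'on the mat'), ('NP', 'the mat')]
```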
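DSCTP's precise construction is likewise not given here. As background for its persistence component, here is a minimal sketch of the standard persistent homology workflow using the third-party ripser and persim packages: treat each document as a point cloud (random points stand in for phrase representations), compute persistence diagrams, and compare the diagrams with the bottleneck distance.

```python
import numpy as np
from ripser import ripser          # Vietoris-Rips persistent homology
from persim import bottleneck      # bottleneck distance between diagrams

rng = np.random.default_rng(0)
# Stand-ins for two documents' phrase point clouds (illustrative only).
doc_a = rng.normal(size=(60, 5))
doc_b = rng.normal(loc=0.5, size=(60, 5))

# 0- and 1-dimensional persistence diagrams for each point cloud.
dgms_a = ripser(doc_a, maxdim=1)["dgms"]
dgms_b = ripser(doc_b, maxdim=1)["dgms"]

# Compare the H1 diagrams; a small distance suggests the clouds share
# similar topological "shape".
print("bottleneck(H1):", bottleneck(dgms_a[1], dgms_b[1]))
```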
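For the abstract phrase vector, the abstract says only that it is constructed from a clustering of generalized phrases. One plausible reading, sketched below with scikit-learn's KMeans, is a histogram over phrase clusters; the random "phrase embeddings", the cluster count, and the use of KMeans are all assumptions, not the dissertation's algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Stand-in phrase representations for a whole corpus (illustrative random data).
corpus_phrases = rng.normal(size=(200, 16))
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(corpus_phrases)

# One document's phrases -> normalized histogram over the phrase clusters,
# giving a fixed-size vector without supervised training.
doc_phrases = rng.normal(size=(12, 16))
labels = km.predict(doc_phrases)
vec = np.bincount(labels, minlength=8).astype(float)
vec /= vec.sum()
print(vec)
```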
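Finally, for the generalized phrase graph signal, the sketch below shows only the basic graph-signal-processing machinery such a construction would draw on: a signal defined on the nodes of a phrase graph is expanded in the eigenbasis of the graph Laplacian (the graph Fourier transform). The toy adjacency matrix and node signal are illustrative.

```python
import numpy as np

# Toy symmetric adjacency matrix over four "phrase" nodes (illustrative).
W = np.array([
    [0.0, 1.0, 0.5, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.5, 1.0, 0.0, 2.0],
    [0.0, 0.0, 2.0, 0.0],
])
L = np.diag(W.sum(axis=1)) - W        # combinatorial graph Laplacian

# The eigenvectors of L play the role of Fourier modes on the graph.
eigvals, eigvecs = np.linalg.eigh(L)

x = np.array([1.0, 0.0, 2.0, 1.0])    # a signal on the nodes (toy values)
x_hat = eigvecs.T @ x                  # graph Fourier transform of x

# Small eigenvalues correspond to smooth components; such spectral
# coefficients yield fixed-size vector features without any training.
print(np.round(eigvals, 3), np.round(x_hat, 3))
```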