Real-time topic detection and tracking in microblog

Date
2020
Publisher
University of Delaware
Abstract
Topic Detection and Tracking (TDT) has been an important segment of Information Retrieval (IR) research. With the availability of faster streams of information, there is an increasing need for recommendation systems that automate the process of tracking these live streams and generating personalized digests of information. Microblog services such as Twitter, being a rich and rapid source of mostly textual data, provide an ideal platform for developing and evaluating real-time TDT and recommendation techniques. In this dissertation, we present a recommendation system that tracks Twitter’s live stream to generate real-time recommendations. Such systems come with a stringent list of requirements: high precision, low latency, and minimal redundancy. To decrease latency, we invert the usual query-to-document retrieval model: we create and index “profiles” formed around a user’s information need (query) and use tweets (documents) to retrieve matching profiles. This inverted model scores incoming tweets against the indexed profiles to generate several traditional IR scores, which are then processed through a multilayer neural network to estimate the probability that an incoming tweet is relevant to a profile. To minimize redundancy, we represent each profile as a collection of clusters of terms and emit a tweet only if it does not appear to match any cluster in the collection. We also study query difficulty prediction in microblogs and introduce new Query Difficulty Predictors (QDPs) that explore the relationship between a query and its n-grams. We extend the traditional QDP evaluation framework by removing the dependency between a predictor and a retrieval system, so that the QDPs are evaluated in a system-independent fashion.
We evaluate these predictors, along with a few existing web-based predictors, using a large number of retrieval systems submitted to the Text REtrieval Conference (TREC) Microblog and Real-Time Summarization tracks. We also use the same TREC data to evaluate our recommendation system. We conduct experiments on several 10-day-long live streams, each consisting of approximately 12 million English tweets, and show that filtering of relevant information even at this scale can be done almost instantaneously. We evaluate our work using standard TREC-style evaluation methodologies and also run crowdsourcing experiments on Amazon Mechanical Turk to further evaluate the efficacy and usefulness of our work. Our system performs well across all evaluation metrics. It identifies a good number of tweets from the stream in a fully automatic, real-time manner, within seconds of their generation time. From disaster management to keeping sports fans updated, many applications can benefit from our work, and it is a big step towards making real-time TDT possible.
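The inverted retrieval model and the cluster-based redundancy check described in the abstract can be illustrated with a minimal sketch. This is not the dissertation's implementation. The class `ProfileIndex`, the helpers `tokenize` and `jaccard`, and both thresholds are hypothetical names and values chosen for illustration. A simple term-overlap score stands in for the traditional IR scores and the neural relevance estimate, and each emitted tweet is stored as its own term cluster, which is a simplification of the clustering the abstract mentions.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (illustrative only)."""
    return re.findall(r"\w+", text.lower())


def jaccard(a, b):
    """Jaccard similarity between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class ProfileIndex:
    """Hypothetical inverted index: profiles are indexed, tweets act as queries."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> ids of profiles containing it
        self.profile_terms = {}            # profile id -> set of profile terms
        self.clusters = defaultdict(list)  # profile id -> term sets already emitted

    def add_profile(self, profile_id, query):
        terms = set(tokenize(query))
        self.profile_terms[profile_id] = terms
        for term in terms:
            self.postings[term].add(profile_id)

    def match(self, tweet_terms):
        """Score only the profiles that share at least one term with the tweet."""
        candidates = set()
        for term in tweet_terms:
            candidates |= self.postings.get(term, set())
        # Assumed stand-in score: fraction of profile terms present in the tweet.
        return {
            pid: len(self.profile_terms[pid] & tweet_terms) / len(self.profile_terms[pid])
            for pid in candidates
        }

    def emit(self, tweet, min_score=0.5, max_sim=0.6):
        """Return the ids of profiles for which the tweet is relevant and novel."""
        tweet_terms = set(tokenize(tweet))
        emitted = []
        for pid, score in self.match(tweet_terms).items():
            if score < min_score:
                continue  # not relevant enough to this profile
            if any(jaccard(tweet_terms, c) >= max_sim for c in self.clusters[pid]):
                continue  # too similar to something already emitted
            self.clusters[pid].append(tweet_terms)
            emitted.append(pid)
        return sorted(emitted)
```

Because the postings lists are keyed by profile terms, each incoming tweet touches only the profiles that share a term with it, which is the latency benefit the inverted model aims for. The novelty check then runs only against that profile's previously emitted term sets.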
Keywords
Microblog, Query difficulty prediction, Real-time search, Topic detection and tracking, Twitter