Towards automatic refereeing systems through deep event detection in soccer game videos

Date
2021
Publisher
University of Delaware
Abstract
Over the past decade, deep learning has become central to video analysis for understanding the actions and events that arise from human interactions. Detecting and recognizing events and their participants rapidly and precisely is an important and challenging problem in many areas, including sports, where accurate and timely judgments are expected in every game. This dissertation develops new systems for action and event detection in soccer games and makes progress towards automatic refereeing systems.

Firstly, we propose an approach for detecting events in untrimmed soccer game videos. The game videos are captured by multiple fixed cameras and do not contain shot boundaries. To obtain more precise results, we propose a network built upon inflated 3D (I3D) ConvNets for video action recognition to detect and differentiate these events, together with two novel grouping methods for localizing event boundaries. Comprehensive evaluations indicate that our approach achieves fairly good performance.

Secondly, using foul participants annotated on static frames at the moment of the foul, we present detection experiments for identifying foul subjects and objects. The experiments compare a popular object detector (Faster R-CNN) trained from scratch with a state-of-the-art pedestrian detector (Pedestron) fine-tuned on a pedestrian dataset. We also investigate how the predictions are affected by the choice of non-maximum suppression approach (NMS or soft-NMS) in post-processing. The results show satisfactory performance in detecting foul subjects and objects.

Furthermore, we detect foul participants and identify foul subjects and objects in video clips in a cluttered visual environment. Our system can differentiate foul participants from bystanders with high accuracy and localize them in a wide range of game situations. We also report reasonable accuracy in distinguishing the player committing the foul (the subject) from the object of the infraction. In addition, we experiment with camera calibration and clustering approaches on the torsos of filtered on-the-field persons to differentiate them by color. Quantitative analysis shows that the clustering approach achieves good performance.

Lastly, we build a neural network for action spotting over the entire game video in an end-to-end manner by integrating a state-of-the-art action recognition network (SlowFast) into a temporal action spotting network. Instead of extracting and storing all video features for temporal detection, we perform end-to-end training and inference. This architecture should be more efficient in practical applications because it reduces the number of separate training and inference steps. In future work, the network architecture can be modified to improve detection performance by enhancing and organizing the spatial and temporal contextual information that the action recognition component extracts from short video clips. We would also like to investigate attention-based methods that learn from the temporal distribution of action semantics to find significant features for action spotting in soccer game videos.
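
The abstract compares NMS and soft-NMS as post-processing for the foul-participant detectors. As a rough illustration of how the two differ, the following is a minimal pure-Python sketch, not the dissertation's implementation. The function names, the linear score decay, and the thresholds (`iou_thresh`, `score_thresh`) are assumptions chosen for the example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float],
        iou_thresh: float = 0.5) -> List[int]:
    """Hard NMS: discard any box overlapping a higher-scoring kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep


def soft_nms(boxes: List[Box], scores: List[float], iou_thresh: float = 0.5,
             score_thresh: float = 0.001) -> List[Tuple[int, float]]:
    """Linear soft-NMS: decay the scores of overlapping boxes instead of
    discarding them; drop a box only if its score falls below score_thresh."""
    scores = list(scores)  # work on a copy; the caller's list is untouched
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        keep.append((best, scores[best]))
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            if overlap > iou_thresh:
                scores[i] *= 1.0 - overlap  # linear decay (assumed variant)
        remaining = [i for i in remaining if scores[i] >= score_thresh]
    return keep
```

With two heavily overlapping detections, such as two players in contact during a foul, `nms` keeps only the higher-scoring box, whereas `soft_nms` keeps both and lowers the score of the second. This is one plausible reason the choice of suppression method can change predictions in crowded scenes.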
Keywords
Action detection, Event detection, Video classification