Deep learning algorithms for image understanding based on multiple cues
Date
2020
Publisher
University of Delaware
Abstract
Image understanding is a broad and widely explored field of study whose goal is to automatically learn different levels of abstraction from digital images or videos, as humans do. Examples of image understanding tasks include digit classification, object detection, object recognition, emotion recognition, social relationship recognition, group-level emotion recognition (GER), scene understanding, and event recognition.

Understanding the meaning and content of images remains a challenging problem in computer vision, especially for tasks that require extracting high-level abstractions. This dissertation therefore addresses the problem by designing deep learning algorithms that integrate multiple cues, with special attention to emotion recognition, social relationship recognition, and event recognition.

First, an efficient transfer learning-based deep neural network (DNN) is proposed as a baseline DNN for facial expression recognition (FER). Compared to DNNs trained from scratch, the proposed DNN leverages prior knowledge of human faces learned in the face recognition field. Specifically, since far less labeled data is available for training FER systems than for training face recognition models, a well-trained deep face recognition model is fine-tuned for FER in the wild, thereby leveraging the large amount of labeled data in face recognition datasets. Experiments show that the proposed approach improves on state-of-the-art performance on the GENKI-4K dataset.

Second, cues other than facial muscle movements are explored as important auxiliary information for comprehensive emotion recognition systems, and hybrid networks are proposed to address the multi-modal emotion recognition problem.
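The transfer-learning idea behind the baseline DNN can be sketched as follows. This is a minimal illustration, not the dissertation's actual model: a pretrained backbone (stood in for here by a fixed random ReLU projection) stays frozen, and only a new classification head is trained on a small labeled set, mimicking fine-tuning when FER data is scarce. All names, dimensions, and the toy data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen, pretrained face-recognition backbone:
# a fixed projection from raw inputs to a feature embedding.
D_IN, D_FEAT, N_CLASSES = 64, 16, 2
W_backbone = rng.standard_normal((D_IN, D_FEAT)) / np.sqrt(D_IN)

def extract_features(x):
    """Frozen feature extractor (never updated during fine-tuning)."""
    return np.maximum(x @ W_backbone, 0.0)  # ReLU embedding

# Small labeled set standing in for scarce FER data.
n = 200
X = rng.standard_normal((n, D_IN))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy binary "expression" label

# Train only the new classification head (softmax regression on features).
W_head = np.zeros((D_FEAT, N_CLASSES))
F = extract_features(X)
for _ in range(300):
    logits = F @ W_head
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    onehot = np.eye(N_CLASSES)[y]
    W_head -= 0.1 * F.T @ (p - onehot) / n  # gradient step on cross-entropy

acc = ((F @ W_head).argmax(axis=1) == y).mean()
print(f"head-only training accuracy: {acc:.2f}")
```

Only `W_head` is updated; the backbone weights are reused as-is, which is the essential point of the transfer-learning baseline.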
A hybrid network that incorporates multiple cues, such as global scene features, skeleton features of the group, local facial features, and visual attention features, is proposed for GER. This network was submitted to the 2018 Emotion Recognition in the Wild Grand Challenge and won first place in the GER sub-challenge. The proposed algorithm is further validated on a second GER dataset, a newly proposed event recognition dataset, and a classic event recognition dataset to demonstrate its effectiveness.

Next, a Siamese model based on the proposed baseline DNN, which incorporates scene features, is developed for social relationship recognition. Experimental results show that the proposed baseline DNN provides essential prior knowledge of faces, reduces pre-training time, and yields results that outperform the state of the art.

Finally, a graph neural network (GNN) for image understanding based on multiple cues is proposed. Unlike traditional feature-fusion and decision-fusion approaches, which neglect the fact that features can interact and exchange information, the proposed GNN passes information among features extracted by different models. Two image understanding tasks, GER and event recognition, which are highly semantic and require several deep models to synthesize multiple cues, were selected to validate the method. Experiments show that it achieves state-of-the-art performance on both tasks. In addition, a new group-level emotion recognition database is introduced and shared in this dissertation.
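The idea of letting cues from different models exchange information, rather than simply concatenating or averaging them, can be sketched as one round of message passing on a fully connected graph whose nodes are per-cue feature vectors. This is an illustrative sketch under assumed dimensions, not the dissertation's architecture; the cue names and weight matrices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Node features: one embedding per cue (e.g. scene, skeleton, face, attention).
n_cues, d = 4, 8
H = rng.standard_normal((n_cues, d))

# Fully connected graph without self-loops: every cue talks to every other cue.
A = np.ones((n_cues, n_cues)) - np.eye(n_cues)

W_self = rng.standard_normal((d, d)) / np.sqrt(d)
W_msg = rng.standard_normal((d, d)) / np.sqrt(d)

def message_pass(H):
    """One round: each node aggregates the mean of its neighbours' messages."""
    msgs = (A @ (H @ W_msg)) / A.sum(axis=1, keepdims=True)
    return np.maximum(H @ W_self + msgs, 0.0)  # ReLU node update

H1 = message_pass(H)        # cue embeddings after exchanging information
readout = H1.mean(axis=0)   # graph-level readout fed to a classifier head
print(H1.shape, readout.shape)
```

In contrast to plain feature fusion, each cue's updated embedding here depends on all the other cues, which is the interaction the GNN-based approach exploits.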
Keywords
Computer Vision, Deep Learning, Graph Neural Networks, Image Understanding, Multiple Cues