Designing information theoretic algorithms for improving deep learning generalization

Date
2022
Publisher
University of Delaware
Abstract
Deep learning has achieved significant success in various applications, especially in image recognition tasks. However, Deep Neural Networks (DNNs) commonly suffer from generalization problems: a DNN that performs well on the training data may not generalize well to new data, especially when the distribution of the new data differs from that of the training data. Consequently, the generalization problem constrains the applications of current deep learning techniques.

Adversarial Data Augmentation (ADA) is one of the most popular approaches to improving deep learning generalization and is based on min-max optimization. The fundamental idea of min-max optimization is that if a model can handle the worst-case scenario, it will achieve the best overall performance. However, the underlying problem is that generic ADA, as a general implementation of min-max optimization, does not explicitly formulate the worst generalization scenario. As a result, the fictitious samples generated by ADA are not guaranteed to form the worst generalization scenario, so generic ADA is not an optimal solution to the generalization problem.

The dissertation proposes an information theoretic algorithm to improve deep learning generalization in the context of supervised learning for image recognition. The main contributions of the dissertation are summarized in three aspects.

First, the dissertation explores a mutual information trade-off to characterize the generalization behavior of DNNs. Specifically, the mutual information between the training samples and the DNN representation of the input can be divided into two parts: (1) the mutual information between the labels and the representation, and (2) the label-irrelevant information learned by the DNN. Following the information bottleneck principle, if a DNN achieves a trade-off between the two information components, it achieves improved generalization.

Second, the dissertation specifies an information theoretic algorithm to improve generalization. The mutual information trade-off indicates that the worst generalization scenario corresponds to the case in which the DNN learns minimal information about the labels but a maximal amount of label-irrelevant information. Based on this formulation of the worst generalization scenario, the dissertation regularizes ADA with the mutual information trade-off to encourage ADA to generate worse generalization scenarios, in which DNNs learn less of the label information needed for the image recognition task but more irrelevant information, such as noise and perturbations. Based on min-max optimization, ADA regularized by the Mutual Information Trade-off (abbreviated ADA-MIT) generates worse generalization scenarios, and DNNs trained on these scenarios achieve better generalization.
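As a rough schematic of the trade-off and the regularized objective described above (the notation, including the weights \lambda and \beta and the perturbation budget \rho, is an illustrative assumption rather than the dissertation's exact formulation), consider a hidden representation T computed from input X with label Y, so that Y -> X -> T forms a Markov chain:

    % Mutual information decomposition under the Markov chain Y -> X -> T
    I(X; T) = I(Y; T) + I(X; T \mid Y)

    % One possible ADA objective regularized by the trade-off: the inner
    % maximization seeks perturbed samples \hat{x} near x that raise the loss,
    % suppress label information, and inflate label-irrelevant information.
    \min_{\theta} \; \max_{\hat{x}:\, d(\hat{x}, x) \le \rho} \;
        \ell\big(f_{\theta}(\hat{x}), y\big)
        \;-\; \lambda\, I\big(Y; T_{\hat{x}}\big)
        \;+\; \beta\, I\big(X; T_{\hat{x}} \mid Y\big)

Minimizing the outer objective over \theta then pushes the DNN back toward the information bottleneck trade-off: retaining I(Y; T) while discarding the label-irrelevant component I(X; T | Y).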
Third, the dissertation establishes an ad-hoc mutual information estimator for the ADA-MIT algorithm. The key to designing a feasible ADA-MIT algorithm is quantifying the generalization scenario by estimating (1) the mutual information between the labels and the DNN representation and (2) the label-irrelevant information learned by the DNN. To that end, the dissertation designs an ad-hoc mutual information estimator by studying the information flow in DNNs and proposing a probabilistic representation that explains the behavior of a hidden layer under certain assumptions. In addition, the dissertation specifies the conditions under which the ad-hoc estimator yields accurate estimates and discusses its pros and cons by comparing it with existing estimators. Notably, the dissertation highlights that although the ad-hoc estimator cannot guarantee accurate estimation in general, it is a convenient information theoretic tool for the ADA-MIT algorithm.

Finally, the dissertation conducts comprehensive experiments to examine the behavior of the ad-hoc estimator, validate the mutual information trade-off explanation of generalization, and demonstrate that ADA-MIT outperforms state-of-the-art algorithms for improving deep learning generalization on multiple benchmark datasets and DNNs with different network architectures.
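The abstract does not detail the ad-hoc estimator itself; purely as a point of reference, the sketch below shows a generic plug-in (binning) estimator of the two quantities involved, I(T; Y) and the label-irrelevant remainder H(T) - I(T; Y), applied to a hidden-layer activation matrix. The binning scheme, function names, and toy data are illustrative assumptions and do not reproduce the dissertation's estimator.

    import numpy as np

    def discretize(activations, n_bins=30):
        # Quantize each unit's activation into equal-width bins, then map every
        # activation vector to a single discrete symbol (its row of bin indices).
        edges = np.linspace(activations.min(), activations.max(), n_bins + 1)
        idx = np.digitize(activations, edges[1:-1])
        _, symbols = np.unique(idx, axis=0, return_inverse=True)
        return symbols.ravel()

    def entropy(symbols):
        # Plug-in (empirical) Shannon entropy in nats.
        _, counts = np.unique(symbols, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log(p)).sum())

    def mutual_information(symbols, labels):
        # I(T; Y) = H(T) + H(Y) - H(T, Y) on the discretized representation.
        joint = symbols.astype(np.int64) * (int(labels.max()) + 1) + labels
        return entropy(symbols) + entropy(labels) - entropy(joint)

    # Toy usage with random "activations" of a 64-unit hidden layer.
    rng = np.random.default_rng(0)
    T = rng.normal(size=(1000, 64))      # hidden-layer activations for 1000 samples
    Y = rng.integers(0, 10, size=1000)   # class labels
    t = discretize(T)
    i_ty = mutual_information(t, Y)
    print(f"I(T;Y) ~ {i_ty:.3f} nats; label-irrelevant part ~ {entropy(t) - i_ty:.3f} nats")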
Keywords
Deep neural networks, Deep learning, Adversarial data augmentation