Hybrid deep learning models for 2D computer vision

Date
2025
Publisher
University of Delaware
Abstract
Computer vision has evolved significantly with the introduction of Vision Transformers (ViTs), which employ self-attention mechanisms to capture global context by modeling relationships across the entire image. Previously, Convolutional Neural Networks (CNNs) were widely used, relying on convolutional kernels to capture hierarchical local spatial features such as textures, edges, and shapes. This thesis explores the integration of CNNs and ViTs into hybrid models that balance the local feature extraction capabilities of CNNs with the global context modeling abilities of ViTs, aiming to improve performance on dense prediction tasks such as Salient Object Detection (SOD) and Edge/Contour Detection.

This dissertation presents several innovative hybrid deep learning models that combine the strengths of CNNs and ViTs while overcoming their limitations. First, I present ConvSegFormer, a convolution-aided Vision Transformer that efficiently learns discriminative features from minimal data and competes with state-of-the-art CNN architectures. Second, I present SODAWideNet, a hybrid deep learning model that incorporates self-attention into a convolutional architecture for SOD; by using parallel attention and convolutional branches, it achieves competitive results without large-scale pre-training. Third, I present SODAWideNet++, a modified SODAWideNet that merges the attention and convolution branches into a single branch and uses dilated convolutions for feature extraction; with a COCO-based pre-training pipeline, it delivers excellent performance with fewer trainable parameters. Fourth, I present SODDCNet, a CNN architecture that uses attention-generated convolutional weights to capture localized, instance-specific semantic features, achieving state-of-the-art results in SOD and video SOD. Fifth, I present USODFuseNet, an RGB-D SOD model featuring a dual-branch encoder with an efficient feature merging operation, which excels at underwater and RGB-D SOD tasks.
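To make the recurring hybrid design concrete, the sketch below illustrates the parallel attention-and-convolution idea mentioned above for SODAWideNet: a convolutional branch extracts local features while a self-attention branch models global context, and the two outputs are fused. This is a minimal, hypothetical PyTorch sketch; the class name ParallelConvAttentionBlock, the specific layer choices, and the additive fusion are illustrative assumptions, not the dissertation's actual implementation.

import torch
import torch.nn as nn

class ParallelConvAttentionBlock(nn.Module):
    """Illustrative hybrid block: a convolutional branch for local features
    runs in parallel with a multi-head self-attention branch for global
    context, and the two outputs are fused by summation.
    (Hypothetical sketch; not the thesis implementation.)"""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Local branch: 3x3 convolution preserving spatial resolution
        self.conv_branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Global branch: multi-head self-attention over flattened spatial positions
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.conv_branch(x)

        # Treat each spatial position as a token: (B, C, H, W) -> (B, H*W, C)
        tokens = self.norm(x.flatten(2).transpose(1, 2))
        global_ctx, _ = self.attn(tokens, tokens, tokens)
        global_ctx = global_ctx.transpose(1, 2).reshape(b, c, h, w)

        # Fuse local and global features
        return local + global_ctx

if __name__ == "__main__":
    block = ParallelConvAttentionBlock(channels=64)
    features = torch.randn(2, 64, 32, 32)
    print(block(features).shape)  # torch.Size([2, 64, 32, 32])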
Keywords
Salient Object Detection, Convolutional Neural Networks, Vision Transformers, Computer vision