Hybrid deep learning models for 2D computer vision

Author(s): Dulam, Rohit Venkata Sai
Date Accessioned: 2025-07-15T11:29:32Z
Date Available: 2025-07-15T11:29:32Z
Publication Date: 2025
SWORD Update: 2025-07-14T07:02:15Z
Abstract: Computer vision has significantly evolved with the introduction of Vision Transformers (ViTs), which employ self-attention mechanisms to capture global context by considering relationships across the entire image. Previously, Convolutional Neural Networks (CNNs) were widely used, utilizing convolutional kernels to effectively capture hierarchical local spatial features such as textures, edges, and shapes. This thesis explores the integration of CNNs and ViTs into hybrid models to balance the local feature extraction capabilities of CNNs with the global context modeling abilities of ViTs, aiming to enhance performance in dense prediction tasks such as Salient Object Detection (SOD) and Edge/Contour Detection.

This dissertation presents several innovative hybrid deep learning models that combine the strengths of CNNs and ViTs while overcoming their limitations. First, I present ConvSegFormer, a convolution-aided Vision Transformer that efficiently learns discriminative features from minimal data, competing well with state-of-the-art CNN architectures. Second, I present SODAWideNet, a hybrid deep learning model that incorporates self-attention into a convolutional architecture for SOD. It achieves competitive results without large-scale pre-training by using parallel attention and convolutional branches. Third, I present SODAWideNet++, a modified SODAWideNet model that merges the attention and convolution branches into a single branch, utilizing dilated convolutions for feature extraction. With a COCO-based pre-training pipeline, it delivers excellent performance with fewer trainable parameters. Fourth, I present SODDCNet, a CNN architecture that uses attention-generated convolutional weights to capture localized, instance-specific semantic features, achieving state-of-the-art results in SOD and video SOD. Fifth, I present USODFuseNet, an RGB-D SOD model that excels in underwater and RGB-D tasks, featuring a dual-branch encoder architecture with an efficient feature merging operation.
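The parallel attention-and-convolution design the abstract attributes to SODAWideNet can be illustrated with a minimal NumPy sketch. This is a hypothetical toy illustration, not the dissertation's actual code: a local branch (3x3 convolution) and a global branch (single-head self-attention over all pixels of a single-channel map) are computed in parallel and fused by elementwise addition.

```python
import numpy as np

def conv3x3(x, kernel):
    """Local branch: 3x3 convolution with zero padding (same output size)."""
    h, w = x.shape
    xp = np.pad(x, 1)
    out = np.zeros((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * kernel)
    return out

def self_attention(x):
    """Global branch: softmax self-attention treating each pixel as a token."""
    h, w = x.shape
    tokens = x.reshape(-1, 1).astype(float)       # (h*w, 1) pixel tokens
    scores = tokens @ tokens.T                    # (h*w, h*w) pairwise similarity
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability for softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return (attn @ tokens).reshape(h, w)          # globally mixed features

def hybrid_block(x, kernel):
    """Fuse the local and global branches by elementwise addition."""
    return conv3x3(x, kernel) + self_attention(x)
```

The convolution branch responds only to a pixel's 3x3 neighborhood, while the attention branch lets every pixel aggregate information from the whole map; summing them is one simple fusion choice among several (concatenation or gating are common alternatives).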
Advisor: Kambhamettu, Chandra
Degree: Ph.D.
Department: University of Delaware, Department of Computer and Information Sciences
Unique Identifier: 1528564402
URL: https://udspace.udel.edu/handle/19716/36350
Language: en
Publisher: University of Delaware
URI: https://www.proquest.com/pqdtlocal1006271/dissertations-theses/hybrid-deep-learning-models-2d-computer-vision/docview/3229758762/sem-2?accountid=10457
Keywords: Salient Object Detection; Convolutional Neural Networks; Vision Transformers; Computer vision
Title: Hybrid deep learning models for 2D computer vision
Type: Thesis
Files
Original bundle
Name:
Dulam_udel_0060D_16580.pdf
Size:
14.73 MB
Format:
Adobe Portable Document Format
License bundle
Name:
license.txt
Size:
2.22 KB
Format: Item-specific license agreed upon at submission