Towards multi-scale inter-frame attention to improve deep learning tasks
Date
2025
Publisher
University of Delaware
Abstract
Access to specialized medical screening remains a challenge for individuals with sickle cell disease (SCD), particularly those in low-income and rural communities, where advanced diagnostic tools and expert evaluations are limited. In ophthalmology, diagnosis of Sickle Cell Retinopathy (SCR) relies on ophthalmologic evaluation, including Optical Coherence Tomography (OCT) scans, but manual interpretation is prone to subjectivity, fatigue-induced errors, and inconsistencies across clinicians. Similarly, video-based event analysis, such as reconstructing crime scenes from fragmented surveillance footage, is a time-intensive process that requires manual ordering and interpretation of unordered clips. These challenges highlight the need for automated solutions that enhance medical diagnostics and video-based decision-making.

To address these issues, we propose Multi-scale Inter-frame Attention (MIA), a novel framework that enhances deep learning models for processing volumetric and video datasets. Our approach leverages spatial and spatio-temporal attention mechanisms to improve feature extraction and representation learning. We integrate MIA into two specialized models: the Cross-Scan Attention Transformer (CSAT) for SCR detection and the Sequential Ordering of Frames in Time (SOFT) for video-based action recognition. Experimental results demonstrate that CSAT+MIA outperforms conventional object detection models in diagnosing SCR, while SOFT+MIA enhances action recognition, particularly in temporally shuffled scenarios.

Beyond domain-specific improvements, our research aims to establish a unified deep-learning method capable of capturing both inter-frame and intra-frame relationships for broader applications in medical imaging, surveillance, and video understanding. By integrating multi-scale inter-frame attention, we advance the field of automated diagnosis and event reconstruction, paving the way for more efficient, reliable, and intelligent decision-making systems.
Keywords
Cross scan attention transformer, Sickle cell retinopathy, Video understanding, Sickle cell disease