Confronting the privacy leak epidemic in the machine learning era
Date
2025
Publisher
University of Delaware
Abstract
As machine learning becomes increasingly integrated into modern digital infrastructure and mobile applications, concerns about user data privacy have grown significantly. Advanced ML models frequently rely on sensitive personal data to deliver intelligent and personalized services. However, this reliance also introduces serious risks: user data may be collected without consent, misused during model training, or leaked through model predictions. Addressing these threats requires a comprehensive understanding of how privacy risks manifest across the machine learning pipeline. This dissertation presents a systematic investigation into three critical dimensions of privacy in ML systems. First, we identify and characterize privacy leakage sources in both application-layer services and model behaviors. We analyze the mobile notification ecosystem as an overlooked but pervasive channel for covert data harvesting, and we introduce a novel self-comparison membership inference attack to expose how trained models reveal information about their training datasets. Second, we develop mechanisms for detecting unauthorized data usage. We propose new inference-based auditing techniques for semi-supervised models and introduce a non-intrusive, information-theoretic framework for dataset-level auditing in already trained models. Third, we design defense strategies to prevent privacy leakage, focusing on link inference attacks in Graph Neural Networks. We develop a structure-aware defense that obfuscates graph topology while preserving model utility. Collectively, this dissertation offers a unified view of data privacy risks in ML services. It contributes both empirical techniques and theoretical foundations for identifying leakage, detecting misuse, and defending against privacy threats, laying the groundwork for building secure and accountable machine learning systems.
Keywords
Data privacy, Machine learning, Modern digital infrastructure