Detecting distributional discrepancies using kernel landmarks
Date
2024
Authors
Publisher
University of Delaware
Abstract
Slicing via one-dimensional linear or nonlinear projections has emerged as a promising technique for improving the computational and statistical efficiency of the Wasserstein distance. Our work contributes to this field by introducing a straightforward computational method for slicing the kernel Wasserstein distance, making it usable as an interpretable two-sample test. Our approach, termed the max-sliced landmark Wasserstein distance (MLW), selects a single point from the samples to serve as the support vector that defines the slice. We conduct experiments on adapted versions of the MNIST and CIFAR-10 datasets, comparing the kernel landmark approach with the widely used maximum mean discrepancy (MMD) metric. The results show that MLW is statistically more powerful than MMD and is more interpretable: the landmark and the points near it correspond to an imbalance in the distribution. Our investigation covers a variety of distribution shift scenarios and examines the impact of different representation learning strategies.

To handle multiple localized discrepancies, we introduce a novel metric, the Distributional Landmark Wasserstein distance (DLW), which addresses the shortcomings of previous sliced distance metrics by optimally distributing slices to balance discrepancy against diversity through an anti-concentration constraint. DLW is a robust probability metric with favorable statistical and computational properties. To optimize the landmark distribution, we leverage the Frank-Wolfe algorithm with a closed-form step size. Our experiments demonstrate DLW's superiority over baseline methods in detecting multi-mode imbalances. Moreover, sampling from the landmark distribution highlights instances related to imbalances across multiple modes.

Building on the fact that, for certain kernels, the means and covariances of kernel embeddings completely characterize distributions, we explore how the Bures distance, a simplified form of the Wasserstein distance for zero-mean Gaussian distributions, can be used to obtain efficient, interpretable divergences. Specifically, we propose and validate kernel approximations for the landmark-sliced kernel Bures distance whose computational cost scales linearly with the sample size. We investigate several kernel approximation techniques, including random Fourier features and the recursive Nyström method, which reduce the computation from quadratic to linear in the sample size. Key contributions include a scalable Distributional Landmark Bures (DLB) distance algorithm, its empirical validation, and consistent improvements over baselines such as the MMD witness function across diverse learned representations.

Finally, we propose a greedy algorithm for identifying multiple landmarks using a deflation approach, the Fast Landmark Bures Distance (FastLBD), which removes the variation along each selected landmark slice. This deflation approach can detect discrepancies in datasets with an unknown number of imbalanced modes: the method iteratively identifies significant landmarks and deflates them, eliminating the need for an anti-concentration term. We also introduce a self-tuned variant that uses the “knee” method to determine the number of landmarks needed to cover the distributional discrepancies. Empirical results show that our approach outperforms traditional methods across various kernel approximation techniques and learning representations.
This method offers a scalable and self-tuned solution for analyzing complex datasets with unknown mode distributions, improving multi-mode discrepancy detection.
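
To make the landmark idea concrete, the following is a minimal sketch (not the exact MLW estimator from this dissertation) of a max-sliced landmark statistic: each candidate landmark z taken from the pooled sample defines a one-dimensional slice through its kernel feature k(z, ·), both samples are projected onto that slice, and the landmark maximizing the 1-D Wasserstein distance between the projections is returned. The Gaussian kernel, the bandwidth gamma, and the exhaustive search over candidates are assumptions made for illustration.

    import numpy as np
    from scipy.stats import wasserstein_distance

    def landmark_slice(z, sample, gamma=1.0):
        # Project a sample onto the 1-D slice defined by the kernel feature
        # k(z, .) of a candidate landmark z (Gaussian kernel assumed here).
        sq_dists = np.sum((sample - z) ** 2, axis=1)
        return np.exp(-gamma * sq_dists)

    def max_sliced_landmark_w(X, Y, gamma=1.0):
        # Exhaustively score every pooled point as a landmark and keep the one
        # whose slice gives the largest 1-D Wasserstein distance between X and Y.
        candidates = np.vstack([X, Y])
        best_val, best_z = -np.inf, None
        for z in candidates:
            val = wasserstein_distance(landmark_slice(z, X, gamma),
                                       landmark_slice(z, Y, gamma))
            if val > best_val:
                best_val, best_z = val, z
        return best_val, best_z

The returned landmark is the interpretable output: points whose slice values fall where the two projected distributions disagree are the ones implicated in the imbalance.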
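
The DLW objective itself is not spelled out in this abstract, so the next sketch only illustrates the optimization pattern it mentions: a Frank-Wolfe iteration over the probability simplex with an exact, closed-form step size. The quadratic penalty standing in for the anti-concentration constraint, and the vector d of per-landmark discrepancy scores, are illustrative assumptions.

    import numpy as np

    def frank_wolfe_landmark_weights(d, lam=0.1, iters=200):
        # Maximize f(p) = d.p - lam * ||p||^2 over the probability simplex.
        # The quadratic penalty spreads mass over several high-discrepancy
        # landmarks instead of concentrating it on a single one.
        n = len(d)
        p = np.full(n, 1.0 / n)              # start from the uniform distribution
        for _ in range(iters):
            grad = d - 2.0 * lam * p          # gradient of the concave objective
            s = np.zeros(n)
            s[np.argmax(grad)] = 1.0          # linear maximization oracle: best vertex
            direction = s - p
            denom = 2.0 * lam * (direction @ direction)
            if denom <= 1e-12:
                break
            # Exact line search is closed-form because the objective is quadratic.
            t = float(np.clip((grad @ direction) / denom, 0.0, 1.0))
            if t == 0.0:
                break
            p = p + t * direction
        return p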
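
For the kernel Bures distance, an explicit feature map makes the cost linear in the sample size. The sketch below uses random Fourier features for a Gaussian kernel (the bandwidth gamma and the number of features D are illustrative) and then evaluates the standard Bures distance between the two D x D feature second-moment matrices; the recursive Nyström alternative mentioned above would simply replace the feature construction.

    import numpy as np
    from scipy.linalg import sqrtm

    def rff_features(X, W, b):
        # Random Fourier features approximating a Gaussian kernel:
        # phi(x) = sqrt(2/D) * cos(W x + b).
        D = W.shape[0]
        return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

    def bures_distance(A, B):
        # Bures distance between PSD matrices:
        # B(A, B)^2 = tr(A) + tr(B) - 2 tr((A^{1/2} B A^{1/2})^{1/2}).
        Ah = sqrtm(A)
        cross = sqrtm(Ah @ B @ Ah)
        val = np.trace(A) + np.trace(B) - 2.0 * np.real(np.trace(cross))
        return np.sqrt(max(float(val), 0.0))

    def approx_kernel_bures(X, Y, D=256, gamma=1.0, seed=0):
        # Build D random features (cost linear in the number of samples) and
        # compare the D x D second-moment matrices of the two feature sets.
        rng = np.random.default_rng(seed)
        W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(D, X.shape[1]))
        b = rng.uniform(0.0, 2.0 * np.pi, size=D)
        PX, PY = rff_features(X, W, b), rff_features(Y, W, b)
        CX = PX.T @ PX / len(PX)
        CY = PY.T @ PY / len(PY)
        return bures_distance(CX, CY)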
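
Finally, a sketch of the greedy deflation idea behind FastLBD, again working on explicit feature matrices (for example, the random Fourier features above). Scoring each candidate slice by a simple mean difference, rather than the Bures-based criterion used in the dissertation, is an assumption made to keep the example short; the deflation step and the “knee” rule for choosing the number of landmarks are the parts being illustrated.

    import numpy as np

    def greedy_deflated_landmarks(PX, PY, n_max=10):
        # Greedy landmark search with deflation on explicit feature matrices.
        # After each landmark is chosen, the variation along its slice is
        # removed so the next search looks at a different discrepancy mode.
        PX, PY = PX.astype(float), PY.astype(float)
        pooled = np.vstack([PX, PY])
        landmarks, scores = [], []
        for _ in range(n_max):
            norms = np.linalg.norm(pooled, axis=1)
            norms[norms < 1e-12] = 1.0
            dirs = pooled / norms[:, None]                    # unit slice directions
            diffs = np.abs(dirs @ (PX.mean(0) - PY.mean(0)))  # simple per-slice score
            j = int(np.argmax(diffs))
            landmarks.append(j)
            scores.append(float(diffs[j]))
            u = dirs[j]
            for M in (PX, PY, pooled):                        # deflation step
                M -= np.outer(M @ u, u)
        return landmarks, scores

    def knee_point(scores):
        # Self-tuning rule: keep the landmarks up to the "knee" of the score
        # curve, i.e. the point farthest from the chord joining the first and
        # last scores.
        y = np.asarray(scores, dtype=float)
        if len(y) < 3:
            return len(y)
        x = np.arange(len(y), dtype=float)
        p0, p1 = np.array([x[0], y[0]]), np.array([x[-1], y[-1]])
        seg = p1 - p0
        pts = np.stack([x, y], axis=1) - p0
        dist = np.abs(pts[:, 0] * seg[1] - pts[:, 1] * seg[0]) / max(np.linalg.norm(seg), 1e-12)
        return int(np.argmax(dist)) + 1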
Keywords
Wasserstein distance, Maximum mean discrepancy, Distributional Landmark Wasserstein, Kernel approximation, Distributional Landmark Bures
