Magnitude, concentration, and metric complexity in phylogenetics and information theory

Date
2025
Publisher
University of Delaware
Abstract
The results in this thesis contribute to two areas: phylogenetic inference and information theory. In the first part, we investigate probabilistic models and problems that arise when estimating evolutionary parameters from large biological datasets. The classical setting for inference assumes that the number of independent sites, k, is much larger than the number of leaves, n, of a phylogenetic tree. In Chapter 2, we add to the list of examples of how estimation can go wrong when k is much smaller than n: we show that on the balanced claw tree under the binary non-symmetric model, the maximum likelihood estimator of the evolutionary parameters is inconsistent when k = 1.

Consistency results in the classical case rely on concentration inequalities for independent random variables. When k ≪ n, however, the dependencies between the observations can no longer be ignored. In Chapter 3, we therefore prove McDiarmid-type concentration inequalities for leaf marginals of tree-indexed Markov measures and Bayesian networks, which may also be of interest to the machine learning community.

Chapter 4 forms the bridge from phylogenetics to information theory, via the problem of ancestral state reconstruction. For discrete models, the existence of consistent estimators of the root state is known to be equivalent to a density assumption near the root, called the Big Bang condition [12]. In the continuous case, consistency of the maximum likelihood estimate (MLE) based on observations Y_n from a Brownian motion (BM) model is known to be equivalent to the condition Magnitude(V_n) → +∞, where V_n is the covariance matrix of Y_n. These geometric properties were shown to be equivalent in [8] using probabilistic tools. In this chapter, we reprove this equivalence in the language of magnitude [1], and lay a geometric framework for maximum likelihood estimation of the root state.
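As an illustrative sketch (a standard computation, not taken from the thesis): for observations Y_n ~ N(μ·1, V_n) from a BM model, the MLE of the scalar root state μ is the generalized least squares estimator, whose variance is 1 / (1ᵀ V_n⁻¹ 1). The quantity 1ᵀ V_n⁻¹ 1 — the sum of the entries of V_n⁻¹ — must therefore diverge for the MLE to be consistent. The toy code below, with hypothetical covariance matrices, contrasts a star tree (independent leaves) with a tree whose leaves share a long trunk:

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def matrix_magnitude(V):
    """Sum of the entries of V^{-1}, i.e. 1^T V^{-1} 1."""
    n = len(V)
    return sum(solve(V, [1.0] * n))

# Star tree with n leaves, each on its own branch of length t: V = t * I.
# Magnitude = n / t, which diverges, so Var(MLE) = t / n -> 0: consistent.
n, t = 100, 1.0
star = [[t if i == j else 0.0 for j in range(n)] for i in range(n)]
print(matrix_magnitude(star))  # 100.0, so Var(MLE) = 0.01

# Highly correlated leaves (long shared trunk, short tips): the magnitude
# stays bounded as n grows, so the MLE variance does not vanish.
corr = [[1.0 if i == j else 0.99 for j in range(n)] for i in range(n)]
print(matrix_magnitude(corr))  # ~1.01, so Var(MLE) stays near 1
```

The second covariance matrix shows why adding more leaves does not always help: nearly all of the signal about the root travels through the shared trunk, so the extra observations are almost redundant.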
The magnitude of a finite metric space is a quantification of its “effective” size [1]. An open problem in the theory of magnitude and diversity is to find an operational meaning for these quantities in information theory [4]. In the theory of lossless coding, a fundamental problem is to quantify the size of the smallest possible set of n-letter words from a discrete memoryless source carrying the largest possible likelihood. This establishes the lowest possible “cost” to be paid in order to communicate or compress n-letter strings of data with the least possible error. The Asymptotic Equipartition Property (AEP) provides a concrete candidate for such a set, namely the strongly or weakly typical sets [5]. These sets also give an entry point for metric generalizations of the lossless coding problem to alphabets equipped with a notion of distance between letters.

In Chapter 5, we lay the foundations for a metric generalization of the AEP in an effort to give operational meanings to magnitude and diversity. We show that for alphabets with two letters, the magnitude of type classes is approximately 2^{nH_K(p)}, where H_K(p) is the metric 1-complexity, the logarithm of the diversity of order 1. Methods to extend these results to alphabets of any size are also described, and in Chapter 6 we outline the requirements for proving a metric generalization of the AEP, which will be completed in future work.
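As a concrete illustration of the central definition (a standard computation, not specific to the thesis): the magnitude of a finite metric space (X, d) is the sum of the entries of the inverse of the similarity matrix Z, where Z_ij = e^{-d(x_i, x_j)}. For a two-point space at distance t this gives 2 / (1 + e^{-t}), interpolating between one "effective point" as t → 0 and two well-separated points as t → ∞:

```python
import math

def magnitude(points, dist):
    """Magnitude of a finite metric space: sum of the entries of Z^{-1},
    where Z[i][j] = exp(-dist(points[i], points[j]))."""
    n = len(points)
    Z = [[math.exp(-dist(points[i], points[j])) for j in range(n)]
         for i in range(n)]
    # Solve Z x = 1 by Gauss-Jordan elimination; magnitude = sum(x).
    M = [row[:] + [1.0] for row in Z]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return sum(M[i][n] / M[i][i] for i in range(n))

d1 = lambda x, y: abs(x - y)  # metric on the real line

# Two points at distance t: magnitude = 2 / (1 + e^{-t}).
print(magnitude([0.0, 0.01], d1))  # ~1.005: nearly one effective point
print(magnitude([0.0, 10.0], d1))  # ~2.0:   two well-separated points
```

The "effective size" reading is visible here: as the scale of the space grows, the magnitude grows from 1 toward the cardinality of the point set.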
Keywords
Brownian Motion, Maximum Likelihood Estimate, Phylogenetic inference, Information theory, Evolutionary parameters