Distance-based methods in Machine Learning

Distance-based methods are a varied and widely used set of techniques for statistical learning that work by minimising the distance or discrepancy between probability distributions. One key advantage of these techniques is that the properties of the resulting model depend on the distance chosen. Crafting distances that encode desirable properties, such as stability and robustness, is a promising area of research.

The workshop will cover a broad range of statistical and machine learning methods, including but not limited to

parameter estimation,
generalised Bayes,
hypothesis testing,
optimal transport,

which are based on statistical distances such as the Maximum Mean Discrepancy (MMD), Kernel Stein Discrepancy (KSD), score matching, Wasserstein distances, Sinkhorn divergences, Kullback-Leibler (KL) divergence, and more.
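
As a rough illustration of one distance from this list, the sketch below estimates the squared MMD between two one-dimensional samples in plain Python. The Gaussian kernel, the bandwidth of 1.0, the sample sizes, and the function names `gaussian_kernel` and `mmd_squared_unbiased` are assumptions chosen for the example, not something prescribed by the workshop or by any talk.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two real numbers."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared_unbiased(xs, ys, bandwidth=1.0):
    """Unbiased U-statistic estimate of the squared MMD between two 1-D samples."""
    m, n = len(xs), len(ys)
    if m < 2 or n < 2:
        raise ValueError("each sample needs at least two points")
    # Within-sample averages exclude the diagonal terms (i == j).
    k_xx = sum(
        gaussian_kernel(xs[i], xs[j], bandwidth)
        for i in range(m) for j in range(m) if i != j
    ) / (m * (m - 1))
    k_yy = sum(
        gaussian_kernel(ys[i], ys[j], bandwidth)
        for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    # Cross-sample average uses every pair.
    k_xy = sum(
        gaussian_kernel(x, y, bandwidth) for x in xs for y in ys
    ) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy


# Hypothetical data: two samples from N(0, 1) and one from N(2, 1).
rng = random.Random(0)
same_a = [rng.gauss(0.0, 1.0) for _ in range(200)]
same_b = [rng.gauss(0.0, 1.0) for _ in range(200)]
shifted = [rng.gauss(2.0, 1.0) for _ in range(200)]

mmd_same = mmd_squared_unbiased(same_a, same_b)
mmd_shifted = mmd_squared_unbiased(same_a, shifted)
print(f"same distribution:    {mmd_same:.4f}")
print(f"shifted distribution: {mmd_shifted:.4f}")
```

The estimate should be close to zero when both samples come from the same distribution and clearly positive when one sample is shifted. Because the estimator is unbiased, small negative values are possible.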

~~Registration and submissions open~~	8 May 2023
~~Submissions close~~	1 June 2023
~~Author notifications~~	9 June 2023
~~Registration closes~~	23 June 2023
~~Workshop~~	27-28 June 2023

Programme

Tuesday 27 June

10:00–10:30	☕ Registration and morning coffee
10:30–10:45	👋 Welcome from the organisers
10:45–11:30	Scaling up kernel-based structured output prediction with low-rank approaches. Florence d'Alché-Buc (Télécom Paris, Institut Polytechnique de Paris) Surrogate regression methods offer a powerful and flexible solution to structured output prediction by embedding the output in a Hilbert space. In this talk we mainly focus on one of the simplest and oldest surrogate approaches, which leverages kernels in the input space as well as in the output space. While enjoying strong statistical guarantees, these surrogate kernel methods require substantial computation, for training as well as for inference. We propose to revisit them by applying low-rank projections in the input and output feature spaces to reduce the complexity. Low-rank projection operators based on sketching are presented and the statistical properties of the resulting novel estimator are studied in terms of excess risk bounds. From a computational perspective, we show that the two approximations have distinct but complementary impacts: sketching the input kernel mostly reduces training time, while sketching the output kernel decreases the inference time. In conclusion, we identify other surrogate models based on other losses/distances where this approach could be relevant as well.
11:30–12:15	Generalised Bayesian Inference for Intractable Likelihoods. Takuo Matsubara (The Alan Turing Institute & Newcastle University) Generalised Bayesian inference updates prior beliefs using a loss function, rather than a likelihood, and can therefore be used to confer robustness against possible misspecification of the likelihood. Here we consider generalised Bayesian inference with a Stein discrepancy as a loss function, motivated by applications in which the likelihood contains an intractable normalisation constant. In this context, the Stein discrepancy circumvents evaluation of the normalisation constant and produces generalised posteriors that are either closed form or accessible using standard Markov chain Monte Carlo. On a theoretical level, we show consistency, asymptotic normality, and bias-robustness of the generalised posterior, highlighting how these properties are impacted by the choice of Stein discrepancy. Then, we provide numerical experiments on a range of intractable distributions, including applications to kernel-based exponential family models and non-Gaussian graphical models.
12:15–13:30	🥗 Lunch
13:30–14:15	Label Shift Quantification via Distribution Feature Matching. Badr-Eddine Chérief-Abdellatif (Sorbonne Université & Université Paris Cité) Quantification learning deals with the task of estimating the target label distribution under label shift. In this talk, we present a unifying framework, distribution feature matching (DFM), that recovers as particular instances various estimators introduced in previous literature. We derive a general performance bound for DFM procedures and extend this analysis to study robustness of DFM procedures in the misspecified setting under departure from the exact label shift hypothesis, in particular in the case of contamination of the target by an unknown distribution. We also illustrate the theoretical results with a numerical study.
14:15–14:35	Robust Empirical Bayes for Gaussian Processes. Masha Naslidnyk (University College London) [video] For many contemporary statistical machine learning problems, model misspecification is pervasive and impactful. In particular, inferences are not robust, and uncertainty quantification becomes brittle. These issues particularly affect nonparametric models such as Gaussian processes, and are exacerbated by the paradigm shift in Bayesian inference from bespoke models and small data regimes to black-box settings and increasingly large and under-curated datasets. While distance-based estimation is a powerful remedy for this setting, previously proposed distances between conditional distributions are intractable. To resolve this, we introduce a computationally tractable distance on the space of conditional probability distributions, which we call the expected maximum conditional mean discrepancy. The theoretical properties of the resulting distance-based estimator are investigated in detail. While the estimator is of general interest, we focus on its application as a robust empirical Bayes estimator in Gaussian process models. Specifically, we demonstrate that it produces reliable uncertainty quantification for regression problems, computer model emulation, and Bayesian optimisation.
14:35–14:55	Composite Goodness-of-Fit Tests with Kernels. Oscar Key (University College London) [video] Model misspecification can create significant challenges for the implementation of probabilistic models, and this has led to the development of a range of robust methods which directly account for this issue. However, whether these more involved methods are required will depend on whether the model is really misspecified, and there is a lack of generally applicable methods to answer this question. In this talk I will discuss how this question can be answered using composite goodness-of-fit tests, which check whether the data comes from any distribution in some parametric family. I will introduce our two kernel-based implementations of this test, based on the maximum mean discrepancy and kernel Stein discrepancy. I will preview our main result from the paper: that we are able to both estimate the parameter and conduct the test on the same data, without splitting, while maintaining a correct test level.
14:55–15:15	MMD-FUSE: Learning and Combining Kernels for Two-Sample Testing Without Data Splitting. Antonin Schrab (University College London) We propose novel statistics which maximise the power of a two-sample test based on the Maximum Mean Discrepancy (MMD), by adapting over the set of kernels used in defining it. For finite sets, this reduces to combining (normalised) MMD values under each of these kernels via a weighted soft maximum. Exponential concentration bounds are proved for our proposed statistics under the null and alternative. We further show how these kernels can be chosen in a data-dependent but permutation-independent way, in a well-calibrated test, avoiding data splitting. This technique applies more broadly to general permutation-based MMD testing, and includes the use of deep kernels with features learnt using unsupervised models such as auto-encoders. We highlight the applicability of our MMD-FUSE test on both synthetic low-dimensional and real-world high-dimensional data, and compare its performance in terms of power against current state-of-the-art kernel tests.
15:15–15:45	☕ Afternoon coffee
15:45–16:15	🪧 Poster session
16:15–17:00	Variational Gradient Descent using Local Linear Models. Song Liu (University of Bristol) [video] Stein Variational Gradient Descent (SVGD) can transport particles along trajectories that reduce the KL divergence between the target and particle distribution but requires the target score function to compute the update. We introduce a new perspective on SVGD that views it as a local estimator of the reversed KL gradient flow. This perspective inspires us to propose new estimators that use local linear models to achieve the same purpose. The proposed estimators can be computed using only samples from the target and particle distribution without needing the target score function. Our proposed variational gradient estimators utilize local linear models, resulting in computational simplicity while maintaining effectiveness comparable to SVGD in terms of estimation biases. Additionally, we demonstrate that under a mild assumption, the estimation of high-dimensional gradient flow can be translated into a lower-dimensional estimation problem, leading to improved estimation accuracy. We showcase our proposed estimator by transporting non-smiling images in the CelebA dataset to mimic the distribution of smiling images. The resulting algorithm (dubbed SmileVGD) shows promising performance when transporting images.
17:00–17:45	Convergence control with kernel Stein discrepancies. Alessandro Barp (University of Cambridge) [video] Maximum mean discrepancies (MMDs) like the kernel Stein discrepancy (KSD) have grown central to a wide range of applications, including hypothesis testing, sampler selection, distribution approximation, and variational inference. In each setting, these kernel-based discrepancy measures are required to (i) separate a target P from other probability measures or even (ii) control weak convergence to P. In this talk we discuss the geometry of KSDs and derive sufficient and necessary conditions to ensure (i) and (ii) hold. In particular we obtain the first KSDs known to exactly metrize weak convergence to P. We highlight the implications of our results for hypothesis testing, measuring and improving sample quality, and sampling with Stein variational gradient descent.

Wednesday 28 June

10:00–10:30	☕ Morning coffee
10:30–11:15	Using Stein characterisations of network models for goodness-of-fit and data generation. Gesine Reinert (University of Oxford) [video] Synthetic data are increasingly used in statistics and machine learning. Graphs, which represent complex dependence structures, are a particularly challenging type of data. Hence methods for generating synthetic data, and for assessing their quality, are much in demand. Distributions of random networks can be characterised using Stein’s method. This talk details how these characterisations can be put to use for assessing goodness of fit and also how to generate synthetic data.
11:15–12:00	Make Stein’s method great again for generative modelling? Yingzhen Li (Imperial College London) [video] My original motivation for studying Stein’s method for ML was to better train deep generative models. Indeed, methods based on Stein discrepancy (a clever way to address the intractability issues for score-matching) had initial success in training generative models a few years ago. But since late 2019, Fisher divergence & denoising methods for score-matching have taken off for large-scale deep generative models, leading to what we know as score-based generative models (and diffusion models) today in the wave of Generative AI. So in this talk I’m going to speculate on what needs to be addressed if we want to make Stein’s method great again for training score-based generative models. It’s possible that Stein discrepancy is less well suited for this purpose, but at least we should try to understand why.
12:00–13:30	🥗 Lunch
13:30–13:50	Robust and Scalable Bayesian Online Changepoint Detection. Matías Altamirano (University College London) [video] We propose an online, provably robust, and scalable Bayesian approach for changepoint detection. The resulting algorithm has key advantages over previous work: it provides provable robustness by leveraging the generalised Bayesian perspective and also addresses the scalability issues of previous attempts. Specifically, the proposed generalised Bayesian formalism leads to conjugate posteriors whose parameters are available in closed form by leveraging diffusion score matching. The resulting algorithm is exact, can be updated through simple algebra, and is more than 10 times faster than its closest competitor.
13:50–14:10	Stein Pi-Importance Sampling. Wilson Chen (University of Sydney) [video] Stein discrepancies have emerged as a powerful tool for retrospective improvement of Markov chain Monte Carlo output. However, the question of how to design Markov chains that are well-suited to such post-processing has yet to be addressed. This work studies Stein importance sampling, in which weights are assigned to the states visited by a Pi-invariant Markov chain to obtain a consistent approximation of P, the intended target. Surprisingly, the optimal choice of Pi is not identical to the target P; we therefore propose an explicit construction for Pi based on a novel variational argument. Explicit conditions for convergence of Stein Pi-Importance Sampling are established. For 70% of tasks in the PosteriorDB benchmark, a significant improvement over the analogous post-processing of P-invariant Markov chains is reported.
14:10–14:30	Stein’s Method for Erdös-Rényi Mixture Graph Models. Anum Fatima (University of Oxford) [video] In many real world networks the vertices are heterogeneous and the edge probabilities differ between pairs of vertices within the network. The Erdös-Rényi Mixture Graph (ERMG) models accommodate such heterogeneity by assuming that the vertices appear in blocks and the edge probabilities within each block and across the blocks depend on the block membership of the vertices involved. In an ERMG we assume that block membership and the edge probabilities, within and across blocks, are known for all vertices and for all blocks. In this work we derive a Stein equation for an ERMG model and obtain a bound on the solution of our Stein equation. Using our Stein equation and the bound, we bound a distance between an ERMG model and an Erdös-Rényi model, an Exponential Random Graph Model (ERGM) and other Erdös-Rényi Mixture Graph models. For an application of our bounds, we use Political blog data from Adamic & Glance (2004) and Florentine Marriage data to give numerical bounds. We are further exploring a goodness-of-fit test for the ERMG model using a kernel Stein discrepancy, which is a Monte Carlo test based on the simulated networks. Our test statistics are derived using a divergence constructed using Stein’s method and a discrete Stein operator for the ERMG model taking a reproducing kernel Hilbert space as our Stein class. Xu & Reinert (2021) used this technique to develop a goodness-of-fit test for ERGM.
14:30–14:50	Nonnegative Matrix Factorization in Wasserstein distances for source separation. Andersen Ang (University of Southampton) We consider single-channel (sc) audio blind source separation (ABSS) via nonnegative matrix factorization (NMF), i.e., given a single measurement of a mixed audio recording of multiple musical instruments, we wish to obtain the audio track of each individual instrument using NMF. The sc-ABSS problem and its solution via NMF are not new, but existing works assume the data is "nice": the data is in a Euclidean space and the NMF method can be used. In this talk we focus on a more practical situation in which, due to measurement error, the recording exhibits a misalignment in the frequency spectrum that makes traditional NMF fail. We address this problem using a 1-dimensional Wasserstein transformation, casting the resulting model as a nonconvex non-smooth non-proximable optimization problem, which we solve using block coordinate descent with proximal averaging.
14:50–15:15	☕ Afternoon coffee
15:15–16:00	Merging Rates of Opinions via Optimal Transport on Random Measures. Marta Catalano (University of Warwick) The Bayesian approach to inference is based on a coherent probabilistic framework that naturally leads to principled uncertainty quantification and prediction. Via conditional (or posterior) distributions, Bayesian nonparametric models make inference on parameters belonging to infinite-dimensional spaces, such as the space of probability distributions. The development of Bayesian nonparametrics has been triggered by the Dirichlet process, a nonparametric prior that allows one to learn the law of the observations through closed-form expressions. Still, its learning mechanism is often too simplistic and many generalizations have been proposed to increase its flexibility, a popular one being the class of normalized completely random measures. Here we investigate a simple yet fundamental matter: will a different prior actually guarantee a different learning outcome? To this end, we develop a new distance between completely random measures based on optimal transport, which provides an original framework for quantifying the similarity between posterior distributions (or merging of opinions). Our findings provide neat and interpretable insights on the impact of popular Bayesian nonparametric priors, avoiding the usual restrictive assumptions on the data-generating process.
16:00–16:45	Neural signature kernels as infinite-width limits of neural controlled differential equations. Cristopher Salvi (Imperial College London) Motivated by the paradigm of reservoir computing, I will consider randomly initialized neural controlled differential equations and show that in the infinite-width limit and under proper rescaling of the vector fields, these neural architectures converge weakly to Gaussian processes indexed on path-space and with covariances satisfying certain PDEs varying according to the choice of activation function. In the special case where the activation function is the identity, the equation reduces to a linear PDE and the limiting kernel agrees with the original signature kernel.