Distance-based methods represent a varied and extensively used set of techniques for performing statistical learning by minimising the distance or discrepancy between probability distributions. One key advantage of distance-based techniques is that the resulting model's properties are dependent on the underlying distance selected. Crafting distances that encode desirable properties, such as stability and robustness, is a promising area of research.
The workshop will cover a broad range of statistical and machine learning methods, including but not limited to:
 parameter estimation,
 generalised Bayes,
 hypothesis testing,
 optimal transport,
8 May 2023  
1 June 2023  
9 June 2023  
23 June 2023  
27-28 June 2023
Programme
Tuesday 27 June
10:00-10:30  Registration and morning coffee  
10:30-10:45  Welcome from the organisers  
10:45-11:30  Scaling up kernel-based structured output prediction with low-rank approaches.
Florence d'Alché-Buc (Télécom Paris, Institut Polytechnique de Paris)
Surrogate regression methods offer a powerful and flexible solution to structured output prediction by embedding the output in a Hilbert space. In this talk we mainly focus on one of the simplest and oldest surrogate approaches, which leverages kernels in the input space as well as in the output space. While enjoying strong statistical guarantees, these surrogate kernel methods require substantial computation, for training as well as for inference. We propose to revisit them by applying low-rank projections in the input and output feature spaces to reduce the complexity. Low-rank projection operators based on sketching are presented and the statistical properties of the resulting novel estimator are studied in terms of excess risk bounds.
From a computational perspective, we show that the two approximations have distinct but complementary impacts: sketching the input kernel mostly reduces training time, while sketching the output kernel decreases the inference time. In conclusion, we identify other surrogate models based on other losses/distances where this approach could be relevant as well.


11:30-12:15  Generalised Bayesian Inference for Intractable Likelihoods.
Takuo Matsubara (The Alan Turing Institute & Newcastle University)
Generalised Bayesian inference updates prior beliefs using a loss function, rather than a likelihood, and can therefore be used to confer robustness against possible misspecification of the likelihood. Here we consider generalised Bayesian inference with a Stein discrepancy as a loss function, motivated by applications in which the likelihood contains an intractable normalisation constant. In this context, the Stein discrepancy circumvents evaluation of the normalisation constant and produces generalised posteriors that are either closed form or accessible using standard Markov chain Monte Carlo. On a theoretical level, we show consistency, asymptotic normality, and bias-robustness of the generalised posterior, highlighting how these properties are impacted by the choice of Stein discrepancy. Then, we provide numerical experiments on a range of intractable distributions, including applications to kernel-based exponential family models and non-Gaussian graphical models.


12:15-13:30  Lunch  
13:30-14:15  Label Shift Quantification via Distribution Feature Matching.
Badr-Eddine Chérief-Abdellatif (Sorbonne Université & Université Paris Cité)
Quantification learning deals with the task of estimating the target label distribution under label shift. In this talk, we present a unifying framework, distribution feature matching (DFM), that recovers as particular instances various estimators introduced in previous literature. We derive a general performance bound for DFM procedures and extend this analysis to study robustness of DFM procedures in the misspecified setting under departure from the exact label shift hypothesis, in particular in the case of contamination of the target by an unknown distribution. We also illustrate the theoretical results with a numerical study.


14:15-14:35  Robust Empirical Bayes for Gaussian Processes.
Masha Naslidnyk (University College London)
For many contemporary statistical machine learning problems, model misspecification is pervasive and impactful. In particular, inferences are not robust, and uncertainty quantification becomes brittle. These issues particularly affect nonparametric models like Gaussian processes, and are exacerbated by the paradigm shift in Bayesian inference from bespoke models and small data regimes to black-box settings and increasingly large and under-curated datasets. While distance-based estimation is a powerful remedy for this setting, previously proposed distances between conditional distributions are intractable. To resolve this, we introduce a computationally tractable distance on the space of conditional probability distributions, which we call the expected maximum conditional mean discrepancy. The theoretical properties of the resulting distance-based estimator are investigated in detail. While the estimator is of general interest, we focus on its application as a robust empirical Bayes estimator in Gaussian process models. Specifically, we demonstrate that it produces reliable uncertainty quantification for regression problems, computer model emulation, and Bayesian optimisation.


14:35-14:55  Composite Goodness-of-Fit Tests with Kernels.
Oscar Key (University College London)
Model misspecification can create significant challenges for the implementation of probabilistic models, and this has led to the development of a range of robust methods which directly account for this issue. However, whether these more involved methods are required will depend on whether the model is really misspecified, and there is a lack of generally applicable methods to answer this question. In this talk I will discuss how this question can be answered using composite goodness-of-fit tests, which check whether the data comes from any distribution in some parametric family. I will introduce our two kernel-based implementations of this test, based on the maximum mean discrepancy and kernel Stein discrepancy. I will preview our main result from the paper: that we are able to both estimate the parameter and conduct the test on the same data, without splitting, while maintaining a correct test level.


14:55-15:15  MMD-FUSE: Learning and Combining Kernels for Two-Sample Testing Without Data Splitting.
Antonin Schrab (University College London)
We propose novel statistics which maximise the power of a two-sample test based on the Maximum Mean Discrepancy (MMD), by adapting over the set of kernels used in defining it. For finite sets, this reduces to combining (normalised) MMD values under each of these kernels via a weighted soft maximum. Exponential concentration bounds are proved for our proposed statistics under the null and alternative. We further show how these kernels can be chosen in a data-dependent but permutation-independent way, in a well-calibrated test, avoiding data splitting. This technique applies more broadly to general permutation-based MMD testing, and includes the use of deep kernels with features learnt using unsupervised models such as auto-encoders. We highlight the applicability of our MMD-FUSE test on both synthetic low-dimensional and real-world high-dimensional data, and compare its performance in terms of power against current state-of-the-art kernel tests.
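
As a rough illustration of the ingredients mentioned in this abstract, the sketch below computes an unbiased estimate of the squared MMD for several Gaussian kernel bandwidths and combines the values with a weighted soft maximum (log-sum-exp). It is a minimal sketch under stated assumptions, not the MMD-FUSE statistic itself: the function names, the one-dimensional data, the bandwidths, the uniform weights and the temperature are all hypothetical choices, and the normalisation of each MMD value and the permutation-based calibration described in the talk are omitted.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel on the real line (illustrative choice of kernel)."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd2_unbiased(xs, ys, bandwidth):
    """Unbiased estimate of the squared MMD between 1-D samples xs and ys."""
    m, n = len(xs), len(ys)
    # Within-sample terms exclude the diagonal (i == j) to remove bias.
    k_xx = sum(gaussian_kernel(xs[i], xs[j], bandwidth)
               for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    k_yy = sum(gaussian_kernel(ys[i], ys[j], bandwidth)
               for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    k_xy = sum(gaussian_kernel(x, y, bandwidth)
               for x in xs for y in ys) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy


def soft_maximum(values, weights, temperature):
    """Weighted soft maximum: (1/t) * log(sum_k w_k * exp(t * v_k))."""
    peak = max(values)  # subtract the peak for numerical stability
    total = sum(w * math.exp(temperature * (v - peak))
                for v, w in zip(values, weights))
    return peak + math.log(total) / temperature


def combined_statistic(xs, ys, bandwidths, temperature):
    """Soft maximum of MMD^2 estimates over a finite set of bandwidths.

    Assumption: uniform weights and no per-kernel normalisation.
    """
    values = [mmd2_unbiased(xs, ys, b) for b in bandwidths]
    weights = [1.0 / len(bandwidths)] * len(bandwidths)
    return soft_maximum(values, weights, temperature)


if __name__ == "__main__":
    random.seed(0)
    sample_p = [random.gauss(0.0, 1.0) for _ in range(50)]
    sample_q = [random.gauss(1.0, 1.0) for _ in range(50)]
    print(combined_statistic(sample_p, sample_q, [0.5, 1.0, 2.0], 10.0))
```

In a test, a statistic like this would be compared against its distribution under permutations of the pooled sample; that calibration step is deliberately left out here.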


15:15-15:45  Afternoon coffee  
15:45-16:15  Poster session  
16:15-17:00  Variational Gradient Descent using Local Linear Models.
Song Liu (University of Bristol)
Stein Variational Gradient Descent (SVGD) can transport particles along trajectories that reduce the KL divergence between the target and particle distribution but requires the target score function to compute the update. We introduce a new perspective on SVGD that views it as a local estimator of the reversed KL gradient flow. This perspective inspires us to propose new estimators that use local linear models to achieve the same purpose. The proposed estimators can be computed using only samples from the target and particle distribution without needing the target score function. Our proposed variational gradient estimators utilize local linear models, resulting in computational simplicity while maintaining effectiveness comparable to SVGD in terms of estimation biases. Additionally, we demonstrate that under a mild assumption, the estimation of high-dimensional gradient flow can be translated into a lower-dimensional estimation problem, leading to improved estimation accuracy. We showcase our proposed estimator by transporting non-smiling images in the CelebA dataset to mimic the distribution of smiling images. The resulting algorithm (dubbed SmileVGD) shows promising performance when transporting images.


17:00-17:45  Convergence control with kernel Stein discrepancies.
Alessandro Barp (University of Cambridge)
Maximum mean discrepancies (MMDs) like the kernel Stein discrepancy (KSD) have grown central to a wide range of applications, including hypothesis testing, sampler selection, distribution approximation, and variational inference. In each setting, these kernel-based discrepancy measures are required to (i) separate a target P from other probability measures or even (ii) control weak convergence to P. In this talk we discuss the geometry of KSDs and derive sufficient and necessary conditions to ensure (i) and (ii) hold. In particular we obtain the first KSDs known to exactly metrize weak convergence to P. We highlight the implications of our results for hypothesis testing, measuring and improving sample quality, and sampling with Stein variational gradient descent.

Wednesday 28 June
10:00-10:30  Morning coffee
10:30-11:15  Using Stein characterisations of network models for goodness-of-fit and data generation.
Gesine Reinert (University of Oxford)
Synthetic data are increasingly used in statistics and machine learning. Graphs, which represent complex dependence structures, are a particularly challenging type of data. Hence methods for generating synthetic data, and for assessing their quality, are much in demand. Distributions of random networks can be characterised using Stein's method. This talk details how these characterisations can be put to use for assessing goodness of fit and also how to generate synthetic data.

11:15-12:00  Make Stein's method great again for generative modelling?
Yingzhen Li (Imperial College London)
My original motivation for studying Stein's method for ML was to better train deep generative models. Indeed, methods based on Stein discrepancy, a clever way to address the intractability issues of score matching, had initial success in training generative models a few years ago. But since late 2019, Fisher divergence and denoising methods for score matching have taken off for large-scale deep generative models, leading to what we know today as score-based generative models (and diffusion models) in the wave of Generative AI. So in this talk I'm going to speculate on what needs to be addressed if we want to make Stein's method great again for training score-based generative models. It's possible that Stein discrepancy is less well suited for this purpose, but at least we should try to understand why.

12:00-13:30  Lunch
13:30-13:50  Robust and Scalable Bayesian Online Changepoint Detection.
Matías Altamirano (University College London)
We propose an online, provably robust, and scalable Bayesian approach for changepoint detection. The resulting algorithm has key advantages over previous work: it provides provable robustness by leveraging the generalised Bayesian perspective and also addresses the scalability issues of previous attempts. Specifically, the proposed generalised Bayesian formalism leads to conjugate posteriors whose parameters are available in closed form by leveraging diffusion score matching. The resulting algorithm is exact, can be updated through simple algebra, and is more than 10 times faster than its closest competitor.

13:50-14:10  Stein Pi-Importance Sampling.
Wilson Chen (University of Sydney)
Stein discrepancies have emerged as a powerful tool for retrospective improvement of Markov chain Monte Carlo output. However, the question of how to design Markov chains that are well-suited to such post-processing has yet to be addressed. This work studies Stein importance sampling, in which weights are assigned to the states visited by a Pi-invariant Markov chain to obtain a consistent approximation of P, the intended target.
Surprisingly, the optimal choice of Pi is not identical to the target P; we therefore propose an explicit construction for Pi based on a novel variational argument. Explicit conditions for convergence of Stein Pi-Importance Sampling are established. For 70% of tasks in the PosteriorDB benchmark, a significant improvement over the analogous post-processing of P-invariant Markov chains is reported.

14:10-14:30  Stein's Method for Erdös-Rényi Mixture Graph Models.
Anum Fatima (University of Oxford)
In many real world networks the vertices are heterogeneous and the edge probabilities differ between pairs of vertices within the network. The Erdös-Rényi Mixture Graph (ERMG) models accommodate such heterogeneity by assuming that the vertices appear in blocks and the edge probabilities within each block and across the blocks depend on the block membership of the vertices involved. In an ERMG we assume that block membership and the edge probabilities, within and across blocks, are known for all vertices and for all blocks.
In this work we derive a Stein equation for an ERMG model and obtain a bound on the solution of our Stein equation. Using our Stein equation and the bound, we bound a distance between an ERMG model and an Erdös-Rényi model, an Exponential Random Graph Model (ERGM) and other Erdös-Rényi Mixture Graph models. For an application of our bounds, we use political blog data from Adamic & Glance (2004) and Florentine Marriage data to give numerical bounds.
We are further exploring a goodness-of-fit test for the ERMG model using a kernel Stein discrepancy, which is a Monte Carlo test based on the simulated networks. Our test statistics are derived using a divergence constructed using Stein's method and a discrete Stein operator for the ERMG model taking a reproducing kernel Hilbert space as our Stein class. Xu & Reinert (2021) used this technique to develop a goodness-of-fit test for ERGM.

14:30-14:50  Nonnegative Matrix Factorization in Wasserstein distances for source separation.
Andersen Ang (University of Southampton)
We consider single-channel (sc) audio blind source separation (ABSS) via nonnegative matrix factorization (NMF): given a single measurement of a mixed audio recording of multiple musical instruments, we wish to obtain the audio track of each individual instrument using NMF. The sc-ABSS problem and its solution via NMF are not new, but existing works assume the data is "nice": the data lies in a Euclidean space and the NMF method can be used. In this talk we focus on a more practical situation in which, due to measurement error, the recording exhibits a misalignment in the frequency spectrum that makes traditional NMF fail. We address this problem using a 1-dimensional Wasserstein transformation, casting the resulting model as a nonconvex, nonsmooth, nonproximable optimization problem and solving it using block coordinate descent with proximal averaging.

14:50-15:15  Afternoon coffee
15:15-16:00  Merging Rates of Opinions via Optimal Transport on Random Measures.
Marta Catalano (University of Warwick)
The Bayesian approach to inference is based on a coherent probabilistic framework that naturally leads to principled uncertainty quantification and prediction. Via conditional (or posterior) distributions, Bayesian nonparametric models make inference on parameters belonging to infinite-dimensional spaces, such as the space of probability distributions. The development of Bayesian nonparametrics has been triggered by the Dirichlet process, a nonparametric prior that allows one to learn the law of the observations through closed-form expressions. Still, its learning mechanism is often too simplistic and many generalizations have been proposed to increase its flexibility, a popular one being the class of normalized completely random measures. Here we investigate a simple yet fundamental matter: will a different prior actually guarantee a different learning outcome? To this end, we develop a new distance between completely random measures based on optimal transport, which provides an original framework for quantifying the similarity between posterior distributions (or merging of opinions). Our findings provide neat and interpretable insights on the impact of popular Bayesian nonparametric priors, avoiding the usual restrictive assumptions on the data-generating process.

16:00-16:45  Neural signature kernels as infinite-width limits of neural controlled differential equations.
Cristopher Salvi (Imperial College London)
Motivated by the paradigm of reservoir computing, I will consider randomly initialized neural controlled differential equations and show that in the infinite-width limit and under proper rescaling of the vector fields, these neural architectures converge weakly to Gaussian processes indexed on path-space and with covariances satisfying certain PDEs varying according to the choice of activation function. In the special case where the activation function is the identity, the equation reduces to a linear PDE and the limiting kernel agrees with the original signature kernel.

Invited speakers

Alessandro Barp
University of Cambridge

Badr-Eddine Chérief-Abdellatif
Sorbonne Université & Université Paris Cité

Cristopher Salvi
Imperial College London

Florence d'Alché-Buc
Télécom Paris, Institut Polytechnique de Paris

Gesine Reinert
University of Oxford

Marta Catalano
University of Warwick

Song Liu
University of Bristol

Takuo Matsubara
The Alan Turing Institute & Newcastle University

Yingzhen Li
Imperial College London
Organisers

Masha Naslidnyk
Co-organiser

François-Xavier Briol
Co-organiser

Oscar Key
Tech Officer

Matías Altamirano
Logistics Officer

Ilina Yozova
Venue Officer
We are very grateful for funding from the UCL Fellowship Incubator Fund, the UCL Department of Statistical Science, the UCL Institute for Mathematical and Statistical Sciences (IMSS), and UKRI CDT in Foundational AI funded by the Engineering and Physical Sciences Research Council [EP/S021566/1].
Location
LG26 Lecture Room, Bentham House, 48 Endsleigh Gardens, London, WC1H 0EG.