Andrew M. Stuart FRS (Caltech): Processing Language, Images and Other Data Modalities.
Abstract: A fundamental problem in artificial intelligence is how to simultaneously deploy data from different sources — such as audio, images, text, and video — collectively known as multimodal data. In this talk, I will present a mathematical framework for studying this question, focusing primarily on text and images.
I will begin by describing how large language models (LLMs) operate, addressing the challenging issue of using real-number algorithms to process language. In particular, I will explain next-token prediction — the core of current LLM methodology. I will then focus on the canonical problem of measuring alignment between image and text data (contrastive learning). Finally, I will describe how images can be generated from text prompts (conditional generative modeling).
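As a point of reference (a standard formulation, not necessarily the one used in the talk), next-token prediction trains a model p_theta by maximizing the autoregressive log-likelihood of observed token sequences,

\[
\log p_\theta(x_1, \dots, x_T) \;=\; \sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_1, \dots, x_{t-1}\right),
\]

which, up to a model-independent constant, amounts to minimizing the Kullback-Leibler divergence from the data distribution to the model distribution.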
From a mathematical perspective, a unifying theme underlying this work is the minimization of divergences defined on spaces of probability measures. A second key mathematical idea is the attention mechanism — a form of nonlinear correlation between vector-valued sequences.
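For orientation, the most common scaled dot-product form of attention (which may differ in detail from the formulation presented in the talk) maps query, key, and value matrices Q, K, V, whose rows are vectors built from the input sequences, to

\[
\mathrm{Attention}(Q, K, V) \;=\; \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\]

where d_k is the key dimension; the row-wise softmax weights supply the nonlinear, data-dependent correlation between the two vector-valued sequences mentioned above.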
I aim to explain these concepts — and their relevance to modern machine learning algorithms — in a broadly accessible fashion, suitable for a colloquium audience.
The talk will be followed by a reception in the Huxley Common Room.