Topic 1: Information-Theoretic Foundations of Learning Algorithms
Wednesday 19 May Schedule
13:45 Opening of the Symposium
Information-theoretic foundations of learning algorithms
- 17:30: Keynote talk by Andrea Montanari (Stanford University) (Please note the change in the time of the keynote talk)
- Title: The generalization behavior of random feature and neural tangent models
Modern machine learning methods --most noticeably multi-layer neural networks-- require to fit highly non-linear models comprising tens of thousands to millions of parameters. However, little attention is paid to the regularization mechanism to control model's complexity and the resulting models are often so complex as to achieve vanishing training error. Despite this, these models generalize well to unseen data: they have small test error. I will discuss several examples of this phenomenon, leading to two-layer neural networks in the so-called lazy regime. For these examples precise asymptotics could be determined mathematically, using tools from random matrix theory, and a unifying picture is emerging. A common feature is the fact that a complex unregularized nonlinear model becomes essentially equivalent to a simpler model, which is however regularized in a non-trivial way. I will discuss the limitations of this linear theory and recent work aimed at learning richer function classes using random feature methods.
[Based on joint papers with: Michael Celentano, Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, Feng Ruan, Youngtak Sohn, Jun Yan, Yiqiao Zhong]
- 14:00: Invited talk by Emmanuel Abbe (EPFL)
- Title: Fundamental limits of differentiable learning
Deep learning has put forward the power of training large differentiable models (neural networks) with gradient-based optimization. Can we quantify the power of such a general differentiable learning framework? Is it as powerful as PAC learning, or statistical query learning? Or is it not much more powerful than kernel learning? We will revise these models, formalize the questions and give various elements of answers. Based on joint works with Colin Sandon, Pritish Kamath, Eran Malach and Nati Srebro.
- 14:20: Invited talk by Max Raginsky (University of Illinois at Urbana-Champaign)
- Title: On some Information-Theoretic Aspects of Generative Adversarial Models
The term ‘probabilistic generative model’ refers to any process by which a sample from a target probability measure on some high-dimensional space is produced by applying a deterministic transformation to a sample from a fixed probability measure on some latent space. Generative adversarial models have been proposed recently as a methodology for learning this transformation (referred to as the generator) on the basis of samples from the target measure, where the goodness of the generator is assessed by another deterministic transformation (the discriminator). Both the generator and the discriminator are learned jointly, in the min-max fashion, where the generator attempts to minimize some empirical measure of closeness between the target and the model, while the discriminator attempts to optimally distinguish between the two. In this talk, I will show that one can examine the capabilities of such models through an information-theoretic lens: Consider a binary-input channel, where the transmission of 0 produces a sample from the target measure, while the transmission of 1 produces a sample from the generator, and the discriminator acts as a decoder. One can then characterize the capabilities of the generator using information-theoretic converses, while the performance of the discriminator can be quantified using achievability arguments. I will present several illustrations of the utility of this approach in the context of generative adversarial nets (GANs).
- 14:40: Invited talk by Giuseppe Durisi (Chalmers University of Technology)
- Title: Fast-Rate Loss Bounds via Conditional Information Measures with Applications to Neural Networks
We present a framework to derive bounds on the test loss of randomized learning algorithms for the case of bounded loss functions. Drawing from Steinke & Zakynthinou (2020), this framework leads to bounds that depend on the conditional information density between the the output hypothesis and the choice of the training set, given a larger set of data samples from which the training set is formed. Furthermore, the bounds pertain to the average test loss as well as to its tail probability, both for the PAC-Bayesian and the single-draw settings. If the conditional information density is bounded uniformly in the size n of the training set, our bounds decay as 1/n. This is in contrast with the tail bounds involving conditional information measures available in the literature, which have a less benign dependence on n. We demonstrate the usefulness of our tail bounds by showing that they lead to nonvacuous estimates of the test loss achievable with some neural network architectures trained on MNIST and Fashion-MNIST.
- 15:00: Invited talk by Gabor Lugosi (Universitat Pompeu Fabra)
- Title: Learning the structure of graphical models by covariance queries
The dependence structure of high-dimensional distributions is often modeled by graphical models. The problem of learning the graph underlying such distributions has received a lot of attention in statistics and machine learning. In problems of very high dimension, it is often too costly even to store the sample covariance matrix. We propose a new model in which one can query single entries of the covariance matrix. We propose computationally efficient algorithms for structure recovery in Gaussian graphical models with computational complexity that is quasi-linear in the dimension. We present algorithms that work for trees and, more generally, for graphs of small treewidth. The talk is based on joint work with Jakub Truszkowski, Vasiliki Velona, and Piotr Zwiernik.
- 15:30-16:00: Break
16:00: Invited Session on Information-Theoretic Measures in Learning
Organized by: Lav Varshney (University of Illinois at Urbana-Champaign)
- 16:00: Invited talk by Ayfer Özgür (Stanford University)
- Title: Information Constrained Optimal Transport: From Relay to Generative Adversarial Networks
In this talk, we will discuss the notion of information constrained Wasserstein distance. We will present an upper bound on this quantity that corresponds to a generalization and strengthening of a celebrated inequality by Talagrand and show how it leads to the solution of a problem on the relay channel posed by Cover in 1980's. We will then discuss how the information constraint can be used as a regularizer for training Wasserstein GANs. We will characterize the impact of the regularization on the optimal solution in a benchmark setting and show that it improves the quality of the generator learned from empirical data by removing the curse of dimensionality.
- 16:20: Invited talk by Haizi Yu (University of Chicago)
- Title: Information Lattice Learning: To Learn Rules and Concepts Like What We Naturally Do
Can machines learn the laws of music theory directly from raw sheet music, and further in the same human-interpretable form as in a textbook? How little prior knowledge and how little data is needed, in order to reproduce what we know, reinterpret what we know, and even discover what we don’t? We address questions like these by developing a general learning framework, called Information Lattice Learning (ILL), to learn rules and concepts akin to those humans distill, typically through abstractions from only a few examples. The key idea is manifest in our method of computational abstraction—a group-theoretic and information-theoretic way of building lattices of partitions and lattices of information to mimic human abstraction processes and hierarchies. This idea generalizes Shannon’s lattice theory of information discussed at the first London Symposium on Information Theory (LSIT, 1950) and further brings statistical learning into the lattice. Compared to black-box and big-data-driven models, ILL focuses on explainability and generalizability from “small data”: it addresses the core question “what makes X an X” to create effective, rule-based explanations designed to help humans understand and learn. Besides achieving human-interpretable rules, we make the learning process itself mimic a human learner, manifest in ILL’s two phases—lattice construction and learning. The former is prior-driven and prior-efficient by building in only universal priors that resemble human innate cognition (e.g., those from Core Knowledge). The latter is data-driven and data-efficient by requiring only “small data” where abstraction is key to “make small data large”. The whole process echoes the idea of “starting like a baby and learning like a child”, threading the needle between a full rule-based model and a full data-driven model. We illustrate ILL applications by learning music theory from scores, chemical laws from molecular databases, and demonstrate how ILL can further discover mutual interpretations bridging knowledge from contrasting topic domains (e.g., casting music theory as chemical laws and vice versa). We also include initial benchmarks and assessments for ILL, where we show how ILL reproduces known knowledge, how interpretable it is, as well as new knowledge discovery.
- 16:40: Invited talk by Huan Wang (Salesforce Research)
- Title: An Efficient Bayes Estimator for Classification Task Hardness Evaluation
Evaluating the internal hardness of data sets and learning tasks across domains regardless of models is important for measuring model performance and the progress of the field. We propose an unbiased estimator for the Bayes error as a natural measure of the hardness of the learning task. For any given data set and learning task, we first find a closest approximation of the data distribution based on the normalizing flow technique. We prove the Bayes error stays invariant under the invertible map introduced by the normalizing flow. On the approximated distribution, the Bayes error can be calculated exactly and efficiently using Holmes-Diaconis-Ross based methods. We evaluate our estimator on both simulated and real-world data sets and show the estimated Bayes errors align well with the performance of state-of-the-art AI models.
- 17:00: Invited talk by Rishabh Iyer (University of Texas at Dallas)
- Title: Submodular Information Measures with Applications to Targeted Data Subset Selection and Data Summarization
Information-theoretic quantities like entropy and mutual information have found numerous uses in machine learning. It is well known that there is a strong connection between these entropic quantities and submodularity since entropy over a set of random variables is submodular. In this talk, we will study combinatorial information measures that generalize independence, (conditional) entropy, (conditional) mutual information, and total correlation defined over sets of (not necessarily random) variables. These measures strictly generalize the corresponding entropic measures since they are all parameterized via submodular functions that strictly generalize entropy. We study specific instantiations of the submodular information measures (for a number of submodular functions like set cover, facility location, graph cuts, log determinants etc.), and study properties of these measures such as submodularity, monotonicity, upper/lower bounds etc.
With increasing data, techniques for finding smaller yet effective subsets with specific characteristics become important. We demonstrate the utility of the above submodular information measures on two concrete applications: a) improve a supervised model's performance at a given additional labeling cost by targeted subset selection where a subset of unlabeled points matching a target set are added to the training set, and b) a more nuanced targeted summarization where data (e.g., image collections, text or videos) is summarized for quicker human consumption with additional user intent such as queries, privacy constraints, irrelevance and updates. We show that a number of previously studied approaches for both data summarization and data subset selection with targets (e.g. query sets, targeted slices or privacy constraints) are in fact special cases of our general framework, and through extensive experiments on image classification and image summarization, we empirically demonstrate the utility of our methods. Concretely, we see that using the submodular mutual information functions, we observe approximately 30% gain over the model's performance before re-training with added targeted subset, and 12% gain compared to other existing approaches for this problem.
This is based on joint work with Vishal Kaushal, Suraj Kothawade, Himanshu Asnani, Jeff Bilmes, and Ganesh Ramakrishnan.