Publications
This is a list of the publications I have been involved in, listed in reverse chronological order.
2024
- A theory of appropriateness with applications to generative artificial intelligence. Joel Z Leibo, Alexander Sasha Vezhnevets, Manfred Diaz, John P Agapiou, and 10 more authors. arXiv [cs.AI], Dec 2024.
What is appropriateness? Humans navigate a multi-scale mosaic of interlocking notions of what is appropriate for different situations. We act one way with our friends, another with our family, and yet another in the office. Likewise for AI, appropriate behavior for a comedy-writing assistant is not the same as appropriate behavior for a customer-service representative. What determines which actions are appropriate in which contexts? And what causes these standards to change over time? Since all judgments of AI appropriateness are ultimately made by humans, we need to understand how appropriateness guides human decision making in order to properly evaluate AI decision making and improve it. This paper presents a theory of appropriateness: how it functions in human society, how it may be implemented in the brain, and what it means for responsible deployment of generative AI technology.
- Rethinking Teacher-Student Curriculum Learning through the Cooperative Mechanics of Experience. Manfred Diaz, Liam Paull, and Andrea Tacchetti. Transactions on Machine Learning Research (TMLR), Dec 2024.
Teacher-Student Curriculum Learning (TSCL) is a curriculum learning framework that draws inspiration from human cultural transmission and learning. It involves a teacher algorithm shaping the learning process of a learner algorithm by exposing it to controlled experiences. Despite its success, understanding the conditions under which TSCL is effective remains challenging. In this paper, we propose a data-centric perspective to analyze the underlying mechanics of the teacher-student interactions in TSCL. We leverage cooperative game theory to describe how the composition of the set of experiences presented by the teacher to the learner, as well as their order, influences the performance of the curriculum that is found by TSCL approaches. To do so, we demonstrate that for every TSCL problem, an equivalent cooperative game exists, and several key components of the TSCL framework can be reinterpreted using game-theoretic principles. Through experiments covering supervised learning, reinforcement learning, and classical games, we estimate the cooperative values of experiences and use value-proportional curriculum mechanisms to construct curricula, even in cases where TSCL struggles. The framework and experimental setup we present in this work represent a novel foundation for a deeper exploration of TSCL, shedding light on its underlying mechanisms and providing insights into its broader applicability in machine learning.
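As a rough, editorial illustration of the cooperative-game view described above, the sketch below estimates a per-experience value with a Monte Carlo permutation approximation of the Shapley value and orders experiences by that value; the `train_and_evaluate` callback and the value-proportional ordering are illustrative assumptions, not the paper's exact mechanism.

```python
import random

def shapley_values(experiences, train_and_evaluate, num_permutations=200, seed=0):
    """Monte Carlo estimate of each experience's cooperative value.

    `train_and_evaluate(subset)` is a user-supplied stand-in for the learner
    loop: it should return the learner's performance after training on
    `subset` (a tuple of experiences).
    """
    rng = random.Random(seed)
    values = {e: 0.0 for e in experiences}
    for _ in range(num_permutations):
        order = list(experiences)
        rng.shuffle(order)
        prefix, prev_score = [], train_and_evaluate(tuple())
        for e in order:
            prefix.append(e)
            score = train_and_evaluate(tuple(prefix))
            values[e] += score - prev_score  # marginal contribution of e
            prev_score = score
    return {e: v / num_permutations for e, v in values.items()}

def value_proportional_curriculum(values):
    """Order experiences from highest to lowest estimated value."""
    return sorted(values, key=values.get, reverse=True)
```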
- Soft Condorcet Optimization for Ranking of General Agents. Marc Lanctot, Kate Larson, Michael Kaisers, Quentin Berthet, and 6 more authors. arXiv [cs.MA], Oct 2024.
A common way to drive progress of AI models and agents is to compare their performance on standardized benchmarks. Comparing the performance of general agents requires aggregating their individual performances across a potentially wide variety of different tasks. In this paper, we describe a novel ranking scheme inspired by social choice frameworks, called Soft Condorcet Optimization (SCO), to compute the optimal ranking of agents: the one that makes the fewest mistakes in predicting the agent comparisons in the evaluation data. This optimal ranking is the maximum likelihood estimate when evaluation data (which we view as votes) are interpreted as noisy samples from a ground truth ranking, a solution to Condorcet’s original voting system criteria. SCO ratings are maximal for Condorcet winners when they exist, which we show is not necessarily true for the classical rating system Elo. We propose three optimization algorithms to compute SCO ratings and evaluate their empirical performance. When serving as an approximation to the Kemeny-Young voting method, SCO rankings are on average 0 to 0.043 away from the optimal ranking in normalized Kendall-tau distance across 865 preference profiles from the PrefLib open ranking archive. In a simulated noisy tournament setting, SCO achieves accurate approximations to the ground truth ranking and the best among several baselines when 59% or more of the preference data is missing. Finally, SCO ranking provides the best approximation to the optimal ranking, measured on held-out test sets, in a problem containing 52,958 human players across 31,049 games of the classic seven-player game of Diplomacy.
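For reference, the normalized Kendall-tau distance quoted above is a standard quantity: the fraction of item pairs that two rankings order differently. A minimal sketch (not the authors' implementation):

```python
from itertools import combinations

def normalized_kendall_tau_distance(ranking_a, ranking_b):
    """Fraction of item pairs ordered differently by the two rankings.

    Both arguments are lists of the same items, best first. 0 means identical
    rankings, 1 means one ranking is the exact reverse of the other.
    """
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    discordant = sum(
        (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
        for x, y in combinations(pos_a, 2)
    )
    n = len(pos_a)
    return discordant / (n * (n - 1) / 2)

# Example: one swapped pair out of three possible pairs.
print(normalized_kendall_tau_distance(["a", "b", "c"], ["a", "c", "b"]))  # 0.333...
```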
- Milnor-Myerson Games and The Principles of Artificial Principal-Agent Problems. Manfred Diaz, Joel Z Leibo, and Liam Paull. In Finding The Frame: An RLC Workshop for Examining Conceptual Frameworks, Oct 2024.
In this paper, we introduce Milnor-Myerson games, a multiplayer interaction structure at the core of machine learning (ML), to shed light on the fundamental principles and implications the artificial principal-agent problem has had in landmark ML results like AlphaGo and large language models (LLMs).
2022
- Generalization Games for Reinforcement Learning. Manfred Diaz, Charlie Gauthier, Glen Berseth, and Liam Paull. In ICLR 2022 Workshop on Gamification and Multiagent Solutions, Oct 2022.
In reinforcement learning (RL), the term generalization has either denoted the introduction of function approximation to reduce the intractability of problems with large state and action spaces or designated RL agents’ ability to transfer learned experiences to one or more evaluation tasks. Recently, many subfields have emerged to understand how distributions of training tasks affect an RL agent’s performance in unseen environments. While the field is extensive and ever-growing, recent research has underlined that variability among the different approaches is not as significant as it may appear. We leverage this intuition to demonstrate how current methods for generalization in RL are specializations of a general framework. We obtain the fundamental aspects of this formulation by rebuilding a Markov Decision Process (MDP) from the ground up by resurfacing the game-theoretic framework of games against nature. The two-player game that arises from considering nature as a complete player in this formulation explains how existing methods rely on learned and randomized dynamics and initial state distributions. We develop this result further by drawing inspiration from mechanism design theory to introduce the role of a principal as a third player that can modify the payoff functions of the decision-making agent and nature. The games induced by playing against the principal extend our framework to explain how learned and randomized reward functions induce generalization in RL agents. The main contribution of our work is the complete description of the Generalization Games for Reinforcement Learning, a multiagent, multiplayer, game-theoretic formal approach to study generalization methods in RL. We offer a preliminary ablation experiment of the different components of the framework. We demonstrate that a more simplified composition of the objectives that we introduce for each player leads to comparable, and in some cases superior, zero-shot generalization compared to state-of-the-art methods, all while requiring almost two orders of magnitude fewer samples.
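One way to picture the "game against nature" reading of an MDP is to let a second player pick the environment's parameters before each episode while the agent plays the resulting MDP. The toy sketch below is only illustrative: the parameter ranges, the uniform nature policy, and the minimal environment interface (`reset() -> obs`, `step(a) -> (obs, reward, done)`) are assumptions, not the paper's formal construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def nature_move(param_ranges):
    """Nature picks the episode's dynamics, here uniformly at random."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in param_ranges.items()}

def play_episode(agent_policy, make_env, params, horizon=200):
    """The decision-making agent then plays the MDP that nature selected.

    Assumed minimal interface: make_env(**params) builds an environment with
    reset() -> obs and step(action) -> (obs, reward, done).
    """
    env = make_env(**params)
    obs, total_reward = env.reset(), 0.0
    for _ in range(horizon):
        obs, reward, done = env.step(agent_policy(obs))
        total_reward += reward
        if done:
            break
    return total_reward

# Zero-shot generalization is then an average over nature's choices, e.g.:
# np.mean([play_episode(pi, make_env, nature_move(ranges)) for _ in range(100)])
```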
2021
- Braxlines: Fast and Interactive Toolkit for RL-driven Behavior Engineering beyond Reward Maximization. Shixiang Shane Gu, Manfred Diaz, Daniel C. Freeman, Hiroki Furuta, and 6 more authors. Oct 2021.
The goal of continuous control is to synthesize desired behaviors. In reinforcement learning (RL)-driven approaches, this is often accomplished through careful task reward engineering for efficient exploration and running an off-the-shelf RL algorithm. While reward maximization is at the core of RL, reward engineering is not the only – and sometimes not the easiest – way to specify complex behaviors. In this paper, we introduce Braxlines, a toolkit for fast and interactive RL-driven behavior generation beyond simple reward maximization that includes Composer, a programmatic API for generating continuous control environments, and a set of stable and well-tested baselines for two families of algorithms – mutual information maximization (MiMax) and divergence minimization (DMin) – supporting unsupervised skill learning and distribution sketching as other modes of behavior specification. In addition, we discuss how to standardize metrics for evaluating these algorithms, which can no longer rely on simple reward maximization. Our implementations build on a hardware-accelerated Brax simulator in Jax with minimal modifications, enabling behavior synthesis within minutes of training. We hope Braxlines can serve as an interactive toolkit for rapid creation and testing of environments and behaviors, empowering an explosion of future benchmark designs and new modes of RL-driven behavior generation and their algorithmic research.
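The snippet below is not Braxlines code or its API; it is a generic sketch of the mutual-information-maximization (MiMax) idea in the style of DIAYN: a discriminator q(z|s) tries to recognize which skill z produced a state, and the agent is rewarded when it is recognizable. The `discriminator_log_probs` callable is a hypothetical stand-in for a learned network.

```python
import numpy as np

def mimax_intrinsic_reward(state, skill_id, discriminator_log_probs, num_skills):
    """DIAYN-style pseudo-reward: log q(z|s) - log p(z), with a uniform skill prior.

    `discriminator_log_probs(state)` is assumed to return a length-`num_skills`
    array of log-probabilities over skills; it stands in for a learned network.
    """
    log_q = discriminator_log_probs(state)[skill_id]
    log_p = -np.log(num_skills)  # uniform prior over skills
    return log_q - log_p

# Toy check with a discriminator that almost always recognizes skill 2:
almost_perfect = lambda s: np.log(np.eye(4)[2] * 0.96 + 0.01)
print(mimax_intrinsic_reward(state=None, skill_id=2,
                             discriminator_log_probs=almost_perfect, num_skills=4))
```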
- Uncertainty-Aware Policy Sampling and Mixing for Safe Interactive Imitation Learning. Manfred Diaz, Thomas Fevens, and Liam Paull. In 2021 18th Conference on Robots and Vision (CRV), Oct 2021.
- LOCO: Adaptive Exploration in Reinforcement Learning via Local Estimation of Contraction Coefficients. Manfred Diaz, Liam Paull, and Pablo Samuel Castro. In Self-Supervision for Reinforcement Learning Workshop - ICLR 2021, Oct 2021.
We offer a novel approach to balance exploration and exploitation in reinforcement learning (RL). To do so, we characterize an environment’s exploration difficulty via the Second Largest Eigenvalue Modulus (SLEM) of the Markov chain induced by uniform stochastic behaviour. Specifically, we investigate the connection of state-space coverage with the SLEM of this Markov chain and use the theory of contraction coefficients to derive estimates of this eigenvalue of interest. Furthermore, we introduce a method for estimating the contraction coefficients on a local level and leverage it to design a novel exploration algorithm. We evaluate our algorithm on a series of GridWorld tasks of varying sizes and complexity.
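For context, the SLEM of a transition matrix is simply the second-largest eigenvalue magnitude. The sketch below computes it directly for the chain induced by a uniform random policy; the paper's local contraction-coefficient estimators are not reproduced here.

```python
import numpy as np

def slem(transition_matrix):
    """Second Largest Eigenvalue Modulus of a row-stochastic matrix."""
    moduli = np.sort(np.abs(np.linalg.eigvals(transition_matrix)))[::-1]
    return moduli[1]

def uniform_policy_chain(per_action_transitions):
    """Average per-action transition matrices, shape (A, S, S), into the
    Markov chain induced by uniformly random behaviour."""
    return per_action_transitions.mean(axis=0)

# Two-state toy chain: sticky states mix slowly, so the SLEM is close to 1.
sticky = np.array([[0.95, 0.05],
                   [0.05, 0.95]])
print(slem(sticky))  # ~0.9
```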
2020
- Active Domain Randomization. Bhairav Mehta, Manfred Diaz, Florian Golemo, Christopher J. Pal, and 1 more author. In Proceedings of the Conference on Robot Learning, Oct 2020.
Domain randomization is a popular technique for improving domain transfer, often used in a zero-shot setting when the target domain is unknown or cannot easily be used for training. In this work, we empirically examine the effects of domain randomization on agent generalization. Our experiments show that domain randomization may lead to suboptimal, high-variance policies, which we attribute to the uniform sampling of environment parameters. We propose Active Domain Randomization, a novel algorithm that learns a parameter sampling strategy. Our method looks for the most informative environment variations within the given randomization ranges by leveraging the discrepancies of policy rollouts in randomized and reference environment instances. We find that training more frequently on these instances leads to better overall agent generalization. Our experiments across various physics-based simulated and real-robot tasks show that this enhancement leads to more robust, consistent policies.
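A hedged sketch of the discrepancy-driven idea: score each environment-parameter setting by how differently the current policy behaves there compared with a reference instance, then sample settings in proportion to that score. The discrepancy measure and the softmax sampler below are illustrative stand-ins, not the paper's learned sampling strategy.

```python
import numpy as np

rng = np.random.default_rng(0)

def discrepancy(randomized_rollout, reference_rollout):
    """Illustrative discrepancy: gap in return between a rollout in a
    randomized environment and the same policy's rollout in the reference one."""
    return abs(randomized_rollout["return"] - reference_rollout["return"])

def sample_informative_params(candidate_params, scores, temperature=1.0):
    """Sample environment parameters in proportion to how informative
    (high-discrepancy) they have been so far, instead of uniformly."""
    logits = np.asarray(scores, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return candidate_params[rng.choice(len(candidate_params), p=probs)]

# e.g. sample_informative_params([{"friction": 0.2}, {"friction": 1.5}], [0.1, 2.3])
```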
- The AI Driving Olympics at NeurIPS 2018. Julian Zilly, Jacopo Tani, Breandan Considine, Bhairav Mehta, and 13 more authors. In The NeurIPS ’18 Competition, Oct 2020.
Despite recent breakthroughs, the ability of deep learning and reinforcement learning to outperform traditional approaches to control physically embodied robotic agents remains largely unproven. To help bridge this gap, we present the “AI Driving Olympics” (AI-DO), a competition with the objective of evaluating the state of the art in machine learning and artificial intelligence for mobile robotics. Based on the simple and well-specified autonomous driving and navigation environment called “Duckietown,” the AI-DO includes a series of tasks of increasing complexity—from simple lane-following to fleet management. For each task, we provide tools for competitors to use in the form of simulators, logs, code templates, baseline implementations and low-cost access to robotic hardware. We evaluate submissions in simulation online, on standardized hardware environments, and finally at the competition event. The first AI-DO, AI-DO 1, occurred at the Neural Information Processing Systems (NeurIPS) conference in December 2018. In this paper, we describe AI-DO 1, including the motivation and design objectives, the challenges, the provided infrastructure, an overview of the approaches of the top submissions, and a frank assessment of what worked well as well as what needs improvement. The results of AI-DO 1 highlight the need for better benchmarks, which are lacking in robotics, as well as improved mechanisms to bridge the gap between simulation and reality.
2017
- To Veer or Not to Veer: Learning From Experts How to Stay Within the Crosswalk. Manfred Diaz, Roger Girgis, Thomas Fevens, and Jeremy Cooperstock. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2017.
One of the many challenges faced by visually impaired (VI) individuals is the crossing of intersections while remaining within the crosswalk. We present a Learning from Demonstration (LfD) approach to tackle this problem and provide VI users with an assistive agent. Contrary to previous methods, our solution does not presume the existence of particular features in crosswalks. The application of the LfD framework helped us transfer sighted individuals’ abilities to the intelligent assistive agent. Our proposed approach started from a collection of 215 demonstrative videos of intersection crossings executed by sighted individuals ("the experts"). We labeled the video frames to gather the experts’ recommended actions, and then applied a policy derivation technique to extract the optimal behavior using state-of-the-art Convolutional Neural Networks. Finally, to assess the feasibility of such a solution, we evaluated the performance of the trained agent in predicting expert actions.