Titles/Abstracts of Prior Projects in STAT234

 

Modular Reinforcement Learning for Mobile Health  (Review paper)

Abstract: A common feature of reinforcement learning (RL) problems in the mobile health setting is the simultaneous pursuit of multiple subgoals. In particular, mobile health apps wish to assist the user in achieving health goals while also maintaining user engagement. Traditional approaches to reinforcement learning do not take advantage of this knowledge about the hierarchical structure of the problem. On the other hand, the framework of modular reinforcement learning (MRL) makes use of this structure by breaking down an RL problem into separate simultaneously-pursued subgoals, solving these subgoals, and obtaining a policy to balance these subgoals. In doing so, MRL may allow for faster learning and for better scientific understanding of the relationship between different subgoals. Given that multiple subgoals appear frequently in mobile health RL problems, it is natural to ask if MRL could be used for efficient learning in the mobile health setting. This paper investigates the application of MRL to mobile health: we review the major results of the MRL literature and, for the first time, perform a rigorous comparison of modular RL methods on two newly implemented testbeds meant to provide analogues to the mobile health setting.
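
As a concrete illustration of the modular decomposition the abstract describes, below is a minimal sketch of one classical MRL arbitration scheme (greatest-mass Q-summation), where each subgoal module learns its own Q-function and actions are chosen to maximize a weighted sum of module values. This is an assumed example for orientation, not necessarily the arbitration method the paper evaluates; all names and sizes are illustrative.

```python
import numpy as np

class SubgoalModule:
    """One RL module per subgoal; here, tabular Q-learning on a shared state/action space."""
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha, self.gamma = alpha, gamma

    def update(self, s, a, r, s_next):
        # Standard one-step Q-learning update using this module's own subgoal reward.
        td_target = r + self.gamma * self.Q[s_next].max()
        self.Q[s, a] += self.alpha * (td_target - self.Q[s, a])

def arbitrate(modules, s, weights=None):
    """Greatest-mass arbitration: pick the action maximizing the (weighted) sum of module Q-values."""
    weights = np.ones(len(modules)) if weights is None else np.asarray(weights)
    combined = sum(w * m.Q[s] for w, m in zip(weights, modules))
    return int(np.argmax(combined))

# Illustrative use with two subgoals (e.g., a health goal and engagement):
n_states, n_actions = 10, 3
modules = [SubgoalModule(n_states, n_actions) for _ in range(2)]
s = 0
a = arbitrate(modules, s)
# After observing subgoal-specific rewards r_health, r_engage and next state s_next:
# modules[0].update(s, a, r_health, s_next); modules[1].update(s, a, r_engage, s_next)
```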

 

Artificial Trajectories for Off-Policy Learning  

Abstract: Improving estimators for off-policy learning is an area of ongoing interest. When the evaluation policy is deterministic, importance weights are nonzero only for logged trajectories whose actions happen to match it, which leads to a small effective sample size. We explore the use of artificial trajectories (Fonteneau et al., 2013) to increase the effective sample size in off-policy learning. We evaluate our method empirically on simple gridworlds and show that artificial trajectories reduce mean squared error compared to existing methods.
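
For orientation, the sketch below shows ordinary per-trajectory importance-weighted evaluation of a deterministic target policy and the resulting effective sample size; it is a generic illustration of the problem the abstract describes, not the authors' artificial-trajectory method, and the data layout is assumed.

```python
import numpy as np

def trajectory_weight(traj, behavior_probs, eval_policy):
    """Importance weight of one trajectory under a deterministic evaluation policy.

    traj: list of (state, action, reward); behavior_probs[t] is the behavior
    policy's probability of the logged action; eval_policy(state) -> action.
    """
    w = 1.0
    for (s, a, _), p_b in zip(traj, behavior_probs):
        p_e = 1.0 if eval_policy(s) == a else 0.0  # deterministic target policy
        w *= p_e / p_b
    return w

def is_estimate_and_ess(trajs, behavior_probs, eval_policy, gamma=1.0):
    # Discounted return of each logged trajectory.
    returns = np.array([sum(gamma**t * r for t, (_, _, r) in enumerate(tr)) for tr in trajs])
    weights = np.array([trajectory_weight(tr, pb, eval_policy)
                        for tr, pb in zip(trajs, behavior_probs)])
    value = np.mean(weights * returns)                               # ordinary IS estimator
    ess = weights.sum() ** 2 / np.maximum((weights**2).sum(), 1e-12)  # Kish effective sample size
    return value, ess
```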

 

Improving Home Recommendations with Latent Preference Types and Thompson Sampling

Abstract: This paper develops an algorithm to provide high-quality recommendations to potential home buyers, avoiding the “cold start” problem and efficiently learning user preferences. The algorithm aggregates preferences across potential home buyers to develop user-preference types and then creates a probabilistic mapping between each user and each user-preference type.

The “cold start” problem, common in recommendation systems, is avoided by initializing this mapping for first-time users based on their location and/or initial search criteria and by initializing a popularity score for each new property based on a preference-type-specific prediction model. New users are shown homes in proportion to the probability that they are a good fit, providing new users with high-quality recommendations while learning more about their preferences. Thompson Sampling is integrated into the site’s existing recommendation system to provide high-quality recommendations whose diversity reflects the system’s uncertainty about the quality of a new listing or its fit with a given user. The probabilistic nature of the approach is sufficiently flexible that users doing parallel searches (e.g., for a primary residence and a vacation home) will be shown homes relevant to both searches, and users who are just browsing (e.g., to get a feel for the market before starting their home search) will be shown a wide swath of properties.
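
The following sketch illustrates the general flavor of Thompson Sampling over latent preference types described above; the type structure, Beta click models, and initialization are hypothetical placeholders rather than the site's actual recommendation system.

```python
import numpy as np

rng = np.random.default_rng(0)

n_types = 4          # latent user-preference types (hypothetical)
n_homes = 50         # candidate listings (hypothetical)

# Posterior over one user's type, initialized from location/search criteria (here: uniform).
user_type_posterior = np.full(n_types, 1.0 / n_types)

# Beta posteriors over "a user of type k clicks home j", warm-started per type in practice.
alpha = np.ones((n_types, n_homes))
beta = np.ones((n_types, n_homes))

def recommend():
    # Thompson Sampling: sample a type for the user, then sample click probabilities
    # for that type and recommend the highest-sampled home.
    k = rng.choice(n_types, p=user_type_posterior)
    theta = rng.beta(alpha[k], beta[k])
    return k, int(np.argmax(theta))

def update(k, home, clicked):
    # Update the sampled type's click posterior and re-weight the user's type posterior.
    global user_type_posterior
    alpha[k, home] += clicked
    beta[k, home] += 1 - clicked
    like = alpha[:, home] / (alpha[:, home] + beta[:, home])
    like = like if clicked else 1 - like
    user_type_posterior = user_type_posterior * like
    user_type_posterior /= user_type_posterior.sum()

k, home = recommend()
update(k, home, clicked=1)
```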

 

Quantitative Trading using Reinforcement Learning

Abstract: Predicting securities’ prices is difficult because a trader’s actions (buy, sell, etc.) interact with those of other traders in the same market and affect the prices of the securities. Conventional supervised learning models cannot capture this dynamic, game-like nature of the trading process. This project aims to apply reinforcement learning (RL) models to improve trading strategies for investment. We experimented with a number of RL models and ultimately chose a policy gradient (PG) algorithm [SB17] as our primary approach because of its convergence properties.
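
As a generic illustration of the policy gradient approach mentioned above (not the project's actual model), the sketch below implements a REINFORCE-style update for a linear softmax policy over discrete trading actions; the feature encoding and hyperparameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class LinearSoftmaxPolicy:
    """pi(a|s) = softmax(theta^T phi(s)) over discrete actions, e.g. buy / hold / sell."""
    def __init__(self, n_features, n_actions, lr=0.01):
        self.theta = np.zeros((n_features, n_actions))
        self.lr = lr

    def act(self, phi):
        return rng.choice(self.theta.shape[1], p=softmax(phi @ self.theta))

    def reinforce_update(self, episode, gamma=0.99):
        # episode: list of (features, action, reward); Monte-Carlo return scales the gradient.
        G = 0.0
        for phi, a, r in reversed(episode):
            G = r + gamma * G
            probs = softmax(phi @ self.theta)
            grad_log = -np.outer(phi, probs)   # d log pi(a|s) / d theta
            grad_log[:, a] += phi
            self.theta += self.lr * G * grad_log

policy = LinearSoftmaxPolicy(n_features=8, n_actions=3)   # actions: buy / hold / sell
phi = rng.normal(size=8)
a = policy.act(phi)
policy.reinforce_update([(phi, a, 1.0)])
```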

 

Inference for Bandit Algorithms with Pooling for Mobile Health

Abstract: For mobile health (mhealth) interventions to be effective, it is crucial that activity nudges are sent at times when users would most benefit and are most likely to act on the suggestions (Nahum-Shani et al., 2017). Learning the best time and context in which to send activity suggestions is difficult because the algorithms are primarily optimized online and thus must learn quickly in order to prevent user burden from activity suggestions sent at inopportune times (Heckman et al., 2016).

 

The two main paradigms for mhealth algorithms are (1) personalized algorithms that train one model per user and (2) population-level algorithms that train a single model for all users. Since population-level algorithms pool the data of multiple users together, when users are "similar" this can lead to much faster learning compared to per-user models. One drawback of population-level mhealth algorithms is that techniques for inference with these models are underdeveloped, which limits their use in high-stakes clinical trials. Moreover, although studies using population-level mhealth algorithms have been performed (Yom-Tov et al., 2017), they are not able to provide reliable confidence intervals for the treatment effect, which can limit their conclusiveness.

 

When performing inference on the treatment effect after the study is over, both individual and population-level models must account for any time-varying effects; for example, in a smoking cessation study, mindfulness suggestions may only be effective when the user's overall stress levels are low (Boruvka et al., 2018). However, while analyses of data collected through personalized algorithms can generally assume that the users' state/action/reward trajectories are independent, this is not the case for population-level models, as pooling the data of multiple users together creates dependencies between user trajectories through the shared algorithm. In this work, I begin to formalize and develop inference techniques for simplified pooled mhealth study settings.

 

Deep Reinforcement Learning for Setting Monetary Policy in a Simple Economy

Abstract: Central banks adjust the discount rate, or the interest rate for emergency loans from the federal government, in response to economic conditions in order to control inflation and promote economic growth. The Taylor rule is a common heuristic for setting the discount rate, and significant empirical research suggests that central banks frequently follow a policy close to the Taylor rule. In this project, we asked whether a deep reinforcement learning agent could adjust the discount rate in response to a wide range of economic data more effectively than the Taylor rule, particularly in situations where the Taylor rule is known to perform poorly. To test this, we developed a simple model of an economy based on Lengwiller et al. (2003) in which a central banker sets interest rates and the economy responds accordingly. We then used the deep Q-learning algorithm with ε-greedy exploration, along with a number of small modifications, to predict TD targets as a function of the discount rate and set optimal discount rates. In most circumstances, deep Q monetary policy performed no worse than the Taylor rule. In several important scenarios in which the Taylor rule is known to perform poorly, deep Q monetary policy significantly outperformed the Taylor rule as measured by a growth-inflation reward function.
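
The sketch below illustrates the two ingredients named in the abstract, ε-greedy action selection and the one-step TD target a Q-network is trained toward, in stripped-down form; the state features, discount-rate discretization, and linear Q-function are placeholders, and the project's modifications and Lengwiller-style economy are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 11          # candidate discount-rate levels (placeholder discretization)
n_features = 6          # e.g., inflation, output gap, lags (placeholder state features)
W = rng.normal(scale=0.01, size=(n_features, n_actions))  # linear stand-in for a Q-network

def q_values(state):
    return state @ W

def epsilon_greedy(state, epsilon=0.1):
    # With probability epsilon explore a random rate, otherwise take the greedy rate.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(q_values(state)))

def td_target(reward, next_state, gamma=0.99):
    # One-step TD target that the Q-function is trained to regress toward.
    return reward + gamma * np.max(q_values(next_state))

state = rng.normal(size=n_features)
a = epsilon_greedy(state)
next_state = rng.normal(size=n_features)
y = td_target(reward=1.0, next_state=next_state)
```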

 

Adaptive Encouragement Selection for Field Experiments

Abstract: Adaptive experimentation is used to discover the optimal treatment regime among several candidates. However, social science experimental researchers are traditionally less concerned with treatment optimization and more concerned with treatment inference. A particular challenge in such inference is non-compliance, which requires researchers to intuit how best to ‘encourage’ their treatment group into compliance. In this project, I propose a data-driven way of selecting optimal encouragements. In the context of multi-valued instruments (encouragements), I reframe the field experiment as a multi-armed bandit and suggest Thompson Sampling as an efficient method for maximizing compliance by allocating strong encouragements to treated units. I explain the details of my proposed adaptive encouragement methodology, present some initial simulation results, and briefly describe work in progress to better understand, benchmark, and motivate adaptive encouragement selection.
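
A minimal Beta-Bernoulli Thompson Sampling sketch of the encouragement-selection idea follows; the arm set, priors, and compliance model are illustrative assumptions, not the proposed methodology's full design.

```python
import numpy as np

rng = np.random.default_rng(0)

n_encouragements = 3                   # e.g., no reminder, letter, phone call (illustrative)
alpha = np.ones(n_encouragements)      # Beta(1, 1) priors on each arm's compliance probability
beta = np.ones(n_encouragements)

def assign_encouragement():
    # Thompson Sampling: sample a compliance probability per arm, assign the argmax arm.
    samples = rng.beta(alpha, beta)
    return int(np.argmax(samples))

def record_outcome(arm, complied):
    # Conjugate Beta-Bernoulli update from the observed compliance indicator.
    alpha[arm] += complied
    beta[arm] += 1 - complied

arm = assign_encouragement()
record_outcome(arm, complied=1)
```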

 

Bayesian Online Meta-Learning

Abstract: In this paper, we propose a Bayesian treatment of the follow-the-meta-leader algorithm developed in [Fin+19] that translates the Bayesian model-agnostic meta-learning algorithm [Kim+18] to the online learning setting. By maintaining a posterior over the learned model parameters, our algorithm is able to capture model uncertainty online and should yield improved generalization relative to the algorithm of [Fin+19] through the regularizing effect of model priors. In addition to introducing our algorithm, we review related work and showcase results from re-implementing the follow-the-meta-leader algorithm that highlight the appeal of the online meta-learning paradigm.

 

State-Space Reduction in Deep Q-Networks

Abstract: Deep convolutional neural networks have become a popular approach to estimating Q-value functions in reinforcement learning problems. These deep Q-networks take in entire images as network inputs, often resulting in large state spaces and long learning times. In this paper, we explore the use of principal component analysis to reduce the state space of image inputs for small network sizes. After testing multiple network configurations, we determine that a reduction in uninformative state features through PCA helps improve the performance of deep reinforcement learning, particularly for small neural network architectures.
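
The following sketch shows the kind of PCA preprocessing step the abstract describes, using scikit-learn to compress flattened frames into a low-dimensional state before a small Q-network; frame sizes and the number of components are assumed placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Placeholder: a batch of flattened grayscale frames (e.g., 84x84 -> 7056 features).
frames = rng.random((1000, 84 * 84)).astype(np.float32)

# Fit PCA on frames gathered under an exploratory policy, keeping the top components.
pca = PCA(n_components=50)
pca.fit(frames)

def reduce_state(frame):
    # Project a single flattened frame into the low-dimensional state fed to the Q-network.
    return pca.transform(frame.reshape(1, -1))[0]

z = reduce_state(frames[0])   # 50-dimensional state instead of 7056 raw pixels
print(z.shape, pca.explained_variance_ratio_.sum())
```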

 

Transfer Learning for Atari games via Model-Based Reinforcement Learning

Abstract: Transfer learning is an active area of research because it offers promising ways to learn a complex task more efficiently when learning is data-intensive and computationally costly and sufficient data or computational resources are unavailable. In the context of model-based reinforcement learning, we use Atari games from the Arcade Learning Environment to quantify the benefit of pre-training a dynamics model of the environment on a source domain (a set of Atari games) for learning a new target task (a new Atari game). We investigate how the amount of available target data affects the performance of the pre-trained model compared to a model trained from scratch.
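
As a rough sketch of the pretrain-then-fine-tune recipe (not the project's architecture or data pipeline), the code below defines a small dynamics model, pretrains it on pooled source-game transitions, and fine-tunes it on limited target-game data; shapes, batch iterators, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicsModel(nn.Module):
    """Predicts the next latent state from (state, action); a stand-in for the project's model."""
    def __init__(self, state_dim=64, n_actions=18):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_actions, 256), nn.ReLU(),
            nn.Linear(256, state_dim),
        )

    def forward(self, state, action):
        onehot = F.one_hot(action, self.n_actions).float()
        return self.net(torch.cat([state, onehot], dim=-1))

def train(model, batches, lr, steps):
    # batches yields (state, action, next_state) tensors; loss is next-state prediction error.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _, (s, a, s_next) in zip(range(steps), batches):
        loss = loss_fn(model(s, a), s_next)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Pretrain on pooled source-game transitions, then fine-tune (smaller lr) on the target game.
model = DynamicsModel()
# train(model, source_batches, lr=1e-3, steps=50_000)   # source_batches: placeholder iterator
# train(model, target_batches, lr=1e-4, steps=2_000)    # limited target data
```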

 

Empirical Study of Negative Transfer

Abstract: Robust detection of non-transferable tasks can avoid significant performance decay during transfer learning (negative transfer), which is essential for safety-critical domains. In this project, we identify two conditions under which negative transfer may happen: (1) assumption mismatch and (2) out-of-distribution tasks. The latter condition is commonly studied in the literature, whereas the former is not. We present two gridworld domains that demonstrate different types of negative transfer. We then compare the transferability of two major approaches from the transfer learning literature under a challenging framework in which we only have batch data generated from optimal policies for previous tasks. Based on our results, we conclude that investigating the task space prior to modeling can help avoid negative transfer.

 

Personalized Policy Characterization in Meta-Reinforcement Learning

Abstract: We build on the growing body of research on generalizable reinforcement learning, including transfer, meta-, and few-shot learning, focusing on personalizing policies to specific environments. The goal is for models to personalize or adapt to individual types with distinguishable characteristics when faced with a population of possible agents. Our work is inspired by traditional meta-learning, where the goal is to train a model on a variety of tasks such that it can solve new tasks using only a small number of training samples, but is more closely motivated by real-world settings where policies may encounter a variety of environment dynamics and individualized responses to interventions. Accordingly, instead of a variety of tasks defined by diverse reward functions, we consider differences across entire encountered MDPs. We introduce personalized environments to study such problems, propose methods for tracking divergence and few-shot learning in the personalized setting, and demonstrate how models trained explicitly with respect to policy divergence can lead to favorable results in few-shot meta-learning.

 

Deep Reinforcement Learning for Sepsis Treatment

Abstract: A long-standing goal of artificial intelligence in healthcare is an algorithm that learns from human cognition and achieves superhuman proficiency in assisting clinical decision making in challenging medical domains. We develop a deep reinforcement learning (RL) agent with a new deep Q-network (DQN) architecture designed specifically for sequential decision-making problems in healthcare. Our agent learns optimal management strategies in intensive care for sepsis, one of the most significant contributors to morbidity and mortality worldwide. We explore potential treatments using a large amount of retrospective patient data and learn medication effects separately for different severities (sepsis and shock). Our agent provides reliably high value estimates at the population level, gives reasonable personalized dosage recommendations over the course of an individual patient's treatment, and produces a medically interpretable optimal policy. We show a consistent reduction in mortality among patients whose clinicians' actions match our suggestions to varying degrees.

 

 

New Directions for Applying RL to Educational Settings

Abstract: This paper attempts to summarize and identify areas of Reinforcement Learning (RL) that could be applied to problems of practice and evaluation in educational settings. It begins by identifying the ways educational settings differ from standard RL settings and the implications those differences have for applying RL. It continues by identifying two frontiers in RL that are particularly salient for educational research and practice, and concludes by going into depth on a potential intervention that could leverage the power of RL to help schools better prevent and react to student absences.

 

Scalable Online Opponent Modelling in Imperfect Information Games

Abstract: Over the past decade, rapid progress has been made towards solving large-scale imperfect information games, including No-Limit Hold’em Poker [1] and Dota 2 [2], that were previously considered intractable. However, relatively little of this work has focused on opponent modelling. As a result, much of the existing opponent modelling literature relies on techniques that, while effective in small imperfect information games, struggle to scale to large ones. This paper attempts to bridge this gap by proposing an imitation-learning-based approach to opponent modelling in large-scale imperfect information games. We identify that the main challenge in scaling existing opponent modelling algorithms is an over-reliance on an exponentially large strategy space. To combat this, we propose using imitation learning to distill a large strategy space into a smaller neural network parameter space. Our experiments on Leduc Poker, a small variant of Hold’em Poker, show that our imitation best-response model consistently achieves results comparable to the best Bayesian opponent modelling baselines while maintaining the flexibility to scale efficiently to large games. An implementation of our algorithms is built on top of the popular PokerRL framework [3] and can be found on GitHub.
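
To make the distillation idea concrete, here is a minimal behavior-cloning sketch that fits a small network to observed (information-set features, opponent action) pairs; the feature encoding, network size, and training data are illustrative assumptions, and this is not the paper's PokerRL integration.

```python
import torch
import torch.nn as nn

class OpponentModel(nn.Module):
    """Small policy network distilled from observed opponent actions (behavior cloning)."""
    def __init__(self, obs_dim=32, n_actions=3):   # e.g., fold / call / raise
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)                        # action logits

def distill(model, dataset, epochs=5, lr=1e-3):
    # dataset yields (infoset_features, opponent_action) batches gathered during play.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for obs, action in dataset:
            loss = loss_fn(model(obs), action)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Illustrative data: 256 random information-set encodings with observed actions.
obs = torch.randn(256, 32)
acts = torch.randint(0, 3, (256,))
model = OpponentModel()
distill(model, [(obs, acts)])
predicted_response = model(obs[:1]).softmax(dim=-1)  # predicted opponent action distribution
```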