Safe Reinforcement Learning by Imagining the Near Future

http://arxiv.org/abs/2202.07789v1

Abstract

Safe reinforcement learning is a promising path toward applying reinforcement learning algorithms to real-world problems, where suboptimal behaviors may lead to actual negative consequences. In this work, we focus on the setting where unsafe states can be avoided by planning ahead a short time into the future. In this setting, a model-based agent with a sufficiently accurate model can avoid unsafe states. We devise a model-based algorithm that heavily penalizes unsafe trajectories, and derive guarantees that our algorithm can avoid unsafe states under certain assumptions. Experiments demonstrate that our algorithm can achieve competitive rewards with fewer safety violations in several continuous control tasks.


Introduction

Reinforcement learning (RL) enables the discovery of effective policies for sequential decision-making tasks via trial and error [@mnih2015human; @gu2016deep; @bellemare2020autonomous]. However, in domains such as robotics, healthcare, and autonomous driving, certain kinds of mistakes pose danger to people and/or objects in the environment. Hence there is an emphasis on the safety of the policy, both at execution time and while interacting with the environment during learning. This issue, referred to as safe exploration, is considered an important problem in AI safety [@amodei2016concrete].

In this work, we advocate a model-based approach to safety, meaning that we estimate the dynamics of the system to be controlled and use the model for planning (or more accurately, policy improvement). The primary motivation for this is that a model-based method has the potential to anticipate safety violations before they occur. Often in real-world applications, the engineer has an idea of what states should be considered violations of safety: for example, a robot colliding rapidly with itself or surrounding objects, a car driving on the wrong side of the road, or a patient's blood glucose levels spiking. Yet model-free algorithms typically lack the ability to incorporate such prior knowledge and must encounter some safety violations before learning to avoid them.

An illustrative example. The agent controls the speed of a car by pressing the accelerator or brake (or neither), attempting to avoid any obstacles such as other cars or people in the road. The top car has not yet come into contact with the pedestrian, but cannot avoid the pedestrian from its current position and speed, even if it brakes immediately. The bottom car can slow down before hitting the pedestrian. If the bottom car plans several steps into the future, it could reduce its speed to avoid the "irrecoverable" situation faced by the top car.{#fig:didactic width="75%"}

We begin with the premise that in practice, forward prediction for relatively few timesteps is sufficient to avoid safety violations. Consider the illustrative example in Figure 1{reference-type="ref" reference="fig:didactic"}, in which an agent controls the acceleration (and thereby, speed) of a car by pressing the gas or brake (or nothing). Note that there is an upper bound on how far into the future the agent would have to plan to foresee and (if possible) avoid any collision, namely, the amount of time it takes to bring the car to a complete stop.
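To make this upper bound concrete, here is a minimal sketch (with made-up speed, deceleration, and control-interval values) that computes the stopping horizon for the car example; the kinematics and numbers are purely illustrative, not part of any task in this paper.

```python
import math

def stopping_horizon(speed, max_decel, dt):
    """Timesteps needed to brake from `speed` to a full stop.

    If maximal braking avoids collision for this many steps, the car reaches a
    full stop and remains safe thereafter, so a planner never needs to look
    further ahead. (Hypothetical kinematics; numbers below are illustrative.)
    """
    return math.ceil(speed / (max_decel * dt))

# e.g. 15 m/s, braking at 5 m/s^2, 0.1 s control interval -> 30 steps
print(stopping_horizon(speed=15.0, max_decel=5.0, dt=0.1))
```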

Assuming that the horizon required for detecting unsafe situations is not too large, we show how to construct a reward function with the property that an optimal policy will never incur a safety violation. A short prediction horizon is also beneficial for model-based RL, as the well-known issue of compounding error plagues long-horizon prediction [@asadi2019combating]: imperfect predictions are fed back into the model as inputs (possibly outside the distribution of inputs in the training data), leading to progressively worse accuracy as the prediction horizon increases.

Our main contribution is a model-based algorithm that utilizes a reward penalty -- the value of which is prescribed by our theoretical analysis -- to guarantee safety (under some assumptions). Experiments indicate that the practical instantiation of our algorithm, SMBPO, effectively reduces the number of safety violations on several continuous control tasks, achieving comparable performance with far fewer safety violations than several model-free safe RL algorithms. Code is made available at https://github.com/gwthomas/Safe-MBPO.

Background

In this work, we consider a deterministic[^1] Markov decision process (MDP) $M = (\mathcal{S}, \mathcal{A}, T, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $T : \mathcal{S}\times \mathcal{A}\to \mathcal{S}$ the transition dynamics, $r : \mathcal{S}\times \mathcal{A}\to [r_{\min}, r_{\max}]$ the reward function, and $\gamma \in [0,1)$ the discount factor. A policy $\pi : \mathcal{S}\to \Delta(\mathcal{A})$ determines what action to take at each state. A trajectory is a sequence of states and actions $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$ where $s_{t+1} = T(s_t,a_t)$ and $r_t = r(s_t,a_t)$.

Typically, the goal is to find a policy which maximizes the expected discounted return $\eta(\pi) = \mathbb{E}^\pi[\sum_{t=0}^\infty \gamma^t r_t]$. The notation $\mathbb{E}^\pi$ denotes that actions are sampled according to $a_t \sim \pi(s_t)$. The initial state $s_0$ is drawn from an initial distribution which we assume to be fixed and leave out of the notation for simplicity.

The $Q$ function $Q^\pi(s,a) = \mathbb{E}^\pi[\sum_{t=0}^\infty \gamma^t r_t \mid s_0=s, a_0=a]$ quantifies the conditional performance of a policy $\pi$ assuming it starts in a specific state $s$ and takes action $a$, and the value function $V^\pi(s) = \mathbb{E}_{a \sim \pi(s)}[Q^\pi(s,a)]$ averages this quantity over actions. The values of the best possible policies are denoted $Q^*(s,a) = \max_\pi Q^\pi(s,a)$ and $V^*(s) = \max_\pi V^\pi(s)$. The function $Q^*$ has the important property that any optimal policy $\pi^* \in \arg\max_\pi \eta(\pi)$ must satisfy $\mathbb{P}(a^* \in \arg\max_a Q^*(s,a)) = 1$ for all states $s$ and actions $a^* \sim \pi^*(s)$. $Q^*$ is the unique fixed point of the Bellman operator $$\mathcal{B}^*Q(s,a) = r(s,a) + \gamma \max_{a'} Q(s', a') \quad \text{where} ~ s' = T(s,a)$$
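For concreteness, the following sketch iterates the Bellman operator above to its fixed point on a small tabular, deterministic MDP; the transition table and rewards are toy values chosen only for illustration.

```python
import numpy as np

def bellman_backup(Q, T, r, gamma):
    """One application of B*: Q(s,a) <- r(s,a) + gamma * max_{a'} Q(T(s,a), a').

    Q: (S, A) array of Q-values.
    T: (S, A) integer array of deterministic next-state indices.
    r: (S, A) array of rewards.
    """
    return r + gamma * Q[T].max(axis=-1)

# Toy example (all values made up): 3 states, 2 actions.
S, A, gamma = 3, 2, 0.9
T = np.array([[1, 2], [2, 0], [0, 1]])
r = np.random.uniform(size=(S, A))
Q = np.zeros((S, A))
for _ in range(200):            # fixed-point iteration converges to Q*
    Q = bellman_backup(Q, T, r, gamma)
```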

In model-based RL, the algorithm estimates a dynamics model $\widehat{T}$ using the data observed so far, then uses the model for planning or data augmentation. The theoretical justification for model-based RL is typically based on some version of the "simulation lemma", which roughly states that if $\widehat{T}\approx T$ then $\hat\eta(\pi) \approx \eta(\pi)$ [@kearns2002near; @luo2018algorithmic].

Method

In this work, we train safe policies by modifying the reward function to penalize safety violations. We assume that the engineer specifies $\mathcal{S}_{\textup{unsafe}}$, the set of states which are considered safety violations.

We must also account for the existence of states which are not themselves unsafe, but lead inevitably to unsafe states regardless of what actions are taken.

::: {.definition} Definition 1. A state $s$ is said to be unsafe if $s \in \mathcal{S}_{\textup{unsafe}}$; irrecoverable if $s \not\in \mathcal{S}_{\textup{unsafe}}$ but for any sequence of actions $a_0, a_1, a_2, \dots$, the trajectory defined by $s_0 = s$ and $s_{t+1} = T(s_t, a_t)$ satisfies $s_{\bar t} \in \mathcal{S}_{\textup{unsafe}}$ for some $\bar t \ge 1$; and safe if it is neither unsafe nor irrecoverable. :::

We remark that these definitions are similar to those introduced in prior work on safe RL [@hans2008safe]. Crucially, we do not assume that the engineer specifies which states are (ir)recoverable, as that would require knowledge of the system dynamics. However, we do assume that a safety violation must come fairly soon after entering an irrecoverable region:

::: {#ass:fastfail .assumption} Assumption 1. There exists a horizon $H^* \in \mathbb{N}$ such that, for any irrecoverable state $s$, any sequence of actions $a_0, \dots, a_{H^*-1}$ will lead to an unsafe state. That is, if $s_0 = s$ and $s_{t+1} = T(s_{t}, a_{t})$ for all $t \in \{0, \dots, H^*-1\}$, then $s_{\bar{t}} \in \mathcal{S}_{\textup{unsafe}}$ for some $\bar{t} \in \{1, \dots, H^*\}$. :::

This assumption rules out the possibility that a state leads inevitably to termination but takes an arbitrarily long time to do so. The implication of this assumption is that a perfect lookahead planner which considers the next $H^*$ steps into the future can avoid not only the unsafe states, but also any irrecoverable states, with some positive probability.

Reward penalty framework

Now we present a reward penalty framework for guaranteeing safety. Let $\widetilde{M}_C = (\mathcal{S}, \mathcal{A}, \widetilde{T}, \tilde{r}, \gamma)$ be an MDP with reward function and dynamics $$\big(\tilde{r}(s,a), \widetilde{T}(s,a)\big) = \begin{cases} (r(s,a), T(s,a)) & s \not\in \mathcal{S}_{\textup{unsafe}} \\ (-C, s) & s \in \mathcal{S}_{\textup{unsafe}} \end{cases}$$ where the terminal cost $C \in \mathbb{R}$ is a constant (more on this below). That is, unsafe states are "absorbing" in that they transition back into themselves and receive the reward of $-C$ regardless of what action is taken.
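This construction can be mimicked with a simple environment wrapper. The sketch below assumes a hypothetical `reset()`/`step()` interface and a user-supplied `is_unsafe` predicate; it is an illustration of the penalized MDP, not the interface used in our released code.

```python
class PenaltyAbsorbingWrapper:
    """Environment wrapper mimicking the penalized MDP above (a sketch).

    Assumed interface: env.reset() -> state and
    env.step(a) -> (state, reward, done, info), plus a user-supplied
    predicate is_unsafe(state). Once an unsafe state is reached, the
    wrapper self-loops there with reward -C regardless of the action.
    """

    def __init__(self, env, is_unsafe, C):
        self.env = env
        self.is_unsafe = is_unsafe
        self.C = C
        self._unsafe_state = None   # set once we have entered S_unsafe

    def reset(self):
        self._unsafe_state = None
        return self.env.reset()

    def step(self, action):
        if self._unsafe_state is not None:
            # Absorbing unsafe state: same state back, reward -C, no termination.
            return self._unsafe_state, -self.C, False, {}
        state, reward, done, info = self.env.step(action)
        if self.is_unsafe(state):
            self._unsafe_state = state
        return state, reward, done, info
```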

The basis of our approach is to determine how large $C$ must be so that the Q values of actions leading to unsafe states are less than the Q values of safe actions.

::: {.lemma} Lemma 1. Suppose that Assumption 1{reference-type="ref" reference="ass:fastfail"} holds, and let $$C > \frac{r_{\max} - r_{\min}}{\gamma^{H^*}} - r_{\max}. \label{eqn:theory-C}$$ Then for any state $s$, if $a$ is a safe action (i.e. $T(s,a)$ is a safe state) and $a'$ is an unsafe action (i.e. $T(s,a')$ is unsafe or irrecoverable), it holds that $\widetilde{Q}^*(s,a) > \widetilde{Q}^*(s,a')$, where $\widetilde{Q}^*$ is the $Q^*$ function for the MDP $\widetilde{M}_C$. :::

[[lemma:theory-C]]{#lemma:theory-C label="lemma:theory-C"}

::: {.proof} Proof. Since $a'$ is unsafe, it leads to an unsafe state in at most $H^*$ steps by assumption. Thus the discounted reward obtained is at most $$\sum_{t=0}^{H^*-1} \gamma^t r_{\max} + \sum_{t=H^*}^\infty \gamma^t (-C) = \frac{r_{\max}(1-\gamma^{H^*}) - C\gamma^{H^*}}{1-\gamma}$$ By comparison, the safe action $a$ leads to another safe state, where it can be guaranteed to never encounter a safety violation. The reward of staying within the safe region forever must be at least $\frac{r_{\min}}{1-\gamma}$. Thus, it suffices to choose $C$ large enough that $$\frac{r_{\max}(1-\gamma^{H^*}) - C\gamma^{H^*}}{1-\gamma} < \frac{r_{\min}}{1-\gamma}$$ Rearranging, we arrive at the condition stated. ◻ :::
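As a quick numerical check of the bound in Lemma 1, the following sketch plugs in illustrative values (rewards in $[0,1]$, $\gamma = 0.99$, $H^* = 10$); these numbers are for illustration only and do not correspond to any task in the experiments.

```python
def terminal_cost_bound(r_min, r_max, gamma, H_star):
    """Right-hand side of the condition on C: (r_max - r_min)/gamma^H* - r_max."""
    return (r_max - r_min) / gamma**H_star - r_max

# Illustrative values only: rewards in [0, 1], gamma = 0.99, H* = 10.
print(terminal_cost_bound(r_min=0.0, r_max=1.0, gamma=0.99, H_star=10))  # ~0.106
# Any C strictly greater than this value satisfies the lemma's condition.
```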

The important consequence of this result is that an optimal policy for the MDP $\widetilde{M}_C$ will always take safe actions. However, in practice we cannot compute $\widetilde{Q}^*$ without knowing the dynamics model $T$. Therefore we extend our result to the model-based setting where the dynamics are imperfect.

Extension to model-based rollouts

We prove safety for the following theoretical setup. Suppose we have a dynamics model that outputs sets of states $\widehat{T}(s,a) \subseteq \mathcal{S}$ to account for uncertainty.

::: {.definition} Definition 2. We say that a set-valued dynamics model $\widehat{T}: \mathcal{S}\times \mathcal{A}\to \mathcal{P}(\mathcal{S})$[^2] is calibrated if $T(s,a) \in \widehat{T}(s,a)$ for all $(s,a) \in \mathcal{S}\times \mathcal{A}$. :::

[[def:calibration]]{#def:calibration label="def:calibration"}

We define the Bellmin operator: $$\underline{\mathcal{B}}^*Q(s,a) = \tilde{r}(s,a) + \gamma \min_{s' \in \widehat{T}(s,a)} \max_{a'} Q(s',a')$$
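To make the operator concrete, here is a minimal tabular sketch of one Bellmin backup when the set-valued model is represented by a finite collection of candidate next states (e.g., predictions from an ensemble); the array layout is a hypothetical choice for illustration.

```python
import numpy as np

def bellmin_backup(Q, ensemble_next_states, r_tilde, gamma):
    """One Bellmin backup on a tabular Q with a finite set-valued model.

    Q: (S, A) array of Q-values.
    ensemble_next_states: (N, S, A) integer array; entry [i, s, a] is model i's
        candidate next-state index for (s, a). Calibration means the true next
        state is always among the N candidates.
    r_tilde: (S, A) array of penalized rewards.
    """
    state_values = Q.max(axis=-1)                           # max_{a'} Q(s', a') for every s'
    candidate_values = state_values[ensemble_next_states]   # shape (N, S, A)
    return r_tilde + gamma * candidate_values.min(axis=0)   # pessimistic min over candidates
```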

::: {.lemma} Lemma 2. The Bellmin operator $\underline{\mathcal{B}}^*$ is a $\gamma$-contraction in the $\infty$-norm. :::

[[lemma:bellmin]]{#lemma:bellmin label="lemma:bellmin"} The proof is deferred to Appendix 7.1{reference-type="ref" reference="app:proofs"}. As a consequence of Lemma [lemma:bellmin]{reference-type="ref" reference="lemma:bellmin"} and Banach's fixed-point theorem, $\underline{\mathcal{B}}^*$ has a unique fixed point $\underline{Q}^*$ which can be obtained by iteration. This fixed point is a lower bound on the true Q function if the model is calibrated:

::: {.lemma} Lemma 3. If $\widehat{T}$ is calibrated in the sense of Definition [def:calibration]{reference-type="ref" reference="def:calibration"}, then $\underline{Q}^*(s,a) \leq \widetilde{Q}^*(s,a)$ for all $(s,a)$. :::

[[lemma:bellmin-lb]]{#lemma:bellmin-lb label="lemma:bellmin-lb"}

::: {.proof} Proof. Let $\tilde{\mathcal{B}}^*$ denote the Bellman operator with reward function $\tilde{r}$. First, observe that for any $Q, Q': \mathcal{S}\times \mathcal{A}\to \mathbb{R}$, $Q \leq Q'$ pointwise implies $\underline{\mathcal{B}}^*Q \leq \tilde{\mathcal{B}}^*Q'$ pointwise, because we have $\tilde{r}(s,a) + \gamma \max_{a'} Q(s',a') \leq \tilde{r}(s,a) + \gamma \max_{a'} Q'(s',a')$ pointwise and the min defining $\underline{\mathcal{B}}^*$ includes the true $s' = T(s,a)$.

Now let $Q_0$ be any initial $Q$ function. Define $\widetilde{Q}_k = (\tilde{\mathcal{B}}^*)^k Q_0$ and $\underline{Q}_k = (\underline{\mathcal{B}}^*)^k Q_0$. An inductive argument coupled with the previous observation shows that $\underline{Q}_k \leq \widetilde{Q}_k$ pointwise for all $k \in \mathbb{N}$. Hence, taking the limits $\widetilde{Q}^* = \lim_{k \to \infty} \widetilde{Q}_k$ and $\underline{Q}^* = \lim_{k \to \infty} \underline{Q}_k$, we obtain $\underline{Q}^* \leq \widetilde{Q}^*$ pointwise. ◻ :::

Now we are ready to present our main theoretical result.

::: {.theorem} Theorem 1. Let $\widehat{T}$ be a calibrated dynamics model and $\pi^*(s) = \arg\max_a \underline{Q}^*(s,a)$ the greedy policy with respect to $\underline{Q}^*$. Assume that Assumption 1{reference-type="ref" reference="ass:fastfail"} holds. Then for any $s \in \mathcal{S}$, if there exists an action $a$ such that $\underline{Q}^*(s,a) \ge \frac{r_{\min}}{1-\gamma}$, then $\pi^*(s)$ is a safe action. :::

[[thm:main]]{#thm:main label="thm:main"}

::: {.proof} Proof. Lemma [lemma:bellmin-lb]{reference-type="ref" reference="lemma:bellmin-lb"} implies that $\underline{Q}^*(s,a) \leq \widetilde{Q}^*(s,a)$ for all $(s,a) \in \mathcal{S}\times \mathcal{A}$.

As shown in the proof of Lemma [lemma:theory-C]{reference-type="ref" reference="lemma:theory-C"}, any unsafe action $a'$ satisfies $$\underline{Q}^*(s,a') \leq \widetilde{Q}^*(s,a') \leq \frac{r_{\max}(1-\gamma^{H^*}) - C\gamma^{H^*}}{1-\gamma}$$ Similarly, if $\underline{Q}^*(s,a) \geq \frac{r_{\min}}{1-\gamma}$, we also have $$\frac{r_{\min}}{1-\gamma} \leq \underline{Q}^*(s,a) \leq \widetilde{Q}^*(s,a)$$ so $a$ is a safe action. Taking $C$ as in inequality [eqn:theory-C]{reference-type="eqref" reference="eqn:theory-C"} guarantees that $\underline{Q}^*(s,a) > \underline{Q}^*(s,a')$, so the greedy policy $\pi^*$ will choose $a$ over $a'$. ◻ :::

This theorem gives us a way to establish safety using only short-horizon predictions. The conclusion conditionally holds for any state $s$, but for $s$ far from the observed states, we expect that $\widehat{T}(s,a)$ likely has to contain many states in order to satisfy the assumption that it contains the true next state, so that $\underline{Q}^*(s,a)$ will be very small and we may not have any action such that $\underline{Q}^*(s,a) \ge \frac{r_{\min}}{1-\gamma}$. However, it is plausible to believe that there can be such an $a$ for the set of states in the replay buffer, $\{s : (s,a,r,s') \in \mathcal{D}\}$.

Practical algorithm

[[alg:practical]]{#alg:practical label="alg:practical"}

Input: rollout horizon $H$. Initialize empty buffers $\mathcal{D}$ and $\widehat{\mathcal{D}}$, an ensemble of probabilistic dynamics models $\{\widehat{T}_{\theta_i}\}_{i=1}^N$, policy $\pi_{\phi}$, and critic $Q_{\psi}$.

Collect initial data using a random policy and add it to $\mathcal{D}$. Then repeat:

1. Collect an episode using $\pi_{\phi}$; add the samples to $\mathcal{D}$. Let $\ell$ be the length of the episode.
2. Re-fit the models $\{\widehat{T}_{\theta_i}\}_{i=1}^N$ by several epochs of SGD on $L_{\widehat{T}}(\theta_i)$ defined in [eqn:model-obj]{reference-type="eqref" reference="eqn:model-obj"}.
3. Compute the empirical $r_{\min}$ and $r_{\max}$, and update $C$ according to [eqn:theory-C]{reference-type="eqref" reference="eqn:theory-C"}.
4. Sample $s \sim \mathcal{D}$. Starting from $s$, roll out $H$ steps using $\pi_\phi$ and $\{\widehat{T}_{\theta_i}\}$; add the samples to $\widehat{\mathcal{D}}$.
5. Draw samples from $\mathcal{D}\cup \widehat{\mathcal{D}}$. Update $Q_{\psi}$ by SGD on $L_Q(\psi)$ defined in [eqn:q-obj]{reference-type="eqref" reference="eqn:q-obj"} and the target parameters $\bar\psi$ according to [eqn:target-update]{reference-type="eqref" reference="eqn:target-update"}. Update $\pi_{\phi}$ by SGD on $L_\pi(\phi)$ defined in [eqn:policy-obj]{reference-type="eqref" reference="eqn:policy-obj"}.

Based (mostly) on the framework described in the previous section, we develop a deep model-based RL algorithm. We build on practices established in previous deep model-based algorithms, particularly MBPO [@janner2019trust], a state-of-the-art model-based algorithm (which does not emphasize safety).

The algorithm, dubbed Safe Model-Based Policy Optimization (SMBPO), is described in Algorithm [alg:practical]{reference-type="ref" reference="alg:practical"}. It follows a common pattern used by online model-based algorithms: alternate between collecting data, re-fitting the dynamics models, and improving the policy.

Following prior work [@chua2018deep; @janner2019trust], we employ an ensemble of (diagonal) Gaussian dynamics models $\{\widehat{T}_{\theta_i}\}_{i=1}^N$, where $\widehat{T}_{\theta_i}(s, a) = \mathcal{N}(\mu_{\theta_i}(s,a), \mathop{\mathrm{diag}}(\sigma_{\theta_i}^2(s,a)))$, in an attempt to capture both aleatoric and epistemic uncertainties. Each model is trained via maximum likelihood on all the data observed so far: $$L_{\widehat{T}}(\theta_i) = -\mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \log \widehat{T}_{\theta_i}(s', r \mid s, a) \label{eqn:model-obj}$$ However, random differences in initialization and mini-batch order while training lead to different models. The model ensemble can be used to generate uncertainty-aware predictions. For example, a set-valued prediction can be computed using the means: $\widehat{T}(s,a) = \{\mu_{\theta_i}(s,a)\}_{i=1}^N$.
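For illustration, the sketch below shows what one such ensemble member and its negative log-likelihood loss might look like in PyTorch; the architecture, hidden size, and clamping range are illustrative assumptions, not the exact settings used in our experiments.

```python
import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    """One ensemble member: a diagonal Gaussian over (next state, reward)."""

    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, state_dim + 1)      # mean of (s', r)
        self.log_std = nn.Linear(hidden, state_dim + 1)   # log-std of (s', r)

    def forward(self, state, action):
        h = self.body(torch.cat([state, action], dim=-1))
        return self.mean(h), self.log_std(h).clamp(-10.0, 2.0)

def model_nll_loss(model, state, action, next_state, reward):
    """Negative log-likelihood of (s', r) under the model's diagonal Gaussian."""
    target = torch.cat([next_state, reward.unsqueeze(-1)], dim=-1)
    mean, log_std = model(state, action)
    dist = torch.distributions.Normal(mean, log_std.exp())
    return -dist.log_prob(target).sum(dim=-1).mean()
```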

The models are used to generate additional samples for fitting the $Q$ function and updating the policy. In MBPO, this takes the form of short model-based rollouts, starting from states in $\mathcal{D}$, to reduce the risk of compounding error. At each step in the rollout, a model $\widehat{T}_i$ is randomly chosen from the ensemble and used to predict the next state. The rollout horizon $H$ is chosen as a hyperparameter, and ideally exceeds the (unknown) $H^*$ from Assumption 1{reference-type="ref" reference="ass:fastfail"}. In principle, one can simply increase $H$ to ensure it is large enough, but this increases the opportunity for compounding error.
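A minimal sketch of such a rollout is given below; the model, policy, buffer, and `is_unsafe` interfaces are assumptions for illustration (e.g., each model is assumed to return a sampled next state and reward, as a Gaussian head like the one sketched above would).

```python
import random
import torch

@torch.no_grad()
def model_rollout(models, policy, start_states, horizon, model_buffer, is_unsafe):
    """Roll out `horizon` imagined steps from real states (sketch).

    Assumed interfaces: `models[i](s, a)` returns sampled (next_state, reward),
    `policy(s)` returns actions, `is_unsafe(s)` returns a boolean mask, and
    `model_buffer.add` stores transitions.
    """
    states = start_states
    for _ in range(horizon):
        actions = policy(states)
        model = random.choice(models)            # pick one ensemble member per step
        next_states, rewards = model(states, actions)
        dones = is_unsafe(next_states)           # imagined safety violations terminate
        model_buffer.add(states, actions, rewards, next_states, dones)
        if dones.all():
            break
        states = next_states[~dones]             # continue only from safe imagined states
```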

MBPO is based on the soft actor-critic (SAC) algorithm, a widely used off-policy maximum-entropy actor-critic algorithm [@haarnoja2018soft]. The $Q$ function is updated by taking one or more SGD steps on the objective $$\begin{aligned} L_Q(\psi) &= \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}\cup \widehat{\mathcal{D}}}\big[\big(Q_\psi(s,a) - (r + \gamma V_{\bar\psi}(s'))\big)^2\big] \label{eqn:q-obj} \\ \text{where}\quad V_{\bar\psi}(s') &= \begin{cases} -C/(1-\gamma) & s' \in \mathcal{S}_{\textup{unsafe}} \\ \mathbb{E}_{a' \sim \pi(s')}[Q_{\bar\psi}(s', a') - \alpha \log \pi_\phi(a' \mid s')] & s' \not\in \mathcal{S}_{\textup{unsafe}} \end{cases} \label{eqn:v-def}\end{aligned}$$ The scalar $\alpha$ is a hyperparameter of SAC which controls the tradeoff between entropy and reward. We tune $\alpha$ using the procedure suggested by [@haarnoja2018soft2].
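The sketch below illustrates how this TD target might be computed, with unsafe next states bootstrapped to $-C/(1-\gamma)$; the policy and critic interfaces are assumptions for illustration.

```python
import torch

@torch.no_grad()
def td_target(reward, next_state, unsafe_mask, policy, target_q, C, gamma, alpha):
    """TD target r + gamma * V(s'), with V(s') = -C/(1-gamma) for unsafe s' (sketch).

    Assumed interfaces: `policy(s)` returns (action, log_prob) and
    `target_q(s, a)` returns the target critic value (in practice the
    clipped double-Q minimum).
    """
    next_action, log_prob = policy(next_state)
    soft_value = target_q(next_state, next_action) - alpha * log_prob
    terminal_value = torch.full_like(soft_value, -C / (1.0 - gamma))
    next_value = torch.where(unsafe_mask, terminal_value, soft_value)
    return reward + gamma * next_value

def q_loss(q_net, state, action, target):
    """Mean-squared Bellman error against the (fixed) TD target."""
    return ((q_net(state, action) - target) ** 2).mean()
```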

The $\bar\psi$ are parameters of a "target" Q function which is updated via an exponential moving average towards $\psi$: $$\bar\psi \leftarrow \tau\psi + (1-\tau)\bar\psi \label{eqn:target-update}$$ for a hyperparameter $\tau \in (0,1)$ which is often chosen small, e.g., 0.005. This is a common practice used to promote stability in deep RL, originating from @lhphetsw15. We also employ the clipped double-Q method [@fujimoto2018addressing] in which two copies of the parameters ($\psi_1$ and $\psi_2$) and target parameters ($\bar\psi_1$ and $\bar\psi_2$) are maintained, and the target value in equation [eqn:v-def]{reference-type="eqref" reference="eqn:v-def"} is computed using $\min_{i=1,2} Q_{\bar\psi_i}(s',a')$.
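A minimal sketch of the target update and the clipped double-Q minimum, assuming standard PyTorch modules for the critics:

```python
import torch

@torch.no_grad()
def ema_update(target_net, online_net, tau=0.005):
    """In-place target update: psi_bar <- tau * psi + (1 - tau) * psi_bar."""
    for p_bar, p in zip(target_net.parameters(), online_net.parameters()):
        p_bar.mul_(1.0 - tau).add_(tau * p)

def clipped_double_q(target_q1, target_q2, state, action):
    """Pessimistic value used in the TD target: elementwise min of the two target critics."""
    return torch.min(target_q1(state, action), target_q2(state, action))
```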

Note that in [eqn:q-obj]{reference-type="eqref" reference="eqn:q-obj"}, we are fitting to the average TD target across models, rather than the min, even though we proved Theorem [thm:main]{reference-type="ref" reference="thm:main"} using the Bellmin operator. We found that taking the average worked better empirically, likely because the min was overly conservative and harmed exploration.

The policy is updated by taking one or more steps to minimize $$L_\pi(\phi) = \mathbb{E}_{s \sim \mathcal{D}\cup \widehat{\mathcal{D}},\, a \sim \pi_\phi(s)}[\alpha \log \pi_{\phi}(a \mid s) - Q_{\psi}(s,a)]. \label{eqn:policy-obj}$$
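A corresponding sketch of the policy update, assuming the policy returns reparameterized actions together with their log-probabilities (as in standard SAC implementations):

```python
def policy_loss(policy, q_net, states, alpha):
    """Policy objective: E[alpha * log pi(a|s) - Q(s, a)] over buffer states (sketch)."""
    actions, log_probs = policy(states)   # assumed to use the reparameterization trick
    return (alpha * log_probs - q_net(states, actions)).mean()
```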

Experiments

In the experimental evaluation, we compare our algorithm to several model-free safe RL algorithms, as well as MBPO, on various continuous control tasks based on the MuJoCo simulator [@todorov2012mujoco]. Additional experimental details, including hyperparameter selection, are given in Appendix 7.3{reference-type="ref" reference="app:impl-details"}.

Tasks

Visualizations of the four tasks: Hopper, Cheetah-no-flip, Ant, and Humanoid.{#fig:tasks width="\textwidth"}

The four tasks are Hopper, Cheetah-no-flip, Ant, and Humanoid.

For all of the tasks, the reward corresponds to positive movement along the $x$-axis (minus some small cost on action magnitude), and safety violations cause the current episode to terminate. See Figure [fig:tasks]{reference-type="ref" reference="fig:tasks"} for visualizations of the termination conditions.

Algorithms

We compare against MBPO and several model-free safe RL algorithms, including Recovery RL, LR, SQRL, and RCPO.

All of the above algorithms except for MBPO are as implemented in the Recovery RL paper [@thananjeyan2020recovery] and its publicly available codebase. We follow the hyperparameter tuning procedure described in their paper; see Appendix 7.3{reference-type="ref" reference="app:impl-details"} for more details. A recent work [@bharadhwaj2020conservative] could also serve as a baseline, but its code has not been released.

Our algorithm requires very little hyperparameter tuning. We use $\gamma = 0.99$ in all experiments. We tried both $H = 5$ and $H = 10$ and found that $H = 10$ works slightly better, so we use $H = 10$ in all experiments.

Undiscounted return of policy vs. total safety violations. We run 5 independent seeds for each algorithm. The curves indicate the mean across seeds, and the shaded areas indicate one standard deviation centered at the mean. {#fig:return_vs_violations width="\textwidth"}

Results

The main criterion in which we are interested is performance (return) vs. the cumulative number of safety violations. The results are plotted in Figure 2{reference-type="ref" reference="fig:return_vs_violations"}. We see that our algorithm performs favorably compared to model-free alternatives in terms of this tradeoff, achieving similar or better performance with a fraction of the violations.

MBPO is competitive in terms of sample efficiency but incurs more safety violations because it isn't designed explicitly to avoid them.

Performance and cumulative safety violations with varying $C$.{#fig:vary-C width="\textwidth"}

We also show in Figure [fig:vary-C]{reference-type="ref" reference="fig:vary-C"} that hard-coding the value of $C$ leads to an intuitive tradeoff between performance and safety violations. With a larger $C$, SMBPO incurs substantially fewer safety violations, although the return improves more slowly.

Related Work

Safe Reinforcement Learning

Many prior works correct the action locally, i.e., they change the action when it is detected to lead to an unsafe state. @dalal2018safe linearizes the dynamics and adds a correction layer on top of the policy. @bharadhwaj2020conservative uses rejection sampling to ensure that actions meet the safety requirement. @thananjeyan2020recovery either trains a backup policy which is used only to guarantee safety, or uses model-predictive control (MPC) to find the best action sequence. MPC could also be applied in the short-horizon setting that we consider here, but it incurs a high runtime cost that may not be acceptable for real-time robotics control. Moreover, MPC only optimizes the reward within the short horizon and can lead to suboptimal performance on tasks that require longer-term considerations [@tamar2017learning].

Other works aim to solve the constrained MDP formulation more efficiently, with Lagrangian methods being widely applied. The Lagrangian multipliers can be a fixed hyperparameter, or adjusted by the algorithm [@tessler2018reward; @stooke2020responsive]. Policy training itself can also cause problems: the policy may change so quickly that it is no longer safe, which is addressed by constraining updates to a trust region of policies [@achiam2017constrained; @zanger2021safe] and further projecting to a safer policy [@yang2020accelerating]; overly optimistic policies are addressed by @bharadhwaj2020conservative via conservative policy updates. Expert information can greatly improve training-time safety: @srinivasan2020learning and @thananjeyan2020recovery are provided offline data, while @turchetta2020safe is provided interventions which are invoked at dangerous states, achieving zero safety violations during training.

Returnability is also considered in practice by @eysenbach2018leave, who train a policy to return to the initial state, and in theory by @roderick2021provably, who design a PAC algorithm to train a policy without safety violations. @bansal2017hamilton gives a brief overview of Hamilton-Jacobi reachability and its recent progress.

Model-based Reinforcement Learning

Model-based reinforcement learning, which additionally learns the dynamics model, has gained popularity due to its superior sample efficiency. @kurutach2018model uses an ensemble of models to produce imaginary samples to regularize learning and reduce instability. The use of model ensembles is further explored by @chua2018deep, which studies different methods of sampling trajectories from the model ensemble. Building on @chua2018deep, @wang2019exploring combines policy networks with online learning. @luo2019algorithmic derives a lower bound on a policy's performance in the real environment given its performance in the learned dynamics model, and then optimizes the lower bound stochastically. Our work is based on @janner2019trust, which shows that the learned dynamics model does not generalize well over long horizons and proposes to use short model-generated rollouts instead of full episodes. @dong2020expressivity studies the expressivity of the $Q$ function and the model, and shows that in some environments the model is much easier to learn than the $Q$ function.

Conclusion

We consider the problem of safe exploration in reinforcement learning, where the goal is to discover a policy that maximizes the expected return while incurring minimal safety violations during training. In this work, we assume access to a user-specified function which can be queried to determine whether or not a given state is safe. We have proposed a model-based algorithm that can exploit this information to anticipate safety violations before they happen and thereby avoid them. Our theoretical analysis shows that safety violations can be avoided with a sufficiently large penalty and a sufficiently accurate dynamics model. Empirically, our algorithm compares favorably to state-of-the-art model-free safe exploration methods in terms of the tradeoff between performance and total safety violations, as well as in terms of sample complexity.

Acknowledgements {#acknowledgements .unnumbered}

TM acknowledges support of Google Faculty Award, NSF IIS 2045685, the Sloan Fellowship, and JD.com. YL is supported by NSF, ONR, Simons Foundation, Schmidt Foundation, DARPA and SRC.

Appendix

Proofs {#app:proofs}

::: {.proof} Proof of Lemma [lemma:bellmin]{reference-type="ref" reference="lemma:bellmin"} ($\underline{\mathcal{B}}^*$ is a $\gamma$-contraction in $\infty$-norm). First observe that for any functions $f$ and $g$, $$|\max_x f(x) - \max_x g(x)| \le \max_x |f(x)-g(x)| \label{eqn:max-ineq}$$ To see this, suppose $\max_x f(x) > \max_x g(x)$ (the other case is symmetric) and let $\tilde{x} = \arg\max_x f(x)$. Then $$|\max_x f(x) - \max_x g(x)| = f(\tilde{x}) - \max_x g(x) \le f(\tilde{x}) - g(\tilde{x}) \le \max_x |f(x)-g(x)|$$ We also note that [eqn:max-ineq]{reference-type="eqref" reference="eqn:max-ineq"} implies $$|\min_x f(x) - \min_x g(x)| \le \max_x |f(x)-g(x)|$$ since $\min_x f(x) = -\max_x (-f(x))$. Thus for any $Q, Q' : \mathcal{S}\times \mathcal{A}\to \mathbb{R}$, $$\begin{aligned} \|\underline{\mathcal{B}}^*Q - \underline{\mathcal{B}}^*Q'\|_\infty &= \sup_{s,a} |\underline{\mathcal{B}}^*Q(s,a) - \underline{\mathcal{B}}^*Q'(s,a)| \\ &= \gamma \sup_{s,a} \left|\min_{s' \in \widehat{T}(s,a)} \max_{a'} Q(s',a') - \min_{s' \in \widehat{T}(s,a)} \max_{a'} Q'(s',a')\right| \\ &\le \gamma \sup_{s,a} \max_{s' \in \widehat{T}(s,a)} \left|\max_{a'} Q(s',a') - \max_{a'} Q'(s',a')\right| \\ &\le \gamma \sup_{s',a'} |Q(s',a') - Q'(s',a')| \\ &= \gamma \|Q - Q'\|_\infty\end{aligned}$$ Hence $\underline{\mathcal{B}}^*$ is indeed a $\gamma$-contraction. ◻ :::

Extension to stochastic dynamics {#app:stochastic}

Here we outline a possible extension to stochastic dynamics, although we leave experiments with stochastic systems for future work.

First, let us modify the definitions to accommodate stochastic dynamics:

Our rapid failure assumption must also be extended: There exists a horizon $H$ and threshold $q$ such that if $(s,a)$ is $p$-irrecoverable, then for any sequence of actions $\{a_t\}_{t=0}^\infty$ with $a_0 = a$, the probability of encountering an unsafe state within $H$ steps is at least $q$. (Note that necessarily $q \le p$.)

Analysis

Let $s$ be a $p$-safe state, and let $a$ and $a'$ be actions where $a$ is $p$-safe but $a'$ is $p$-irrecoverable[^4]. We want to have $\widetilde{Q}^*(s,a) > \widetilde{Q}^*(s,a')$ so that the greedy policy w.r.t. $\widetilde{Q}^*$, which is an optimal policy for $\widetilde{M}_C$, will only take $p$-safe actions. Our strategy is to bound $\widetilde{Q}^*(s,a')$ from above and $\widetilde{Q}^*(s,a)$ from below, then choose $C$ to make the desired inequality hold.

We consider $a'$ first, breaking it down into two cases:

From the reasoning above, we obtain $$\begin{aligned} \widetilde{Q}^*(s,a') &\le \mathbb{P}(\text{unsafe within $H$ steps}) \cdot (\text{max return} \mid \text{unsafe within $H$ steps}) \, + \\ &\quad\ \mathbb{P}(\text{safe for $H$ steps}) \cdot (\text{max return} \mid \text{safe for $H$ steps}) \\ &\le \max\{qR_C, R_C\} + (1-q)\frac{r_{\max}}{1-\gamma}\end{aligned}$$ Now consider $a$. Since $(s,a)$ is $p$-safe, $$\begin{aligned} \widetilde{Q}^*(s,a) &\ge \mathbb{P}(\text{unsafe}) \cdot (\text{min reward} \mid \text{unsafe}) + \mathbb{P}(\text{safe}) \cdot (\text{min reward} \mid \text{safe}) \\ &\ge p \left(\frac{-C}{1-\gamma}\right) + (1-p) \frac{r_{\min}}{1-\gamma} \\ &= \frac{-p C + (1-p)r_{\min}}{1-\gamma}\end{aligned}$$ Note that the second step assumes $C \ge 0$. (We will enforce this constraint when choosing $C$.)

To ensure $\widetilde{Q}^*(s,a) > \widetilde{Q}^*(s,a')$, it suffices to choose $C$ so that the following inequalities hold simultaneously: $$\begin{aligned} \frac{-pC + (1-p)r_{\min}}{1-\gamma} &> qR_C + (1-q)\frac{r_{\max}}{1-\gamma} \label{eq:ineq1} \\ \frac{-pC + (1-p)r_{\min}}{1-\gamma} &> R_C + (1-q)\frac{r_{\max}}{1-\gamma} \label{eq:ineq2}\end{aligned}$$ Multiplying both sides of [eq:ineq1]{reference-type="eqref" reference="eq:ineq1"} by $1-\gamma$ gives the equivalent $$-pC + (1-p)r_{\min} > qr_{\max}(1-\gamma^H) - qC\gamma^H + (1-q)r_{\max}$$ Rearranging, we need $$C > \frac{r_{\max}(1-q\gamma^H) - (1-p)r_{\min}}{q\gamma^H-p} =: \alpha_1$$ Similarly, multiplying both sides of [eq:ineq2]{reference-type="eqref" reference="eq:ineq2"} by $1-\gamma$ gives the equivalent $$-pC + (1-p)r_{\min} > r_{\max}(1-\gamma^H) - C\gamma^H + (1-q)r_{\max}$$ Rearranging, we need $$C > \frac{r_{\max}(2-q-\gamma^H) - (1-p)r_{\min}}{\gamma^H-p} =: \alpha_2$$ All things considered, the inequality $\widetilde{Q}^*(s,a) > \widetilde{Q}^*(s,a')$ holds if we set $$C > \max\{\alpha_1, \alpha_2, 0\}$$

Implementation details and hyperparameters {#app:impl-details}

In this appendix we provide additional details regarding the algorithmic implementation, including hyperparameter selection.

Here are some additional details regarding the (S)MBPO implementation:

The model-free algorithms have their own hyperparameters, but all share $\gamma_{\text{safe}}$ and $\epsilon_{\text{safe}}$. Following @thananjeyan2020recovery, we tune $\gamma_{\text{safe}}$ and $\epsilon_{\text{safe}}$ for recovery RL first, then hold those fixed for all algorithms and tune any remaining algorithm-specific hyperparameters. All these hyperparameters are given in the tables below:

| Name | Which algorithm(s)? | Choices | hopper | cheetah | ant | humanoid |
|---|---|---|---|---|---|---|
| $\gamma_{\text{safe}}$ | all | 0.5, 0.6, 0.7 | 0.6 | 0.5 | 0.6 | 0.6 |
| $\epsilon_{\text{safe}}$ | all | 0.2, 0.3, 0.4 | 0.3 | 0.2 | 0.2 | 0.4 |
| $\nu$ | LR | 1, 10, 100, 1000 | 1000 | 1000 | 1 | 1 |
| $\nu$ | SQRL | 1, 10, 100, 1000 | 1 | 1000 | 10 | 1 |
| $\lambda$ | RCPO | 1, 10, 100, 1000 | 10 | 10 | 1 | 10 |

We run our experiments using a combination of NVIDIA GeForce GTX 1080 Ti, TITAN Xp, and TITAN RTX GPUs from our internal cluster. A single run of (S)MBPO takes as long as 72 hours on a single GPU.

[^1]: Determinism makes safety essentially trivial in tabular MDPs. We focus on tasks with continuous state and/or action spaces. See Appendix 7.2{reference-type="ref" reference="app:stochastic"} for a possible extension of our approach to stochastic dynamics.

[^2]: $\mathcal{P}(X)$ is the powerset of a set $X$.

[^4]: Note that, as a consequence of the definitions, any action which is $p'$-safe with $p' < p$ is also $p$-safe, and similarly any action which is $p'$-irrecoverable with $p' > p$ is also $p$-irrecoverable.