Preventing Imitation Learning with Adversarial Policy Ensembles

http://arxiv.org/abs/2002.01059v2

Abstract: Imitation learning can reproduce policies by observing experts, which poses a problem regarding policy privacy. Policies, whether human or deployed on robots, can be cloned without consent from their owners. How can we protect against external observers cloning our proprietary policies? To answer this question we introduce a new reinforcement learning framework, where we train an ensemble of near-optimal policies whose demonstrations are guaranteed to be useless for an external observer. We formulate this idea as a constrained optimization problem, where the objective is to improve proprietary policies while simultaneously deteriorating the virtual policy of an eventual external observer. We design a tractable algorithm to solve this new optimization problem by modifying the standard policy gradient algorithm. Our formulation can be interpreted through the lenses of confidentiality and adversarial behaviour, which enables a broader perspective of this work. We demonstrate the existence of "non-clonable" ensembles, providing a solution to the above optimization problem, which is computed by our modified policy gradient algorithm. To our knowledge, this is the first work regarding the protection of policies in Reinforcement Learning.


Introduction

Imitation learning and behavioral cloning provide a powerful way to reproduce policies, as seen in robotic tasks ([@DART; @oneshotIL; @BCLimitations; @end2endCondIL; @alvinn; @end2endSelfDriving]). Other fields in machine learning have developed methods to ensure privacy (@privateML [@PATE]); however, none have examined protection against policy cloning. In this work, we tackle the issue of protecting policies by training policies that aim to prevent an external observer from using behaviour cloning. Our approach draws inspiration from imitating human experts, who can near-optimally accomplish given tasks. The setting which we analyze is presented in Figure 1{reference-type="ref" reference="fig:scheme"}. We wish to find a collection of experts which, as an ensemble, can perform a given task well, yet is adversarial to behaviour cloning. Another interpretation is that this collection of experts represents the worst-case scenario for behaviour cloning of how to perform a task "good enough".

[Figure 1 (fig:scheme): The setting we analyze, in which an external observer collects demonstrations of the policy ensemble and attempts to behaviour clone it.]

Imitation learning frameworks generally make certain assumptions about the optimality of the demonstrations ([@maxentIRL; @controlasoptimalinference]), yet they do not consider the scenario in which the experts deliberately act adversarially toward the imitator. We pose a novel question regarding this assumption: does there exist a set of experts that is adversarial to an external observer trying to behaviour clone?

We propose Adversarial Policy Ensembles (APE), a method that simultaneously optimizes the performance of the ensemble and minimizes the performance of policies eventually obtained by cloning it. Our experiments show that APE suffers little performance loss relative to an optimal policy, while causing the cloned policy to experience, on average, over $5$ times degradation compared to the optimal policy.

Our main contributions can be summarized as follows:

To our knowledge, not only is this the first work regarding the protection of policies in reinforcement learning, but it is also the first to model adversarial experts.

Preliminaries

We develop APE in the standard framework of Reinforcement Learning (RL). The main components we use are Markov Decision Processes, Policy Gradient (@PG), policy ensembles, and behaviour cloning, which we review below.

Markov Decision Process

A discrete-time finite-horizon discounted Markov decision process (MDP) $\mathcal{M}$ is defined by $(\mathcal{S}, \mathcal{A}, r, p, p_0, \gamma, T)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function, $p(s_{t+1} \mid s_t, a_t)$ is the transition probability distribution, $p_0 : \mathcal{S} \rightarrow \mathbb{R}^{+}$ is the initial state distribution, $\gamma \in (0, 1)$ is the discount factor, and $T$ is the time horizon. A trajectory $\tau \sim \rho_\pi$, sampled from $p$ and a policy $\pi : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}^+$, is the tuple of states and actions $(s_0, a_0, \ldots, s_{T-1}, a_{T-1}, s_T)$, whose distribution is characterized by $\rho_\pi$. The return of a trajectory, $r(\tau) = \sum_{t=0}^{T-1} \gamma ^{t} r(s_{t}, a_{t})$, is the sum of discounted rewards seen along the trajectory, and the value function $V^\pi : \mathcal{S} \rightarrow \mathbb{R}$ is the expected return of a trajectory starting from state $s$, under the policy $\pi$. The goal of reinforcement learning is to find a policy that maximizes the expected return $\mathop{\mathrm{\mathbb{E}}}_{\tau \sim \rho_\pi} [r(\tau)]$.
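To make the definition of the return concrete, here is a minimal Python sketch (the `rewards` list and the `gamma` default are illustrative, not from the paper):

```python
def discounted_return(rewards, gamma=0.99):
    """r(tau) = sum_{t=0}^{T-1} gamma^t * r(s_t, a_t) for one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# e.g. a 3-step trajectory with reward -1 per step:
# discounted_return([-1, -1, -1])  ->  -2.9701
```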

Policy Gradient

Policy Gradient (PG) methods (@PG) aim to directly learn the optimal policy $\pi$, parameterized by $\theta$, by repeatedly estimating the gradient of the expected return, in one of the many forms shown in @gae. In our work, we follow notation similar to that of @gae [@PPO] and estimate $\nabla_\theta \mathop{\mathrm{\mathbb{E}}}_{\tau \sim \rho_\pi}[r(\tau)]$ using the advantage, which is estimated from a trajectory $\tau$ as $A^\pi_\tau (t) = R_\tau (t) - V^\pi (s_t)$, where $R_\tau (t) = \sum_{t'=t}^{T-1} \gamma ^{t'} r(s_{t'}, a_{t'})$ is the discounted sum of rewards following action $a_t$.

Here, the value function is learned simultaneously with the policy, and so the advantage uses $\hat{V}^\pi$ as an estimate for $V^\pi$.
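As a small sketch of this advantage estimate, using the standard reward-to-go form $\sum_{t' \ge t} \gamma^{t'-t} r(s_{t'}, a_{t'})$ and assuming per-step `rewards` and value estimates `values` of equal length (both illustrative inputs):

```python
import numpy as np

def advantage_estimates(rewards, values, gamma=0.99):
    """A_tau(t) = R_tau(t) - V_hat(s_t), with R_tau(t) the discounted reward-to-go."""
    T = len(rewards)
    reward_to_go = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        reward_to_go[t] = running
    return reward_to_go - np.asarray(values)
```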

Policy Ensemble (PE)

We denote a PE by $\pi_{\textbf{c}}$, where each $\pi_{c^{(i)}}, i \in \{ 1, 2, \ldots, n \}$ represents an expert. To roll out the PE, an expert is chosen at random (in our case uniformly), and that expert completes a trajectory. Each expert policy $\pi_{c^{(i)}}(a \mid s)$ can be viewed as a policy conditioned on a latent variable $c$, $\pi(a \mid s, c)$.

Although $\pi_{\textbf{c}}$ consists of multiple policies, it is important to note that it itself is still a policy.
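A minimal sketch of rolling out a PE, assuming a generic `env` exposing `reset()` and `step(a) -> (next_state, reward, done)`, with `experts` given as callables mapping a state to an action (both are illustrative stand-ins, not the paper's code):

```python
import numpy as np

def rollout_policy_ensemble(env, experts, rng=np.random.default_rng()):
    """Roll out a PE: pick one expert uniformly at random and keep it fixed
    for the entire trajectory."""
    i = rng.integers(len(experts))       # latent context c^(i), fixed per trajectory
    s, done, trajectory = env.reset(), False, []
    while not done:
        a = experts[i](s)
        s_next, r, done = env.step(a)
        trajectory.append((s, a, r))
        s = s_next
    return i, trajectory
```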

Behaviour Cloning

To behaviour clone an expert policy (@ILOriginal), a dataset $\mathcal{D}$ of state-action pairs $(s, a)$ is collected from the expert rollouts. Then, a policy parameterized by $\phi$ is trained by maximizing the likelihood of an action given a state, $\sum_{(s, a) \in \mathcal{D}} \log \pi_\phi (a \mid s)$.
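As a hypothetical illustration of this maximum-likelihood objective in the tabular case, using plain gradient ascent on softmax logits (only the learning rate and epoch count echo the appendix; the rest is our own sketch):

```python
import numpy as np

def behaviour_clone(dataset, n_states, n_actions, lr=0.01, epochs=100):
    """Tabular behaviour cloning: gradient ascent on the mean log-likelihood
    sum_{(s,a)} log pi_phi(a|s), with pi_phi(.|s) = softmax(logits[s])."""
    logits = np.zeros((n_states, n_actions))
    for _ in range(epochs):
        grad = np.zeros_like(logits)
        for s, a in dataset:
            probs = np.exp(logits[s] - logits[s].max())
            probs /= probs.sum()
            grad[s] -= probs          # d/d logits[s] of log-softmax, part 1
            grad[s, a] += 1.0         # part 2: one-hot of the demonstrated action
        logits += lr * grad / len(dataset)
    return logits
```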

When cloning $\pi_{\textbf{c}}$, $\mathcal{D}$ will not contain information of the latent variable $c$, and so the cloned policy will marginalize it out. Thus, the observer will clone:

$$\begin{aligned} \label{eqn:observe} \pi_o({a} \mid {s}) \vcentcolon= \sum_i p(c^{(i)}\mid s ) \pi_{c^{(i)}}( a \mid s)\end{aligned}$$

We stress that this policy does not exist until $\pi_{\textbf{c}}$ is behaviour cloned. $\pi_o$ is a fictitious policy representing the observer's best case: having access to infinite data from $\pi_{\textbf{c}}$ to clone into $\pi_o$.
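In a tabular setting, the marginalization in Equation [eqn:observe]{reference-type="ref" reference="eqn:observe"} can be written directly; a minimal sketch with array shapes chosen for illustration:

```python
import numpy as np

def observer_policy(pi_c, p_c_given_s):
    """pi_o(a|s) = sum_i p(c^(i)|s) * pi_{c^(i)}(a|s).
    pi_c: (n_experts, n_states, n_actions); p_c_given_s: (n_states, n_experts)."""
    return np.einsum('si,isa->sa', p_c_given_s, pi_c)
```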

The scope of this paper is to specifically prevent behavioral cloning from succeeding. Other imitation learning approaches such as inverse reinforcement learning (@IRL [@algoIRL; @nonlinear]) and adversarial imitation learning (@GAIL [@VDB]) require rollouts of non-expert policies in the environment, which may be costly, and thus are not considered.

Related Work {#sec:related}

Adversarial Attacks in RL: Our notion of adversarial policies is closely related to other adversarial methods that target RL, such as @DRLAttack and @vulnerablePolicies, which add adversarial perturbations to the policy input during training. Other adversarial attacks include poisoning the batch of data used when training RL (@poison), and exploitation in the multi-agent setting (@gleaveAttackRL). However, these methods are all active attacks on various learning techniques. Our method, instead, passively protects against cloning.

Privacy in RL: With regards to protection, our work is related to differential privacy (@privateML). Differential privacy in RL can be used to create private Q-functions (@privateQ) or private policies (@privatePolicy), which have private reward functions or private policy evaluation. However, we emphasize that our motivation is to prevent cloning, and thus to protect the policies themselves, rather than to protect against distinguishing between reward functions or policies.

Imitation Learning: We comply with the standard imitation learning setting of cloning from a dataset in which many experts provide the demonstrations, and latent variables in imitation learning are well studied. For example, @end2endCondIL show that conditioning on a context representation can make imitation learning a viable option for autonomous driving. @infogail demonstrate that the latent contextual information in expert trajectories is often semantically meaningful. Providing extra context variables to condition on also appears in the forms of extra queries or provided labels ([@RiskAwareIRL; @causal_imitation_learning; @latentFromDemo]). Our method is different, as we use context variables to prevent imitation learning while learning the policies from scratch, rather than using context variables to improve imitation learning.

Multiple Policies: VALOR, DIAYN, and DADS (@valor [@diayn; @DADS]) have similar schemes of sampling a latent variable and fixing it throughout a trajectory, although their latent variables (contexts or skills) are used to solve semantically different tasks, because their objective is to use the context variables/skills for learning in an unsupervised setting. Our approach differs in both motivation and implementation, as we learn experts that all solve the same task, constrained so that observers cannot clone the policy.

A PE $\pi_{\textbf{c}}$ can also be viewed as a Mixture of Experts (@mixexp), except the gating network assigns probability $1$ to the same expert for an entire trajectory. As such, we do not learn the gating network, although it may still be useful to see $\pi_{\textbf{c}}$ as a special case of a mixture of experts where the gating network learns immediately to fix the expert for each trajectory. There are also methods such as OptionGAN (@optiongan), which uses a mixture of experts model to learn multiple policies as options with access to only expert states.

@novel also proposes a method to train multiple policies that complete the same task, but uses the uncertainty of an autoencoder as a reward augmentation. Their motivation is to find multiple novel policies, while our motivation has no connection to novelty. Due to these differences in motivation, they train each policy one after the other, while our policies are trained simultaneously.

Policy ensembles are also used in the multi-task and goal conditioned settings in which case the task that is meant to be solved can be viewed as the context. Marginalizing out the context variable (Equation [eqn:observe]{reference-type="ref" reference="eqn:observe"}) of these context-conditioned policies is studied in the case of introducing a KL divergence regularizing term for learning new tasks (@InfoBot) and for sharing/hiding goals (@LSH). However, the main motivation is different in that both @InfoBot and @LSH use $\pi_o$ to optimize mutual information, while we directly optimize its performance.

Method

Objective

We wish to have experts that can perform the task, while minimizing the possible returns of the cloned policy, denoted in Equation [eqn:observe]{reference-type="ref" reference="eqn:observe"}. We modify the standard RL objective to be: $$\begin{gathered} \label{eqn:constrained} \mathop{\mathrm{arg\,min}}_\theta \mathop{\mathrm{\mathbb{E}}}_{\tau \sim \rho_{\pi_o}}[r(\tau)] \\ \text{s.t.~~} \mathop{\mathrm{\mathbb{E}}}_{\tau \sim \rho_{\pi_{\textbf{c}}}}[r(\tau)] \geq \alpha \end{gathered}$$

where $\alpha$ is a parameter that lower bounds the reward of the policy ensemble. This translates to maximizing the unconstrained Lagrangian: $$\begin{gathered} \label{eqn:unconstrained} J(\theta) = \mathop{\mathrm{\mathbb{E}}}_{\tau \sim \rho_{\pi_{\textbf{c}}}} [r(\tau)] - \beta \mathop{\mathrm{\mathbb{E}}}_{\tau \sim \rho_{\pi_o}} [r(\tau)]\end{gathered}$$

where $1/\beta$ is the corresponding Lagrange multiplier, which is subsumed into the returns collected by the policy ensemble. We refer to a PE that optimizes this objective as an Adversarial Policy Ensemble (APE). There is a natural interpretation of the objective in Equation [eqn:constrained]{reference-type="ref" reference="eqn:constrained"}: human experts tend to be "good enough", which is reflected in the constraint, and the minimization simply finds the most adversarial experts.

Although we assume that the observer can only map states to actions, it may be the case that they can train a sequential policy, which depends on previous states and actions. Our method generalizes to sequential policies as well, and the impact of such observers is discussed in Section 6{reference-type="ref" reference="sec:dicussion"}.

Modified Policy Gradient Algorithm

Intuitively, since the returns of two policies are being optimized, both policies should be sampled from in order to estimate their returns.

We show how we can modify PG to train APE by maximizing Equation [eqn:unconstrained]{reference-type="ref" reference="eqn:unconstrained"}. The two terms suggest a simple scheme to estimate the returns of the policy ensemble twice: once using $\pi_{\textbf{c}}$, which we wish to maximize, and a second time using $\pi_o$, which approximates the returns of an eventual observer who tries to clone the policy ensemble. Along with our PE, we train value functions $\Tilde{V}^{\pi_{c^{(i)}}}$ for each expert, jointly parameterized by $\phi$, which estimate $V^{\pi_{c^{(i)}}} - \beta V^{\pi_o}$. The loss function for the value functions of two sampled trajectories $\tau_1, \tau_2$ is

$$\begin{gathered} \label{sec: returns} J_{\tau_1, \tau_2} (\phi) = \sum_{t=0}^{T_1-1} \frac{1}{2} \left(\Tilde{V}_\phi^{\pi_{c^{(i)}}} (s_{t_1}) - R_{\tau_1}(t) \right)^2 + \sum_{t=0}^{T_2-1} \frac{1}{2} \left(\Tilde{V}_\phi^{\pi_{c^{(i)}}} (s_{t_2}) + \beta R_{\tau_2}(t) \right)^2 \end{gathered}$$

The policy gradient update from $N_1$ and $N_2$ trajectories is then $$\begin{gathered} \label{sec: pgupdate} \nabla_\theta J_{\tau_1, \tau_2} (\theta) \approx G_1 + G_2\end{gathered}$$ where $$\begin{gathered} \label{sec: g1} G_1 = \frac{1}{N_1} \sum_{j=1}^{N_1} \sum_{t=0}^{T_1} \nabla_\theta \log \pi_{c^{(i)}}(a_{t_1}^{(j)} \mid s_{t_1}^{(j)}) \, \Tilde{A}^{\pi_{c^{(i)}}}_{\tau_1} (t) \end{gathered}$$ $$\begin{gathered} \label{sec: g2} G_2 = \frac{1}{N_2} \sum_{j=1}^{N_2} \sum_{t=0}^{T_2} \nabla_\theta \log \pi_o(a_{t_2}^{(j)} \mid s_{t_2}^{(j)}) \, \Tilde{A}^{\pi_o}_{\tau_2} (t) \end{gathered}$$

where $c^{(i)}$ identifies the chosen expert of the trajectory, and $\Tilde{A}^{\pi_{c^{(i)}}}_{\tau_1} (t) = R_{\tau_1} (t) - \Tilde{V}^{\pi_{c^{(i)}}} (s_t)$ and $\Tilde{A}^{\pi_o}_{\tau_2} (t) = -\beta R_{\tau_2} (t) - \Tilde{V}^{\pi_o} (s_t)$ are the modified advantage functions. The $-\beta$ in the advantage of $G_2$ optimizes against the performance of the observed policy $\pi_o$.

The gradient $G_1$ for $\pi_{\textbf{c}}$ is straightforward. However, to estimate the gradient $G_2$ for $\pi_o$, which is a fictitious policy, we sample from it by first re-sampling the context of the expert at each state, and then sampling an action from that expert. Back-propagation then flows through $\pi_{c^{(i)}}(a \mid s)$ for the context sampled at each state. Practical implementation details can be found in [9.2](#sec: appendEst){reference-type="ref" reference="sec: appendEst"}. The intuition is as follows: while sampling $\pi_o$, if a selected action leads to high return, we decrease its probability, which lowers the expected reward of $\pi_o$. Combined, the two gradients cause the PE to select actions that both achieve high reward and are detrimental to the observer.
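A short sketch of this sampling procedure, under the same illustrative tabular representation used earlier (array names and shapes are our own):

```python
import numpy as np

def sample_observer_action(state, pi_c, p_c_given_s, rng=np.random.default_rng()):
    """Sample from the fictitious pi_o: re-sample a context at the current state,
    then sample that expert's action. The returned context index tells us which
    expert's log pi_{c^(i)}(a|s) the G_2 gradient should flow through."""
    i = rng.choice(len(pi_c), p=p_c_given_s[state])
    a = rng.choice(pi_c.shape[-1], p=pi_c[i, state])
    return i, a
```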

Equations [[sec: returns]](#sec: returns){reference-type="ref" reference="sec: returns"} and [[sec: pgupdate]](#sec: pgupdate){reference-type="ref" reference="sec: pgupdate"} formulate our PG approach of APE, which is summarized in Algorithm [alg:CAPE]{reference-type="ref" reference="alg:CAPE"}.

[[alg:CAPE]]{#alg:CAPE label="alg:CAPE"}

for each iteration do:
Generate trajectories ${\tau}_1$ with $\pi_{\textbf{c}}$ from $\mathcal{M}$ for Equation [[sec: g1]](#sec: g1){reference-type="ref" reference="sec: g1"}.
Generate trajectories ${\tau}_2$ with $\pi_o$ from $\mathcal{M}$ for Equation [[sec: g2]](#sec: g2){reference-type="ref" reference="sec: g2"}.
Calculate Equation [[sec: pgupdate]](#sec: pgupdate){reference-type="ref" reference="sec: pgupdate"} to perform a gradient update on the PE: $\theta \leftarrow \theta + \alpha_\theta \hat{\nabla}_\theta J_{\tau_1, \tau_2}(\theta)$.
Update the value function $\phi \leftarrow \phi - \alpha_\phi \hat{\nabla}_\phi J_{\tau_1, \tau_2} (\phi)$ as determined by Equation [[sec: returns]](#sec: returns){reference-type="ref" reference="sec: returns"}.

end for
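For concreteness, below is a simplified, runnable tabular sketch of one PG-APE iteration. It is our own illustration rather than the authors' implementation: it uses a single trajectory per term, plain gradient steps instead of Adam, assumes an environment exposing `reset()` and `step(a) -> (next_state, reward, done)` (e.g. the gridworld of the experiments), and uses `p_c_given_s` as a stand-in for the replay-buffer estimate of $p(c \mid s)$ described in the appendix.

```python
import numpy as np

def pg_ape_update(env, logits, values, p_c_given_s, beta=1.0,
                  lr_theta=0.05, lr_phi=0.5, gamma=0.99,
                  rng=np.random.default_rng()):
    """One simplified PG-APE iteration with a single trajectory per term.
    logits: (n_experts, n_states, n_actions) softmax parameters of the PE (theta).
    values: (n_experts, n_states) value estimates (phi).
    p_c_given_s: (n_states, n_experts) estimate of p(c|s)."""
    n_experts, n_states, n_actions = logits.shape

    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    def reward_to_go(rs):
        out, running = np.zeros(len(rs)), 0.0
        for t in reversed(range(len(rs))):
            running = rs[t] + gamma * running
            out[t] = running
        return out

    grad_theta = np.zeros_like(logits)
    grad_phi = np.zeros_like(values)

    # tau_1: roll out pi_c (one expert, fixed for the whole trajectory)
    i = rng.integers(n_experts)
    s, done, traj = env.reset(), False, []
    while not done:
        a = rng.choice(n_actions, p=softmax(logits[i, s]))
        s_next, r, done = env.step(a)
        traj.append((s, a, r))
        s = s_next
    R = reward_to_go([r for _, _, r in traj])
    for t, (s, a, _) in enumerate(traj):
        adv = R[t] - values[i, s]                 # modified advantage for G_1
        grad_theta[i, s] += adv * (np.eye(n_actions)[a] - softmax(logits[i, s]))
        grad_phi[i, s] += values[i, s] - R[t]     # d/dV of 0.5 (V - R)^2

    # tau_2: roll out the fictitious pi_o (context re-sampled at every state)
    s, done, traj = env.reset(), False, []
    while not done:
        j = rng.choice(n_experts, p=p_c_given_s[s])
        a = rng.choice(n_actions, p=softmax(logits[j, s]))
        s_next, r, done = env.step(a)
        traj.append((j, s, a, r))
        s = s_next
    R = reward_to_go([r for _, _, _, r in traj])
    for t, (j, s, a, _) in enumerate(traj):
        adv = -beta * R[t] - values[j, s]         # modified advantage for G_2
        grad_theta[j, s] += adv * (np.eye(n_actions)[a] - softmax(logits[j, s]))
        grad_phi[j, s] += values[j, s] + beta * R[t]   # d/dV of 0.5 (V + beta R)^2

    logits += lr_theta * grad_theta   # ascent on J(theta)
    values -= lr_phi * grad_phi       # descent on the value loss J(phi)
    return logits, values
```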

Experiments {#sec:result}

We perform experiments on a navigation task, where the objective is to reach a goal state as fast as possible. The purpose is to illustrate that an APE can cause the cloned policy to take significantly longer to reach the goal state. We do so by first training a PE and behaviour cloning it, and then comparing the performance of the PE to that of the clone. We use a discrete environment to best demonstrate the validity of our formulation. This is because all discrete policies can be parameterized exactly, which is not true in continuous action spaces, where a Gaussian parameterization is typically used. Continuous environments would therefore require assumptions about how both the PE and the cloner parameterize their policies, as well as tackling problems of distributional drift, which we would like to avoid. With such assumptions, however, our setting can extend to the continuous domain. In our experiments, we use a $10 \times 10$ grid-world environment as our main testbed. It is large enough to exhibit behaviour that would not appear in smaller grids, while still small enough to visualize the behaviour of the APE. The discrete actions show precisely how the experts can be jointly adversarial.

Using gridworld allows for precise expected return estimates. In an environment where there is no computable analytical solution for the returns, approximation error can accumulate through estimating the returns of both the trained PE and the clone. This noise would only increase in continuous state space, where the returns of $\pi_o$ may not be tractable to estimate due to issues such as distributional drift (@dagger [@BCLimitations; @causal_imitation_learning]).

Our results answer the following questions. How much optimality is compromised? How useless can we make the cloned policy? Is it possible to prevent behaviour cloning with a non-adversarial PE?

Training {#sec:navigation}

[Figure 2 (fig:contexted): A policy ensemble trained with PG-APE on the basic gridworld; the colour scale represents the expected return of starting at a given state.]

Even though our method can compute a policy ensemble with any finite number of experts, we chose to visualize a solution with 2 experts, which is sufficient to reveal the essential properties of the method. Specifically, we train $n=2$ tabular experts with PG-APE. Our code is written in Tensorflow (@Tensorflow). Training details and hyper-parameters are in Section 9.1{reference-type="ref" reference="sec:hyperparams"} of the Appendix.

Environment

The basic environment is a $10 \times 10$ grid, with the goal state at the top-left corner. The agent spawns in a random non-goal state, and incurs a reward of $-1$ for each time-step until it reaches the goal. At the goal state, the agent no longer incurs a loss and the episode terminates. The agent is allowed five actions, $\mathcal{A} = \{$ Up, Down, Left, Right, Stay $\}$. Moving into a wall is equivalent to executing a Stay action. We choose this reward function because it gives a clear representation of the notion of "good enough", reflected in how long it takes to reach the goal state. Such a representation exemplifies how the APE can prevent an observer from cloning a good policy.
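A minimal sketch of this environment follows; the `reset()`/`step()` interface, the action ordering, the reward convention at the terminal step, and the use of the $T=100$ clipping from the appendix as an episode horizon are our assumptions.

```python
import numpy as np

class GridWorld:
    """10x10 navigation task: goal at the top-left corner (state index 0),
    reward of -1 per time-step, moving into a wall acts as Stay, episodes
    clipped to a horizon of 100 steps."""
    ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1), 4: (0, 0)}  # Up, Down, Left, Right, Stay

    def __init__(self, size=10, horizon=100, rng=np.random.default_rng()):
        self.size, self.horizon, self.rng = size, horizon, rng

    def reset(self):
        self.t = 0
        self.state = int(self.rng.integers(1, self.size * self.size))  # random non-goal state
        return self.state

    def step(self, action):
        row, col = divmod(self.state, self.size)
        dr, dc = self.ACTIONS[action]
        row = min(max(row + dr, 0), self.size - 1)   # moving into a wall == Stay
        col = min(max(col + dc, 0), self.size - 1)
        self.state = row * self.size + col
        self.t += 1
        done = (self.state == 0) or (self.t >= self.horizon)
        return self.state, -1.0, done
```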

Visualization

Figure 2{reference-type="ref" reference="fig:contexted"} shows an example of a PE that is trained for the basic gridworld environment. Figure [fig:cloned]{reference-type="ref" reference="fig:cloned"} shows the corresponding cloned policy, as well as a comparison to an optimal policy. The colour scale represents the expected return of starting at a given state.

In the case of an optimal policy ($\beta=0$), actions take the agent to the goal state as fast as possible. However, when $\beta > 0$, such a solution is no longer the optimum. As with $\beta=0$, the experts would like to maximize the expected reward and reach the goal state. However, to minimize the reward of the observed policy, the two expert policies must jointly learn to increase the number of steps needed for $\pi_o$ to reach the goal state. The expert policies must use adversarial behaviour while reaching the goal state, such as taking intelligent detours or staying in the same state, which is learned to hinder $\pi_o$ as much as possible. These learnt behaviours cause the cloned policy to take drastically longer to reach the goal. For example, note the two purple squares at the top-left near the goal, which indicate that the experts learn that they should not move, in order to prevent the observer from attaining reward. Even though these sub-optimal decisions are made, in expectation the experts are "not bad" and achieve an average reward of $-15.27$.

Baselines


[Figure (fig:optimal, fig:cloned): An optimal policy and the policy obtained by behaviour cloning the PG-APE ensemble.]

We use behaviour cloning to clone our PG-APE trained policies. To support our claim of preventing imitation learning even in the limit of infinite data, we collect one million timesteps of the trained PE in the environment. Further details of behaviour cloning are in the appendix. Shown in Figure [fig:cloned]{reference-type="ref" reference="fig:cloned"} are an optimal policy and the resulting cloned policy from Section 5.1{reference-type="ref" reference="sec:navigation"}.

We evaluate against other PE to show that preventing behaviour cloning is non-trivial. We use several baselines. We first test policies that have approximately the same return as our ensemble, by training PE with vanilla PG and halting early rather than running until convergence. In the Near-Optimal case, we run until the PE has expected returns that match the average achieved by our method. Conversely, "Random" policies are used as a comparison to show that it is possible to make the cloned policy perform poorly, but at the cost of the PE itself performing poorly, which is undesirable. These policies are also trained with PG, except they are stopped much earlier, when their clones match the expected returns of our PG-APE's clone. For each PG-APE, we use $n=2$ different tabular policies treated as an ensemble, which we then clone, and average across $5$ seeds. For the baselines, we hand-pick the policies, and thus only use $3$ different policies.

| Policy Ensemble | PE Return | Cloned Return | Difference |
|---|---|---|---|
| PG-APE | $-16.24 \pm 1.20$ | $-44.27 \pm 1.07$ | $-28.03$ |
| Near-Optimal PE | $-16.74 \pm 1.32$ | $-16.67 \pm 1.31$ | $+0.07$ |
| Random Policy | $-44.59 \pm 0.52$ | $-44.52 \pm 0.77$ | $+0.07$ |

[[table:PEComp]]{#table:PEComp label="table:PEComp"} Expected returns of each policy ensemble and of its behaviour-cloned policy; the difference is the cloned return minus the PE return.

As presented in Table [table:PEComp]{reference-type="ref" reference="table:PEComp"}, all other PE have an insignificant difference (returns of the cloned policy minus returns of the PE) between the performance of the PE and the cloned policy, except for our method. These empirical findings show that preventing behaviour cloning is difficult, but possible using APE.

Discussion & Future Work {#sec:dicussion}

Confidential Policies: There are promising research directions regarding the protection of policies, due to the many applications where confidentiality is crucial. As long as there is a model of the observer, our presented method provides a worst-case scenario of experts.

In our work, we focused on the case where the observer does not use the current trajectory to determine their policy. Instead, it may be the case that the observer uses a sequential policy (one that depends on its previous states and/or actions), such as an RNN to determine the context of the current expert.

Formally, the observer will no longer learn the policy formulated in Equation [eqn:observe]{reference-type="ref" reference="eqn:observe"}, which depends solely on the current state, but rather a policy that depends on the current trajectory: $$\begin{aligned} \label{eqn:observeRNN} \pi_o({a} \mid \tau_{1:t}) \vcentcolon= \sum_i p(c^{(i)}\mid \tau_{1:t} ) \pi_{c^{(i)}}( a \mid s)\end{aligned}$$ In our preliminary results, using an RNN classifier which outputs $p(c \mid \tau_{1:t})$ simply ended up with either optimal policies or crippled policies. In both cases, there was a relatively minor difference in performance between the policy ensemble and the cloned policy.

Unsurprisingly, when the observer has access to a strong enough representation for their policy, then they should be able to imitate any policy. In this case, the worst-case set of experts cannot do much to prevent the cloning. We believe that this is an exciting conclusion, and is grounds for future work.

Continuous: Although our methods are evaluated in discrete state spaces, our approach can be generalized to continuous domains.

The Monte Carlo sampling in Equation [eqn:exp_MC]{reference-type="ref" reference="eqn:exp_MC"} suggests that the use of continuous context may also be possible, given there is a strong enough function approximator to estimate the distribution of $c | s$. We see this as an exciting direction for future work, to recover the full spectrum of possible adversarial policies under the constraint of Equation [eqn:constrained]{reference-type="ref" reference="eqn:constrained"}.

The Semantics of Reward: Although the minimization in Equation [eqn:constrained]{reference-type="ref" reference="eqn:constrained"} equates the success of behaviour cloning with the reward the cloned policy can achieve, this need not be the case. Uselessness may instead be defined by the expected reward the cloned policy achieves under a different reward function $\Tilde{r}$. For example, a robot that is unpredictable should not be deployed with humans. Since the $r$ functions in Equation [eqn:constrained]{reference-type="ref" reference="eqn:constrained"} are disentangled, the reward function $r$ that is minimized in Equation [eqn:constrained]{reference-type="ref" reference="eqn:constrained"} can be engineered to fit any definition of uselessness.

We can modify the objective of APE by modifying Equations [[sec: returns]](#sec: returns){reference-type="ref" reference="sec: returns"} and [[sec: pgupdate]](#sec: pgupdate){reference-type="ref" reference="sec: pgupdate"} to use a different reward function $\Tilde{r}$ in the minimization, substituting $R(t)$ for $\Tilde{R}(t) = \sum_{t'=t}^{T-1} \gamma^{t'-t} \Tilde{r} (s_{t'}, a_{t'})$. The rest of the derivation and algorithm remain the same.

We think this is an exciting direction, especially for learning all different possible representations of the worst-case experts.

Conclusion {#sec:conclusion}

We present APE along with its mathematical formulation, and show that policy gradient, a basic RL algorithm, can be used to optimize a policy ensemble that cannot be cloned. We evaluated APE against baselines to show that such adversarial behaviour does not arise without our method.

This work identifies a novel yet crucial area in Reinforcement Learning, regarding the confidentiality of proprietary policies. The essence of our approach is that a policy ensemble can achieve high return for the policy owner, while providing an external observer with a guaranteed low reward, making the proprietary ensemble useless to the observer.

The formulation of our problem setup and the algorithm are very general. In this first work we demonstrate the solution in deliberately simple environments in order to better visualize the essence of our method. Our concurrent work studies thoroughly the application of our method to various domains, which is out of the scope of this introductory paper.

Acknowledgements

This work was supported in part by NSF under grant NRI-#1734633 and by Berkeley Deep Drive.

Appendix

Training Details & Hyperparameters {#sec:hyperparams}

For our training, we set $\alpha_\theta = 0.05$ and the value weight to $0.5$, use entropy regularization (@asynch) annealed from $5 \text{e}{-1}$ to $5 \text{e}{-3}$, and set the discount factor $\gamma=0.99$. Due to the contrasting gradients experienced, large batch sizes are used. In our experiments, we take 1 gradient update of Adam (@AdaM) per batch of 4096 (containing multiple trajectories), and train for $3 \text{e}6$ timesteps.

To estimate $p(c|s)$ in Equation [eqn:observe]{reference-type="ref" reference="eqn:observe"}, we use a replay buffer that keeps track of the previous $60$ contexts seen at each state.
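A minimal sketch of such a buffer (class and method names are ours, not from the paper's code):

```python
from collections import defaultdict, deque

class ContextBuffer:
    """Keep the last `capacity` contexts observed at each state and estimate
    p(c|s) by their empirical frequencies."""
    def __init__(self, n_contexts, capacity=60):
        self.n_contexts = n_contexts
        self.buffers = defaultdict(lambda: deque(maxlen=capacity))

    def add(self, state, context):
        self.buffers[state].append(context)

    def estimate(self, state):
        buf = self.buffers[state]
        if not buf:  # unvisited state: fall back to a uniform estimate
            return [1.0 / self.n_contexts] * self.n_contexts
        return [sum(c == i for c in buf) / len(buf) for i in range(self.n_contexts)]
```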

Estimating the quantity in Equation [eqn:observeRNN]{reference-type="ref" reference="eqn:observeRNN"} requires memory, for which we use a single GRU (@GRU) as done in @LSH, with the exception that only states are fed in as one-hot vectors. Because our environment is deterministic, state sequences capture the action sequence information. The single unit's output is then concatenated with the state, fed into a fully connected layer of 128 units, and then a soft-max, to produce the distribution $c \mid s$ over contexts.

For our behaviour cloning, we collect $1 \text{e}6$ state-action pairs, and train a tabular policy with a $0.01$ learning rate on the softmax cross-entropy loss for $100$ epochs. The large amount of data and number of epochs ensures that we can recover $\pi_o$ with little to no variance.

To solve for the precise returns of the policies, we inject noise of $1 \text{e}{-9}$ to ensure that a hitting time always exists from each state. As well, we clip all hitting times to $T = 100$.

Estimating $\nabla_\theta \log \pi_o$ {#sec: appendEst}

It is not obvious how $\nabla_\theta \log \pi_o$ should be estimated, since $\pi_o$ is never realized until the policy ensemble is cloned; it is, in effect, a virtual policy.

Equation [eqn:observe]{reference-type="ref" reference="eqn:observe"} offers a straightforward method to back-propagate, similar to that of the Mixture of Experts model (@mixexp), except using an estimate of $c | s$ instead of a gating network.

However, we can also rewrite Equation [eqn:observe]{reference-type="ref" reference="eqn:observe"} as $\sum_i p(c^{(i)}\mid s) \pi_{c^{(i)}}( a \mid s) = \mathop{\mathrm{\mathbb{E}}}_{c \sim p(c \mid s)} [ \pi_{c^{(i)}}( a \mid s)]$, which results in the gradient update being:

$$\begin{aligned} \label{eqn:exp_MC} \nabla_\theta \log \pi_o(a \mid s) = \nabla_\theta \log \mathop{\mathrm{\mathbb{E}}}_{c \sim p(c \mid s)} [ \pi_{c^{(i)}}( a \mid s)]\end{aligned}$$

which suggests estimating the inner expectation by Monte Carlo sampling with a single sampled context. Empirically, we use this Monte Carlo sampling method.
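As a sketch of this single-sample estimator in the illustrative tabular representation used earlier (array names are ours): in an automatic-differentiation framework, the gradient of the returned scalar with respect to the logits would give the Monte Carlo estimate of $\nabla_\theta \log \pi_o(a \mid s)$.

```python
import numpy as np

def log_pi_o_single_sample(state, action, logits, p_c_given_s,
                           rng=np.random.default_rng()):
    """Single-sample Monte Carlo surrogate for log pi_o(a|s): sample one context
    c ~ p(c|s) and evaluate log pi_c(a|s) from that expert's softmax logits."""
    c = rng.choice(len(logits), p=p_c_given_s[state])
    row = logits[c, state]
    log_probs = row - row.max() - np.log(np.sum(np.exp(row - row.max())))
    return log_probs[action]
```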