Human-Aligned AI Summer School: A Summary

https://www.lesswrong.com/posts/bXLi3n2jrfqRwoSTH/human-aligned-ai-summer-school-a-summary

Contents

Value Learning (Daniel Filan)

Value learning aims at inferring human values from human behavior. Paul Christiano distinguishes ambitious value learning from narrow value learning:

Inverse Reinforcement Learning

Inverse Reinforcement Learning (IRL) studies which reward function best explains an observed behaviour. Two methods of IRL were discussed (the state of the art builds on top of these two, for instance using neural networks):
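To make the basic idea concrete, here is a minimal sketch of likelihood-based IRL in a one-step setting: demonstrations are actions drawn Boltzmann-rationally, i.e. P(a) ∝ exp(β · R(a)), and we recover R by maximum likelihood. The setup (three actions, β = 2.0, the particular true rewards) is invented for illustration and is not one of the specific methods from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
true_reward = np.array([1.0, 0.0, -1.0])
beta = 2.0  # rationality parameter of the demonstrator

def boltzmann(r):
    """Boltzmann-rational action distribution P(a) ∝ exp(beta * r(a))."""
    z = np.exp(beta * (r - r.max()))
    return z / z.sum()

# Sample demonstrations from the Boltzmann-rational "human".
demos = rng.choice(3, size=5000, p=boltzmann(true_reward))
counts = np.bincount(demos, minlength=3)

# Gradient ascent on the log-likelihood of the demonstrations:
# d/dr_a log L = beta * (counts[a] - N * P(a | r_hat)).
r_hat = np.zeros(3)
for _ in range(2000):
    grad = beta * (counts - counts.sum() * boltzmann(r_hat))
    r_hat += 1e-4 * grad

r_hat -= r_hat.mean()  # rewards are only identified up to a constant
```

Note that the recovered rewards match the true ones only up to an additive constant (and, more generally, up to transformations that leave behaviour unchanged) — a standard identifiability caveat in IRL.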

Beyond Inverse Reinforcement Learning

The main problem with traditional IRL is that it does not take into account deliberate interactions between a human and an AI (e.g. the human could slow down their behaviour to help the AI learn). Cooperative IRL addresses this by introducing a two-player game between the human and the AI, where both are rewarded according to the human's reward function. This incentivizes the human to teach the AI their preferences (if the human only ever chose their best action, the AI would learn the wrong distribution). Using a similar dynamic, the off-switch game encourages the AI to allow itself to be switched off.

Another difficulty when implementing IRL is that the reward function is hard to specify completely, and will often fail to capture everything the designer wants. Inverse reward design makes the AI quantify its uncertainty about states. If the AI is risk-averse, it will avoid uncertain states, for instance situations where it believes the humans did not completely define the reward function because they knew little about such situations.
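The off-switch incentive can be illustrated with a toy calculation: the robot is uncertain about the human's utility u for its proposed action. It can act now (value E[u]), switch itself off (value 0), or defer to the human, who allows the action only when u > 0 (value E[max(u, 0)]). The Gaussian belief below is an illustrative choice, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
# The robot's belief over the human's utility u for the proposed action.
u = rng.normal(loc=0.2, scale=1.0, size=100_000)

act        = u.mean()                  # act without asking
switch_off = 0.0                       # shut down unilaterally
defer      = np.maximum(u, 0).mean()   # let a rational human decide

# E[max(u, 0)] >= max(E[u], 0), so deferring dominates both alternatives:
# the robot gains by keeping its off switch available to the human.
assert defer >= max(act, switch_off)
```

The inequality is just Jensen's inequality applied to max(·, 0); the interesting caveat, discussed in the off-switch game paper, is that the argument relies on the human being rational and the robot's uncertainty being genuine.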

Agent Foundations (Abram Demski)

Abram’s first talk was about his post "Probability is Real, and Value is Complex". At the end of the talk, several people (including me) were confused about the "magic correlation" between probabilities and expected utility, and asked Abram about the meaning of his talk. From what I understood, the point was to show a counter-intuitive consequence of choosing the Jeffrey-Bolker axioms in decision theory over the Savage axioms. Because Bayesian updating can be formalized using the Jeffrey-Bolker axioms, this counter-intuitive result challenges potential agent designs that would use Bayesian updates. The second talk was more general, and addressed several problems faced by embedded agents (e.g. naturalized induction).

Bounded Rationality (Daniel Filan /​ Daniel Braun)

To make sure an AI would be able to understand humans, we need to make sure it understands their bounded rationality, i.e. how sparse information and bounded computational power limit rationality.

Information-Theoretic Bounded Rationality (Daniel Braun)

The first talk on the topic introduced a decision complexity C(A|B) that expresses the "cost" of going from a reference B to a target A (proportional to the Shannon information of A given B). Intuitively, it represents the cost of the search process that moves from a prior B to a posterior A. After some mathematical manipulation, a notion of "information cost" is introduced, and the final framework highlights a trade-off between an "information utility" and this "information cost" (for more details see here, pp. 14-18).
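The resulting trade-off has a well-known closed form: choosing a posterior p over actions to maximize E_p[U] − (1/β) · KL(p ‖ p0) yields p*(a) ∝ p0(a) · exp(β · U(a)), where β prices the information cost. A minimal sketch, with illustrative utilities and a uniform prior:

```python
import numpy as np

U = np.array([1.0, 0.8, 0.1])   # utilities of three options (illustrative)
p0 = np.full(3, 1/3)            # prior over actions (the "reference" B)

def posterior(beta):
    """Bounded-rational posterior p*(a) ∝ p0(a) * exp(beta * U(a))."""
    w = p0 * np.exp(beta * U)
    return w / w.sum()

# beta -> 0: no information budget, the agent stays at its prior.
# beta -> inf: unlimited deliberation, the agent concentrates on the argmax.
print(posterior(0.0))    # ~uniform
print(posterior(100.0))  # ~[1, 0, 0]
```

The parameter β interpolates between a zero-computation agent (acting from the prior) and a perfectly rational one, which is the formal sense in which this framework models bounded rationality.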

Human irrationality in planning (Daniel Filan)

Humans seem to exhibit a strong preference for planning hierarchically, and are "irrational" in that sense, or at least not "Boltzmann-rational" (Cundy & Filan, 2018). Hierarchical RL is a planning framework that introduces "options" into Markov Decision Processes, in which the Bellman equations still hold. State-of-the-art methods in hierarchical RL include meta-learning the hierarchy or using a two-module neural network.
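A minimal sketch of what "options" add to an MDP, on an invented example: a 1-D corridor with states 0..4 and reward 1 for reaching state 4. Besides the primitive step right, we add an option that moves right twice; value iteration still works with options treated as temporally extended actions (discounting by γ^k for an option of duration k).

```python
import numpy as np

N, gamma = 5, 0.9
V = np.zeros(N)  # state 4 is terminal, V[4] stays 0

def step_right(s):
    """Primitive action: one step right; reward 1 on reaching the goal."""
    s2 = min(s + 1, N - 1)
    r = 1.0 if s2 == N - 1 and s != N - 1 else 0.0
    return r, s2, 1  # (reward, next state, duration)

def option_right2(s):
    """Option: two primitive steps, with reward discounted internally."""
    r1, s1, _ = step_right(s)
    r2, s2, _ = step_right(s1)
    return r1 + gamma * r2, s2, 2

# Value iteration over primitives and options: the Bellman backup
# becomes V(s) = max over (r, s', k) of r + gamma**k * V(s').
for _ in range(100):
    for s in range(N - 1):
        V[s] = max(r + gamma**k * V[s2]
                   for (r, s2, k) in (step_right(s), option_right2(s)))
```

Here the option neither helps nor hurts the optimal values (V[s] = γ^(3−s) for s < 4 either way), which matches the point that options change the search structure, not the underlying MDP's optimal solution.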

Side effects (Victoria Krakovna)

Techniques aiming at minimizing negative side effects include avoiding unnecessary disruptions while achieving a goal (e.g. turning the Earth into paperclips) and designing low-impact agents (which avoid large side effects in general). To measure impact correctly, several questions must be answered:
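One common ingredient of such impact measures can be sketched very simply: compare the state the agent's action produces against the state under an inaction baseline, and charge a penalty for each feature that differs. The vacuum-world features below are invented for illustration; real proposals (e.g. relative reachability) use much richer deviation measures than feature counting.

```python
def impact_penalty(state_after_action, state_after_noop, weight=1.0):
    """Penalty proportional to the number of features that differ
    from the inaction (do-nothing) baseline."""
    changed = {k for k in state_after_noop
               if state_after_action.get(k) != state_after_noop[k]}
    return weight * len(changed)

noop  = {"dust": "present", "vase": "intact"}
clean = {"dust": "gone",    "vase": "intact"}   # goal-directed change only
smash = {"dust": "gone",    "vase": "broken"}   # extra side effect

print(impact_penalty(clean, noop))  # 1: only the task-relevant change
print(impact_penalty(smash, noop))  # 2: the side effect adds penalty
```

Even this toy version exposes the open questions: the penalty also charges for the task-relevant change, and the choice of baseline (inaction from the start vs. stepwise) changes the agent's incentives — which is exactly what the questions above are about.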

Comment

https://www.lesswrong.com/posts/bXLi3n2jrfqRwoSTH/human-aligned-ai-summer-school-a-summary?commentId=WvupWo6aZKP3pAX4N

Thanks for the summary of some of the talks! Just to avoid some unnecessary confusion, I’d like to point out that the name of the event was Human-aligned AI Summer School. A different event, AI Safety Camp, is also happening in Prague, in October. While there is a substantial overlap between both organizers and participants, the events have somewhat different goals and are geared toward slightly different target audiences. The summer school follows the format of an "academic summer school", where you have talks, coffee breaks, social events, and a similarly structured program, but usually not a substantial amount of time to do your own independent research. The camp is the complement: a lot of time to do independent research, not much structured program, no talks by senior researchers, no coffee breaks, and also no university backing. At some point we may try some mixture, but for now there are large differences. It is important to understand them and have different expectations for each event.

Comment

https://www.lesswrong.com/posts/bXLi3n2jrfqRwoSTH/human-aligned-ai-summer-school-a-summary?commentId=pnaYvEgSaLAL3CvfF

I agree that the "Camp" in the title was confusing, so I changed it to "Summer School". Thank you!

Comment

Since it was written as "Human-Aligned Summer School", I first read it as an educational experiment aimed at not making kids suffer. For some reason I find the misinterpretation hilarious.

Comment

Added "AI" to prevent death from laughter.

Comment

https://www.lesswrong.com/posts/bXLi3n2jrfqRwoSTH/human-aligned-ai-summer-school-a-summary?commentId=54ecCHkwyFZgAv65K

Typo: Cundy and Filan, 2018 (not 2008)