2020 AI Alignment Literature Review and Charity Comparison

https://www.lesswrong.com/posts/pTYDdcag9pTzFQ7vw/2020-ai-alignment-literature-review-and-charity-comparison


Introduction

As in 2016, 2017, 2018, and 2019, I have attempted to review the research that has been produced by various organisations working on AI safety, to help potential donors gain a better understanding of the landscape. This is a similar role to that which GiveWell performs for global health charities, and somewhat similar to a securities analyst with regards to possible investments. My aim is basically to judge the output of each organisation in 2020 and compare it to their budget. This should give a sense of the organisations’ average cost-effectiveness. We can also compare their financial reserves to their 2020 budgets to get a sense of urgency. I’d like to apologize in advance to everyone doing useful AI Safety work whose contributions I have overlooked or misconstrued. As ever I am painfully aware of the various corners I have had to cut due to time constraints from my job, as well as being distracted by 1) other projects, 2) the miracle of life and 3) computer games. This article focuses on AI risk work. If you think other causes are important too, your priorities might differ. This particularly affects GCRI, FHI and CSER, all of which do a lot of work on other issues, which I attempt to cover but only very cursorily.

How to read this document

This document is fairly extensive, and some parts (particularly the methodology section) are largely the same as last year, so I don’t recommend reading from start to finish. Instead, I recommend navigating to the sections of most interest to you. If you are interested in a specific research organisation, you can use the table of contents to navigate to the appropriate section. You might then also want to Ctrl+F for the organisation acronym in case they are mentioned elsewhere as well. Papers listed as ‘X researchers contributed to the following research led by other organisations’ are included in the section corresponding to their first author and you can Ctrl+F to find them. If you are interested in a specific topic, I have added a tag to each paper, so you can Ctrl+F for a tag to find associated work. The tags were chosen somewhat informally so you might want to search more than one, especially as a piece might seem to fit in multiple categories. Here are the un-scientifically-chosen hashtags:

New to Artificial Intelligence as an existential risk?

If you are new to the idea of General Artificial Intelligence as presenting a major risk to the survival of human value, I recommend this Vox piece by Kelsey Piper, or for a more technical version this by Richard Ngo. If you are already convinced and are interested in contributing technically, I recommend this piece by Jacob Steinhardt, as unlike this document Jacob covers pre-2019 research and organises by topic, not organisation; or this from Critch & Krueger, or this from Everitt et al., though it is a few years old now.

Research Organisations

FHI: The Future of Humanity Institute

FHI is an Oxford-based Existential Risk Research organisation founded in 2005 by Nick Bostrom. They are affiliated with Oxford University. They cover a wide variety of existential risks, including artificial intelligence, and do political outreach. Their research can be found here. Their research is more varied than MIRI’s, including strategic work, work directly addressing the value-learning problem, and corrigibility work—as well as work on other Xrisks. They run a Research Scholars Program, where people can join them to do research at FHI. There is a fairly good review of this here. Unfortunately I suspect the pandemic may have reduced its effectiveness this year, as FHI has often favoured informal networking rather than formal management structures, but it seems to have worked well pre-pandemic and hopefully will again post-pandemic. The EA Meta Fund supported a special program for providing infrastructure and support to FHI, called the Future of Humanity Foundation. This reminds me somewhat of what BERI does. In the past I have been very impressed with their work.

Research

Bostrom & Shulman’s Sharing the World with Digital Minds discusses the moral issues raised by the potential for uploads or other digital minds. By virtue of their number, speed, or specific design, these could be utility monsters—a term from Nozick for agents much more efficient than humans at turning resources into utility. Would we therefore be obliged to give up all our resources to them and eventually let meat humanity starve to death? This much has been discussed before—indeed, I alluded to this as an argument against a universal basic income as a response to AI-driven unemployment in previous versions of this article!—but this article both provides a canonical reference and also a good survey showing that such issues come up under a wide variety of ethical views and technological possibilities.
I also enjoyed the discussion of the issues posed by rapid reproduction for ‘democratic’ political systems, where influence is the scarce resource. #Strategy

Ashurst et al.’s A Guide to Writing the NeurIPS Impact Statement gives advice on how to write the new ‘impact statements’ that NeurIPS now requires. Seizing this gap in the market by writing the canonical piece that everyone will find when they google—my tests suggest they have the SEO—and filling it with a counterfactually valuable article is some good out-of-the-box thinking. As well as containing many very useful links, I liked the suggestion that even theoretical pieces should consider their impacts. #Misc

Kovařík & Carey’s (When) Is Truth-telling Favored in AI Debate? provides some formalism and theorems around the properties of debate. I thought the section about debate length was very interesting, where it seems to show (at least for this class of debate) that debates are either long enough to produce the truth in a trivial manner (through full exposition) or else error can be arbitrarily high with even one fewer step, though they also identified plausible-seeming sub-classes with much better performance. (The paper is technically from the very end of 2019 but I missed it last year.) See also the discussion here. #Amplification

Shevlane & Dafoe’s The Offense-Defense Balance of Scientific Knowledge: Does Publishing AI Research Reduce Misuse? discusses whether increased AI publishing will generally be more useful for ‘attack’ or ‘defence’. They argue that the ‘publishing exploits is generally best practice (with a lag)’ model from cybersecurity might not be best placed here—an important argument to rebut, as many people used it to criticise OpenAI’s decision to be (initially) clopen with regard to GPT-2. #Strategy

Ord’s The Precipice provides a detailed overview of existential risks and the future of humanity.
It covers a variety of risks, including a good section on AGI, which Toby estimates as the largest risk at ~10% per century. There is also a huge amount of other material covered, including some ideas novel to me like the section on risk correlations, as well as some very motivational final chapters. I was pleasantly surprised to learn that 80% of DNA synthesis was being screened (in some way) for dangerous compounds. Probably replaces Bostrom and Ćirković as the best book on the subject now. #Overview

Carey et al.’s The Incentives that Shape Behaviour attempts to build a general theory of what sort of incentives lead agents to manipulate humans. This is basically causal diagram classification, revealing incentives to control and react to humans. It includes examples for both fairness incentives and also a possible way of reducing human manipulation incentivisation: optimising for a separately trained predictor. See also the discussion here. Researchers from Deepmind were also named authors on the paper. #AgentFoundations

Clarke’s Clarifying "What failure looks like" (part 1) attempts a more detailed analysis of the issues raised in Christiano’s What failure looks like. I liked the breakdown of lock-in mechanisms, which seem true to me. It provides a lot of examples, some of which I liked, like that of the Maori. However many of them were sufficiently simplified that I feel significant disanalogies were overlooked—for example, the Climate Change example neglects the very different incentives facing regulated utilities, and the agricultural revolution example seems to require a strong commitment to average utilitarianism, even though this is not a popular view of population ethics. Despite this I thought the underlying argument seemed pretty plausible.
#Forecasting

Armstrong et al.’s Pitfalls of Learning a Reward Function Online introduces two desirable properties for agents who are trying to learn human values at runtime (unriggability and uninfluenceability) and proves they are broadly the same thing. As well as proving this result, it contains a series of examples of what can go wrong in the absence of either property—including sacrificing reward with probability 100%—and a brief discussion of how counterfactual rewards might address the problem. It ends with an extended gridworld example, but I found this a little hard to follow. See also the discussion here. Researchers from Deepmind were also named authors on the paper. #ValueLearning

Tucker et al.’s Social and Governance Implications of Improved Data Efficiency discusses some of the strategic implications of ML systems that do not require as much data. They argue that it is not obvious that they will net benefit smaller firms—if the impact is multiplicative, it might benefit larger firms with more complements (like market access) more—though I am not sure a multiplicative effect is really a good model for what people are thinking about when they talk about ML models needing less data. They also point out that due to threshold effects this might enable entirely new applications, and in particular IRL/amplification, as these rely on a very scarce source of data: humans. #Forecasting

Cohen & Hutter’s Curiosity Killed the Cat and the Asymptotically Optimal Agent shows that because any agent that is guaranteed to eventually find the optimal strategy can only do so by testing every option, any ‘traps’ in the environment will eventually be triggered with probability 1 (unless traps are disabled after finite time). This is clearly kinda important—it is nice to be able to reason about asymptotic optimality, but we do not want an AGI that deletes humanity with p=1 en route.
This suggests something of a bootstrap problem, where we need a ‘mentor’ to avoid such dangers. Researchers from Deepmind were also named authors on the paper. #RL

Cohen & Hutter’s Pessimism About Unknown Unknowns Inspires Conservatism basically tries to make a conservative AIXI that defers to its mentor when it is not sure. It does this by comparing its worst-case estimates to its estimate of the mentor’s expected case, and defers to the mentor more when the difference is higher (and less as t→∞). Hopefully the mentor will help keep the agent from being too conservative, as it seems there is a risk that it simply ends up doing nothing, and gets out-competed by an EV-maximising agent? Researchers from Deepmind were also named authors on the paper. #RL

Nguyen & Christiano’s My Understanding of Paul Christiano’s Iterated Amplification AI Safety Research Agenda provides an overview of Paul’s IDA agenda. Probably the best such explanation so far; written by Chi when she was at FHI with in-line comments from Paul. Researchers from OpenAI were also named authors on the paper. #Amplification

Snyder-Beattie et al.’s The Timing of Evolutionary Transitions Suggests Intelligent Life Is Rare builds a Bayesian model to try to get around the anthropic problem of estimating how easy it is for life to develop. Specifically, they use non-informative priors and update based on the distribution of various transitions (e.g. Eukaryotes), concluding (similar to previous work they cite) that the development of life is relatively hard. See also the discussion here. #Forecasting

Ding & Dafoe’s The Logic of Strategic Assets: From Oil to AI analyses what causes a product to be ‘strategic’ to a country. They decompose this into the product of its Importance, Externalities and Rivalosity, in contrast to previous analysis of simply ‘military importance’.
Some of the examples I might quibble with—for example, the paper claims that the spillovers from railways lead private agents to underinvest, which is somewhat in tension with the experience of the railway bubbles. I am also a bit sceptical that this analysis really subsumes the idea of dependency-strategic items—nitrates in WWI, and nuclear weapons now, both lack substitutes and are at risk of supply disruptions, but neither really seems to have massive externalities. It also would have been nice to see some analysis of why individual firms do not internalise the risk of supply disruption—is this due to anti-price-gouging laws? It finishes with detailed discussion of two examples—British jet engines (reminding me of Attlee’s disastrous mistake with another type of engine) and US-Japanese rivalry. The report discusses several mistakes US policy made during this period—e.g. accidentally classifying cash registers as strategic, and missing rayon fibers—but these mistakes seem like they are adequately explained without the theory put forward by the paper. #NearAI

Cotton-Barratt et al.’s Defence in Depth Against Human Extinction: Prevention, Response, Resilience, and Why They All Matter provides a series of taxonomies for existential risks. In particular, they discuss distinctions between preventing and mitigating events, how events scale to be global, and how direct their effect is. See also the discussion here. #Strategy

Cihon et al.’s Should Artificial Intelligence Governance be Centralised? Design Lessons from History discusses the advantages of centralised or fragmented international law approaches to AI. Most of the considerations are not AI specific. Researchers from CSER were also named authors on the paper. #Strategy

O’Brien & Nelson’s Assessing the Risks Posed by the Convergence of Artificial Intelligence and Biotechnology discusses the impact of AI on biorisk.
They first discuss the problems with several existing frameworks and the potential impact of AI on biorisk, before offering their own framework. #OtherXrisk

Cremer & Whittlestone’s Canaries in Technology Mines: Warning Signs of Transformative Progress in AI attempts to identify possible signs of imminent AGI through expert solicitation of causal influence diagrams. Basically a technology that is seen as a prerequisite for many others is a candidate for being a canary. However, I didn’t feel the paper really addressed the issues raised in Eliezer’s Fire Alarm post. Researchers from CSER were also named authors on the paper. #Forecasting

O’Keefe’s How will National Security Considerations affect Antitrust Decisions in AI? An Examination of Historical Precedents surveys a bunch of historical antitrust actions in the US to see how national security arguments played into the outcome. He finds that it was pretty rare, especially recently, and when it did, it was generally congruent with the main antitrust objectives, namely preventing artificial reductions in output. The idea here presumably is to suggest that the US government is unlikely to use antitrust as a tool in an AI race unless firms start overcharging for their services. O’Keefe also lists support from OpenPhil. #Politics

Bostrom et al.’s Written Evidence to the UK Parliament Science & Technology Committee’s Inquiry on A new UK research funding agency recommends that Cummings’s new British DARPA focus on existential risks. I think this is a worthwhile but big ask—DARPA seems more intended to fund risky things than to reduce risk—and now Cummings has left I worry the window for intervention here may have passed. Researchers from CSER were also named authors on the paper. #Politics

O’Keefe et al.’s The Windfall Clause: Distributing the Benefits of AI for the Common Good proposes that AI firms voluntarily commit to donating some % of profits over a high threshold to humanity in general.
The idea is that the cost of this commitment is currently negligible, but would be extremely socially valuable if one firm gained a decisive strategic advantage. I think it’s good to work on novel governance strategies, but I’m not very enthusiastic about this specific option, partly for reasons I outlined in lengthy but unfinished comments on the forum post, but mainly because I don’t think it does much to reduce the existential risk, especially vs similar ideas like encouraging consolidation among AI firms. See also the discussion here. #Politics

Garfinkel’s Does Economic History Point Towards a Singularity? and the associated document analyse the claim that economic growth has been accelerating in accordance with global GDP (or population). In general it finds the evidence for this to be somewhat weak. #Forecasting

Prunkl & Whittlestone’s Beyond Near- and Long-Term: Towards a Clearer Account of Research Priorities in AI Ethics and Society proposes alternative divisions of the AI safety community other than near vs long term. These are: impacts, capabilities, certainty and scale. The paper argues that we should focus on these axes because 1) there is variance that is overlooked by a single short-vs-long axis and 2) this can cause misunderstandings. I did not really find this convincing: the purpose of any clustering is to summarize data, and I have yet to come across any examples of confusions that would be dispelled by their alternative axes. In fact, their motivating example—that of Etzioni’s misreading of Bostrom—is a case where relying on the ‘long term’ stereotype would have given Etzioni more accurate beliefs! Similarly, their examples of ‘intermediate’ issues, like the long-term impact on inequality of algorithmic discrimination, seem to me like precisely the sort of political (and in my opinion mistaken) concern that everyone would agree falls into the ‘short-term’ camp.
But perhaps, like Cave & Ó hÉigeartaigh, this paper is better understood as a speech act. See also the discussion here. Researchers from Leverhulme were also named authors on the paper. #Strategy

FHI researchers contributed to the following research led by other organisations:

CHAI: The Center for Human-Compatible AI

CHAI is a UC Berkeley based AI Safety Research organisation founded in 2016 by Stuart Russell. They do ML-orientated safety research, especially around inverse reinforcement learning, and cover both near and long-term future issues. One outside interpretation of their work from Alex Flint is here. As an academic organisation their members produce a very large amount of research; I have only tried to cover the most relevant below. It seems they do a better job engaging with academia than many other organisations, especially in terms of interfacing with the cutting edge of non-safety-specific research. The downside of this, from our point of view, is that not all of their research is focused on existential risks. Rohin Shah, now with additional help, continues to produce the AI Alignment Newsletter, covering in detail a huge number of interesting new developments, especially new papers. I really cannot praise these newsletters highly enough. Unfortunately for CHAI, but probably fortunately for the world, he has graduated and is moving to Deepmind. They have expanded somewhat to other universities outside Berkeley and have people at places like Princeton and Cornell.

Research

CHAI and their associated academics produce a huge quantity of research. Far more so than for other organisations, their output is under-stated by my survey here; if they were a small organisation that only produced one report, there would be 100% coverage, but as it is this is just a sample of those pieces I felt most interested in. On the other hand academic organisations tend to produce some slightly less relevant work also, and I have focused on what seemed to me to be the top pieces.

Critch & Krueger’s AI Research Considerations for Human Existential Safety (ARCHES) is a super-detailed overview of the state of the field, and a research agenda.
It provides a detailed explanation of key concepts and a categorisation schema of various possible scenarios, including new distinctions I hadn’t seen clearly made before. This is a mammoth document, and I encourage the reader to attempt it if possible. A few interesting points for me were his framing of AI researchers’ discussions of ‘near’ AI problems as being the first steps towards admitting problems, and his suggestion that Distributional Shift work might not be neglected by industry. Contrary to some others he argues that we should perhaps never make ‘prepotent’ AI (one that cannot be controlled by humans)—not even a defensive one to prevent other AI threats. There is also a lot of discussion of multi-polar scenarios—the idea that single-agent alignment/delegation problems are less important to focus on, partly because the single-agent version is more likely to be solved by profit-maximising firms. See also the discussion here. Researchers from BERI were also named authors on the paper. #Overview

Bobu et al.’s LESS is More: Rethinking Probabilistic Models of Human Behavior attempts to extend the model of Boltzmann rationality (where humans choose the best option, with noise, from a finite menu) to the continuous case. This is essentially by providing continuous measures of how ‘similar’ different options are, to show that e.g. driving at 41mph and 41.1mph are basically the same thing. #IRL

Christian’s The Alignment Problem: Machine Learning and Human Values is a heavier-than-pop-sci book introduction to near and long-term AI issues. It does a good job connecting short-term worries (first part of book) to the bigger longer-term issues (second part of book), tying them together in multiple ways, and the scholarship seems very good. I enjoyed reading it. #Overview

Critch’s Some AI research areas and their relevance to existential safety describes Critch’s views on a variety of strategic research landscape questions.
It contains some interesting ideas, like technical progress legitimising governance demands by making them credibly achievable. More important is the detailed and sophisticated analysis of each of these research areas in terms of their value and neglectedness. Notable for me were the sections arguing that research areas I have historically thought of as being pretty core to reducing AI X-risk, like Agent Foundations and Value Learning, are not very useful, as well as a very positive view of studying Human-Robot Interaction. However, I think it is a little credulous with regard to many near AI safety issues like fairness, to the point of supporting GDPR because more regulation is desirable, regardless of whether that regulation is good. #Strategy

Gleave et al.’s Quantifying Differences in Reward Functions introduces a distance metric for reward functions. This allows us to judge whether two reward functions are ‘the same’—at least relative to a certain environment. They might differ in a larger environment, as this pseudo-metric is weaker than utility functions’ being identical up to an affine transformation. It might be useful as a measure of how accurately RL agents have learnt the intended reward. Researchers from Deepmind were also named authors on the paper. #RL

Reddy et al.’s Learning Human Objectives by Evaluating Hypothetical Behavior attempts to learn safely by using hypothetical scenarios. Basically, prior to letting the RL agent run around in the environment and potentially act unsafely, they procedurally generate hypotheticals in various ways and have humans give feedback on them, so the agent can pre-learn before being let loose on the real environment. See also the discussion here. Researchers from Deepmind were also named authors on the paper. #IRL

Freedman et al.’s Choice Set Misspecification in Reward Inference introduces and analyses the implications of an IRL agent which has mistaken beliefs about its teacher’s choice set.
The obvious consequence would be assigning a low value to something that the human appears to have decided against—when it was actually inaccessible. The paper breaks this down into different cases, and shows (somewhat unsurprisingly) that the harm this does can vary from negligible to maximal. In some scenarios it is even helpful, by preventing an imperfectly rational human from mistakenly choosing a sub-optimal choice during training. #IRL

Shah’s AI Alignment 2018-19 Review is a huge overview of AI alignment work from the prior two years. If you want to survey what people have been working on (as opposed to determining which organisations are best to donate to) this post is an excellent resource. #Overview

Russell & Norvig’s Artificial Intelligence: A Modern Approach, 4th Edition is the latest version of the famous textbook. It contains a chapter on AI ethics and safety, as previous editions did. The chapter is mainly focused on ‘near’ AI issues like discrimination; while it does provide an overview of some of the issues and techniques in AI alignment work, it doesn’t really make the case for why this is so vitally important. #Textbook

Halpern & Piermont’s Dynamic Awareness presents a version of modal logic for logical uncertainty. Specifically, agents becoming ‘aware’ of propositions they had not previously considered. #AgentFoundations

CHAI researchers contributed to the following research led by other organisations:

MIRI: The Machine Intelligence Research Institute

MIRI is a Berkeley based independent AI Safety Research organisation founded in 2000 by Eliezer Yudkowsky and currently led by Nate Soares. They were responsible for much of the early movement building for the issue, but have refocused to concentrate on research for the last few years. With a fairly large budget now, they are the largest pure-play AI alignment shop. Their research can be found here. Their annual summary can be found here. In general they do very ‘pure’ mathematical work, in comparison to other organisations with more ‘applied’ ML or strategy focuses. I think this is especially notable because of the irreplaceability of the work. It seems quite plausible that some issues in AI safety will arise early on and in a relatively benign form for non-safety-orientated AI ventures (like autonomous cars or Minecraft helpers) – however the work MIRI does largely does not fall into this category. I have also historically been impressed with their research and staff. Their agent foundations work is basically trying to develop the correct way of thinking about agents and learning/decision making by spotting areas where our current models fail and seeking to improve them. This includes things like thinking about agents creating other agents.

MIRI, in collaboration with CFAR, runs a series of four-day workshop/camps, the AI Risk for Computer Scientists workshops, which gather mathematicians/computer scientists who are potentially interested in the issue in one place to learn and interact. This sort of workshop seems very valuable to me as an on-ramp for technically talented researchers, which is one of the major bottlenecks in my mind. In particular they have led to hires for MIRI and other AI Risk organisations in the past. I don’t have any first-hand experience however, and presumably these were significantly suppressed by the pandemic.
They also support MIRIx workshops around the world, for people to come together to discuss and hopefully contribute towards MIRI-style work. MIRI continue their policy of nondisclosure-by-default, something I’ve discussed in the past, which despite having some strong arguments in favour unfortunately makes it very difficult for me to evaluate them. I’ve included some particularly interesting blog posts some of their people have written below, but many of their researchers produce little to no public-facing content. They are (were?) also apparently considering leaving the Bay Area, which I think I would consider positively.

Edit 2020-12-25: after publishing this article, MIRI posted this blog post explaining they were embarking on a significant change of direction, as they felt their post-2017 primary research direction, working on fundamental agent foundations ‘deconfusion’, was not making much progress. Some staff members will be leaving as a result. It is not clear to what extent they will disclose their new research directions. I haven’t had time to fully internalise this news, so leave the link for the reader to evaluate.

Research

Most of their work is non-public. Here are three forum posts from the last year by staff that I thought were insightful.

Hubinger’s An overview of 11 proposals for building safe advanced AI examines eleven different strategies for AI safety. It evaluates these on how promising they are for both the inner and outer alignment problems, as well as competitiveness—it is no good producing a 100% safe system if someone else out-competes you with a more risky one. This is the first post I’ve seen of this type and it does a great job. #Overview

Garrabrant’s Cartesian Frames is a sequence of posts putting forward a new way of thinking, and associated mathematical formalism, about agency.
The idea is to move away from dualistic AIXI-style models, where the agent is outside the world, towards a system where we can examine different ‘framings’, each of which suggests a different thing as being agent-like—being able to make choices. This sensible philosophical motivation is then associated with a lot of category theory formalism, allowing you to do things like combining agents, decomposing agents, etc. #AgentFoundations

Demski’s Radical Probabilism presents a non-Bayesian (ish) alternative account of probability. It is designed to take into account non-certain evidence, and allow for less rigid updating rules—in particular the fact that we can learn from thinking, not just from new sense data. I really enjoyed the dialogues, where I think the foil did a good job of presenting the objections I wanted to make. At the end of it I’m still not sure what I think though—it seems a little unfair to compare a fully specified system, whose problems are easy to point out, with a somewhat hypothetical replacement. #AgentFoundations

According to Riedel & Deibel, over the 2016-2020 period, MIRI came in third for the number of citations in technical AI safety.

Finances

They spent $6,050,067 in 2019 and $7,500,000 in 2020, and plan to spend around $6,500,000 in 2021. They have around $13,380,000 in cash and pledged funding, suggesting (on a very naïve calculation) around 2.1 years of runway. 2020 spending was above plan; most orgs spent less due to the pandemic, but MIRI invested in sub-quarantine live/work spaces outside Berkeley so researchers could still benefit from in-person collaboration. They have been supported by a variety of EA groups in the past, including OpenPhil. They are not running a formal fundraiser this year but apparently would still welcome donations; if you wanted to donate to MIRI, here is the relevant web page.
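For the curious, the ‘very naïve calculation’ I use for runway throughout this document is just reserves divided by next year’s planned budget. A minimal sketch, using MIRI’s reported figures above (the `runway_years` helper is my own illustrative name, not anything the organisations publish):

```python
def runway_years(reserves: float, planned_budget: float) -> float:
    """Naive runway: cash plus pledged funding, divided by next year's planned spend."""
    return reserves / planned_budget

# MIRI: roughly $13.38m in cash and pledged funding vs a planned 2021 budget of $6.5m.
print(round(runway_years(13_380_000, 6_500_000), 1))  # → 2.1
```

Note this deliberately ignores future fundraising, spending growth, and restricted funds, which is why I call it naïve; GCRI’s own comments below show how the headline number can differ from the runway of core operations.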

GCRI: The Global Catastrophic Risks Institute

GCRI is a globally-based independent Existential Risk Research organisation founded in 2011 by Seth Baum and Tony Barrett. They cover a wide variety of existential risks, including artificial intelligence, and do policy outreach to governments and other entities. Their research can be found here. Their annual summary can be found here. In 2020 they continued their advising program, through which they gave guidance to people from around the world who wanted to help work on catastrophic risks. In 2020 they hired McKenna Fitzgerald as Project Manager and Research Assistant.

Research

Baum’s Accounting for violent conflict risk in planetary defense decisions discusses the impacts and lessons from asteroid defence for other Xrisks, mainly nuclear war. It contains some interesting history about how Congress came to care about asteroid defence—including that popular movies, while inaccurate, were quite helpful, and that many astronomers were relatively opposed. It also points out that using nuclear weapons or similar against an asteroid would probably be in violation of international law. Presumably in a disaster scenario the US would simply ignore this, but it might make preparation and practice ahead of time more difficult. #OtherXrisk

Baum’s Quantifying the Probability of Existential Catastrophe: A Reply to Beard et al. responds to the CSER paper. It makes some methodological points, like about the importance of different thresholds for what constitutes a catastrophe, and ways in which this forecasting could be improved. See also the discussion here. #Forecasting

Baum’s Artificial Interdisciplinarity: Artificial Intelligence for Research on Complex Societal Problems discusses how AI could be used to aid research that joins multiple fields of research. For example, relatively basic AI could improve search engines by improving synonym handling, whereas more advanced AI could summarise papers.
#NearAI Baum’s Medium-Term Artificial Intelligence and Society introduces the idea of Medium-Term AI risks. It argues these could be a unifying issue for those worried about near and long term risks. #NearAI According to Riedel & Deibel, over the 2016-2020 period, GCRI accounted for the second largest number of citations for meta-AI-safety work. Finances They spent $250,000 in 2019 and $300,000 in 2020, and plan to spend around $400,000 in 2021. They have around $600,000 in cash and pledged funding, suggesting (on a very naïve calculation) around 1.5 years of runway. However, they tell me that for their core operations runway is close to one year, while the runway for external collaborators is longer. If you want to donate to GCRI, here is the relevant web page.

CSER: The Center for the Study of Existential Risk

CSER is a Cambridge based Existential Risk Research organisation founded in 2012 by Jaan Tallinn, Martin Rees and Huw Price, and then established by Seán Ó hÉigeartaigh with the first hire in 2015. They are currently led by Catherine Rhodes and are affiliated with Cambridge University. They cover a wide variety of existential risks, including artificial intelligence, and do political outreach, including to the UK and EU parliaments—e.g. this. Their research can be found here. Their half-yearly review can be found here. They took on a number of new staff in 2020, most notably John Burden, Jess Whittlestone and Matthijs Maas. Jess joins from Leverhulme, where I think she produced some of their best work.

Research

Beard et al.’s An Analysis and Evaluation of Methods Currently Used to Quantify the Likelihood of Existential Hazards surveys a range of possible techniques for estimating the probability of different existential risks. They then score these on four criteria, and find that no method does well on all. The document contains a number of interesting points, including on the extreme dispersion in some estimates, like those for supervolcanoes. It also alludes to the use of ‘bad, or even discredited’ techniques in the existential risk community—this is a case where I wish they had named and shamed! #Forecasting

Belfield’s Activism by the AI Community: Analysing Recent Achievements and Future Prospects reviews the prospects for successful activism by AI employees. It first reviews their historical successes, then uses two different frameworks (as an epistemic community like scientists, and as workers) to analyse the issue, and concludes that AI workers are likely to continue to have significant power to change things through activism. I think this is basically true—my model for grand success runs basically through convincing this epistemic community. One thing the paper does not discuss, though, is the question of getting the AI community to care about the right things! #Strategy

Belfield et al.’s Response to the European Commission’s consultation on AI recommends the EU pass strict rules about AI. These largely cover more near term issues, and there is no explicit mention of catastrophic risks (that I noticed), but some could be long-run beneficial. The response generally seems written in a way that would appeal to policymakers. I wonder if part of the subtext is making EU AI deployment sufficiently arduous as to slow down AI progress (they deny this!). Researchers from Leverhulme were also named authors on the paper. #Politics

Beard et al.’s Existential risk assessment: A reply to Baum responds to the GCRI response to their earlier paper. #Forecasting

Ó hÉigeartaigh et al.’s Overcoming Barriers to Cross-cultural Cooperation in AI Ethics and Governance discusses and advocates for international collaboration on AI safety. The lengthy discussion includes some interesting points about misconceptions and the prospects for common agreements in the presence of very different value systems, but is mainly an imperative piece rather than an analytical one. It focuses on Sino-American cooperation; three of the coauthors are Chinese. Researchers from Leverhulme were also named authors on the paper. #Politics

Beard & Kaczmarek’s On the Wrongness of Human Extinction rebuts an argument that extinction would not be bad because non-existent people cannot be harmed. In particular they argue we wrong such future people by failing to benefit them, even though they have not been harmed. To the extent that responding to such arguments helps motivate people to prevent extinction this is a useful thing to do. (I guess if extinction were actually good that would be good to know too, as we could all stop working so hard!) #Ethics

Avin et al.’s Exploring AI Futures Through Role Play describes a series of war games the authors ran about future AI development. This is definitely a cool idea—I suspect I would enjoy taking part, and their sign-up sheet seems to be still live—and historically these exercises have proved useful in war, like the (in)famous Millennium Challenge 2002. However, I am a bit skeptical of how much insight these particular games have produced—many of the conclusions (e.g. cooperation is important to produce a good outcome) seem both non-novel and also something that was plausibly ‘fed into’ the structure of the game. I am always a little suspicious of ideas that seem too much like fun! #Forecasting

Tzachor et al.’s Artificial intelligence in a crisis needs ethics with urgency discusses near-term AI risks related to the pandemic. It mentions things like fairness and privacy, but doesn’t really have any specific examples of AI related problems, which aligns with my feeling that our pandemic response would have been better with fewer restrictions (e.g. our contact tracing could have been better without HIPAA). The intention appears to be to use this to establish an AI regulatory board to oversee novel techniques in the future. Researchers from Leverhulme were also named authors on the paper. #NearAI

Kemp & Rhodes’s The Cartography of Global Catastrophic Risks surveys the international governance structures for various Xrisks. #Politics

Burden & Hernandez-Orallo’s Exploring AI Safety in Degrees: Generality, Capability and Control argues for decomposing the risk of an AI agent into its Capabilities, Generality and our degree of Control. It suggests using Agent Characteristic Curves for this, and includes a toy example. Note that I think the lead author had not technically started at CSER when he wrote the paper. Researchers from Leverhulme were also named authors on the paper. #Capabilities

They also did work on various non-AI issues, which I have not read, but which you can find on their website.

CSER researchers contributed to the following research led by other organisations:

OpenAI

OpenAI is a San Francisco based independent AI Research organisation founded in 2015 by Sam Altman. They are one of the leading AGI research shops, with a significant focus on safety. Initially they planned to make all their research open, but they changed plans and are now significantly more selective about disclosure—see for example here. One of their biggest achievements is GPT-3, a massive natural language model that generates highly plausible continuations from prompts and seems to be very versatile. Scott and Gwern managed to get GPT-2 to play chess, and see also other GPT-3 work by Gwern here, including a to my mind convincing refutation of Gary Marcus’s criticisms (here). The Guardian published an article in which GPT-3 argued that AGI was not a threat to humanity; the article is not very much less convincing than is typical for such arguments.

Research

Christiano’s "Unsupervised" translation as an (intent) alignment problem introduces translation between two languages where no mutual text exists as an analogy for advanced systems. This task seems do-able for a sufficiently advanced AI (I think, though probably some philosophers of language would disagree), but it would be very hard for humans to understand what was going on or to stay ‘in-the-loop’. #Transparency

Brown et al.’s Language Models are Few-Shot Learners examines what happens to GPT-3’s ability to learn a new task with very few examples when you massively increase the number of parameters. Essentially the idea is that as the number of parameters and number of co-authors gets large enough, it gains something like general purpose intelligence, which then allows it to learn new tasks with very few examples—like a human can. Performance on some of these tasks could even beat specially-trained models. The paper also has a detailed and professional section on potential for misuse in various near AI problems. #GPT-3

Barnes & Christiano’s Writeup: Progress on AI Safety via Debate summarises OpenAI’s attempts to design mechanisms to allow non-experts to safely extract information from unaligned experts. It describes various problems they came across, like the deceptive use of ambiguity, or frame control, and their corrections to the mechanism design, like the addition of ‘cross-examination’. Cross-examination basically forces consistency, and they analogise this to expanding the computational complexity class, but it is not clear how desirable this is—it seems intuitively to me like making something that worked locally with subgames would be ideal. I particularly liked the discussion of their iteration method, rather than just presenting the ‘final’ product sui generis. #Amplification

Brundage et al.’s Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims describes a variety of ways to promote third-party verifiability of AI systems. This includes coding it into the AI (ideas like interpretability that we often discuss), hardware elements, and institutional reforms, like public bounties for people who find bugs. One of the most noteworthy parts of the document is the wide range of institutions represented in the author list, including many universities around the world. Researchers from FHI, CSER, Leverhulme and CSET were also named authors on the paper. #Strategy

Stiennon et al.’s Learning to Summarize with Human Feedback trains a model for writing short text summaries based on human feedback. It first trains a reward model with supervised learning, and then uses that to train an RL agent. They invested in higher-than-usual quality feedback (hourly rate contractors vs Mturkers) and successfully produced summaries of Reddit posts and Daily Mail articles that were on average higher quality than the human written ones (though the latter were hardly Shakespeare). It is basically attempting to produce ‘approved by humans’ output, instead of just GPT-3 style ‘looks like human written’ output—including testing how hard you can optimise for a proxy before you start getting perverse effects. I also liked the point that the model picked up that the reviewers liked longer summaries (similar to how Reddit likes EffortPosts?). #ValueLearning

Henighan et al.’s Scaling Laws for Autoregressive Generative Modeling examines how transformer performance scales with compute in various cases. They find generally pretty similar and smooth relationships in multiple domains, implying a lack of (near) upper bound, and suggest that on the margin bigger models are more worth the computational effort than training smaller ones for longer. #Capabilities

OpenAI researchers also contributed to the following papers led by other organisations:

Google Deepmind

Deepmind is a London based AI Research organisation founded in 2010 by Demis Hassabis, Shane Legg and Mustafa Suleyman, and currently led by Demis Hassabis. They are affiliated with Google. As well as being arguably the most advanced AI research shop in the world, Deepmind has a very sophisticated AI Safety team, covering both ML safety and AGI safety. I won’t cover their non-directly-safety-related work in detail, but one highlight is that this year Deepmind announced they had made significant progress on the Protein Folding problem with their AlphaFold architecture. While there is still a way to go before we can use it to build arbitrary proteins, this is clearly a big step forward, and shows the generality of their approach. See also discussion here. Long-time followers of the space will recall this is a development Eliezer highlighted back in 2008. See also this very interesting speculation that Deepmind’s team-based private sector approach gave them a significant advantage over academia, and that their speed helped limit knowledge diffusion. They also produced this work on one-shot object naming learning in a physical environment—so rather than having to show the agent a huge number of pictures of cows for it to learn what a cow is, it successfully learns new object names based on a very small number of samples. See also discussion here. Jan Leike left Deepmind in June.

Research

Krakovna et al.’s Specification gaming: the flip side of AI ingenuity is basically an introduction, with many examples, to the problem of AIs producing solutions you did not expect—or want. It discusses both failures of reward shaping as well as AIs manipulating the rewards. #ValueLearning

Gabriel’s Artificial Intelligence, Values and Alignment discusses the alignment problem from various philosophical perspectives. It makes some novel (at least to me) points, like the way that technical AI design may render some ethical systems unobtainable—for example, an optimiser that does not think in terms of ‘reasons’ is unacceptable to the extent that Kantian deontology is the case. The connection between IRL and virtue ethics was also cute. Overall I thought it was a quite sophisticated treatment of the subject. #Ethics

Krakovna et al.’s Avoiding Side Effects By Considering Future Tasks proposes a method for reducing side effects. We specify a default policy, and then penalise the agent for restricting our future options relative to that default policy. This helps avoid the risk of e.g. the agent being incentivised to undermine the human’s attempts to shut it down. #Corrigibility

Uesato et al.’s Avoiding Tampering Incentives in Deep RL via Decoupled Approval addresses the problem of agents messing with their value function (by e.g. setting utility=IntMax in their params file) by querying a human for reward with regard to actions other than those taken. They need to make some assumptions about the structure of the corruption that seem not obvious to me, but it seems like a cool idea. On my reading it doesn’t strongly disincentivise tampering—it just fails to reward it—which is still an improvement. They back this up with some toy models. #ValueLearning

Researchers from Deepmind were also named on the following papers:

BERI: The Berkeley Existential Risk Initiative

BERI is a (formerly Berkeley-based) independent Xrisk organisation, founded by Andrew Critch but now led by Sawyer Bernath. They provide support to various university-affiliated (FHI, CSER, CHAI) existential risk groups to facilitate activities (like hiring engineers and assistants) that would be hard within the university context, alongside other activities—see their FAQ for more details. As a result of their pivot they are now focused essentially entirely on providing support to researchers engaged in longtermist (mainly x-risk) work at universities and other institutions. In addition to FHI, CSER and CHAI they added six new ‘trial’ collaborations in 2020, and intend to do more in 2021. Here are the 2020 cohort:

Ought

Ought is a San Francisco based independent AI Safety Research organisation founded in 2018 by Andreas Stuhlmüller. They research methods of breaking up complex, hard-to-do tasks into simple, easy-to-do tasks—to ultimately allow us effective oversight over AIs. This includes building computer systems, and previously also recruiting test subjects. Their research can be found here. Their annual summary (sort of) can be found here. In the past they were focused on factored generation—trying to break down questions into context-free chunks so that distributed teams could produce the answer—and factored evaluation, an easier task (by analogy to P vs NP). I thought of them as basically testing Paul Christiano’s ideas. They have moved on to trying to automate research and reasoning, by building software to help break complicated questions into subtasks that are simpler to evaluate and potentially automate.

Research

Saunders et al.’s Evaluating Arguments One Step at a Time provides a detailed analysis of some of Ought’s 2019 work on factored evaluation. They tried to break down opinions about movie reviews into discretely checkable sections between a friendly and adversarial agent. The trees they ended up using are quite small—just two layers, plus the root node—presumably because of the problems they had previously encountered with massive tree growth. It’s hard to judge the performance numbers they put out, because it’s not obvious what sort of performance we would expect from such a circumscribed test, even conditional on this being a good approach, but the efficacy they report does not look that encouraging to me. #Amplification

Byun & Stuhlmuller’s Automating reasoning about the future at Ought describes Ought’s new program of providing tools to help people with forecasting. This includes assigning probabilities and distributions to beliefs, vaguely similarly to Guesstimate. They are now working on building a GPT-3 research assistant. #Amplification

Finances

They spent around $1,200,000 in 2019 and $1,200,000 in 2020, and plan to spend around $1,400,000 in 2021. Their 2020 spend was significantly below plan (around $2.5m) due to slower hiring and ending human participant experiments. They have around $3,100,000 in cash and pledged funding, suggesting (on a very naïve calculation) around 2.2 years of runway. They are not looking for donations from the general public this year.
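The ‘two layers plus root’ trees that Saunders et al. describe can be pictured with a tiny recursive evaluator, where a judge only ever sees one claim and its immediate sub-claims. This is an illustrative sketch of the general factored-evaluation idea, not Ought’s actual system (the claims and the keyword judge are invented):

```python
# Factored evaluation in miniature: a judge never sees the whole argument,
# only one parent claim plus its direct sub-claims. A claim survives iff the
# local step is accepted AND every sub-claim recursively survives.

def evaluate(claim, judge):
    """Accept a claim iff the judge accepts the local step and all sub-claims."""
    subclaims = claim.get("subclaims", [])
    if not judge(claim["text"], [c["text"] for c in subclaims]):
        return False
    return all(evaluate(c, judge) for c in subclaims)

# A toy judge that only checks local plausibility by keyword.
def toy_judge(text, subtexts):
    return "unsupported" not in text

tree = {
    "text": "The movie review is positive overall",
    "subclaims": [
        {"text": "The reviewer praises the acting"},
        {"text": "unsupported assertion about the plot"},
    ],
}
print(evaluate(tree, toy_judge))  # -> False: one sub-claim fails its local check
```

The point of the factoring is that each call to `judge` is a small, context-free task that could be given to a different person (or model).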

GPI: The Global Priorities Institute

GPI is an Oxford-based Academic Priorities Research organisation founded in 2018 by Hilary Greaves and part of Oxford University. They do work on philosophical issues likely to be very important for global prioritisation, much of which is, in my opinion, relevant to AI Alignment work. Their research can be found here. They recently took on two new economics postdocs (Benjamin Tereick and Loren Fryxell) and two new philosophy postdocs (David Thorstad and Jacob Barrett).

Research

Trammell & Korinek’s Economic growth under transformative AI applies a variety of models of economic growth to the introduction of AI. These consider both a variety of models and a variety of ways AI could matter—is it a perfect substitute for labour? Do AIs make more AIs?—and summarises the results of this mathematical analysis. I particularly liked the way that discrete qualitative changes in economic regime fell out of the analysis. Overall I thought it did a nice job unifying the two disciplines. #Forecasting

Mogensen’s Moral demands and the far future argues that, contra most people’s suppositions, egalitarian utilitarianism requires the present rich not to transfer resources to the present poor but to future generations. It argues this is true under various versions of population ethics. #Ethics

Tarsney & Thomas’s Non-Additive Axiologies in Large Worlds argues that even average-utility type theories should care about the potential for adding many new happy people in the future, because all the past animals provide a large fixed utility background. This fixed utility makes the average behave like the sum, at least locally, so adding a large number of lives that are better off than the average historical rodent is very worthwhile. It’s not clear what we should do about aliens. I have always regarded these ideas as something of a reductio of average consequentialism and similar views, but it is nice to have a proof to show that even those who are convinced should care quite a lot (if not quite as much) as totalists about Xrisk. #Ethics

Thorstad & Mogensen’s Heuristics for clueless agents: how to get away with ignoring what matters most in ordinary decision-making addresses the cluelessness problem—that the immense importance and uncertainty of the long run future leaves us clueless as to what to do—through the use of local heuristics. #DecisionTheory

Tarsney’s Exceeding Expectations: Stochastic Dominance as a General Decision Theory suggests we can avoid some of the paradoxes of expected utility maximisation (e.g. the St Petersburg Paradox) by using Stochastic Dominance. This basically comes down to arguing that we can make use of background assumptions to push the dominance condition to give us virtually all of the benefits of expectation maximisation, while avoiding the Pascalian type problems—and of course stochastic dominance is a prima facie attractive principle in itself. #DecisionTheory

Mogensen & Thorstad’s Tough enough? Robust satisficing as a decision norm for long-term policy analysis advocates for ‘robust satisficing’, as an alternative to expectation maximisation, as a decision criterion in cases where there is ‘deep’ uncertainty. The aim is basically to give a firmer theoretical underpinning for engineers to use this relatively conservative approach in risky situations. #Strategy

John & MacAskill’s Longtermist institutional reform describes a number of potential governance changes we could make to try to represent the interests of future people better. These include impact assessments, people’s assemblies and separate legislative houses. I think this is a good project to work on, but I’m sceptical of these specific proposals; they seem a bit like a list of ‘policies that sound nice’ to me, without really considering all the problems—for example, our current use of environmental impact assessments seems to have had very negative consequences for our ability to build any new infrastructure, and I think there are good reasons sortition has rarely been used in practice. See also discussion here. #Politics

Finances

They spent £600,000 in 2018/2019 (academic year) and £850,000 in 2019/20, which was less than their plan of £1,400,000 due to the pandemic, and intend to spend around £1,400,000 in 2020/2021. They suggested that as part of Oxford University ‘cash on hand’ or ‘runway’ were not really meaningful concepts for them, as they need to fully-fund all employees for multiple years. If you want to donate to GPI, you can do so here.
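The first-order stochastic dominance condition Tarsney leans on is easy to check directly for discrete prospects: A dominates B iff, at every threshold, A gives at least as much probability of doing at least that well, and strictly more somewhere. A toy sketch (the prospects and numbers are my own, not from the paper):

```python
# First-order stochastic dominance for discrete prospects, represented as
# dicts mapping outcome value -> probability (toy data, not from Tarsney).

def dominates(a, b):
    """True iff prospect a first-order stochastically dominates prospect b."""
    thresholds = sorted(set(a) | set(b))
    ge = lambda d, x: sum(p for v, p in d.items() if v >= x)  # P(outcome >= x)
    at_least = all(ge(a, x) >= ge(b, x) for x in thresholds)
    strictly = any(ge(a, x) > ge(b, x) for x in thresholds)
    return at_least and strictly

risky  = {0: 0.5, 10: 0.5}
better = {0: 0.4, 10: 0.6}   # same outcomes, more mass on the good one
print(dominates(better, risky))  # -> True
print(dominates(risky, better))  # -> False
```

Note how weak the criterion is on its own: most pairs of prospects are simply incomparable, which is why Tarsney needs background uncertainty to make it bite.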

CLR: The Center on Long Term Risk

CLR is a London (previously Germany) based Existential Risk Research organisation founded in 2013 and until recently led by Jonas Vollmer (who has now moved to EA Funds). Until this year they were known as FRI (Foundational Research Institute) and were part of the Effective Altruism Foundation (EAF). They do research on a number of fundamental long-term issues, with AI as one of their top two focus areas (along with Malevolence, though that is still related). You can see their recent research summarised here. In general they adopt what they refer to as ‘suffering-focused’ ethics, which I think is a quite misguided view, albeit one they seem to approach thoughtfully. They recently hired Alex Lyzhov, Emery Cooper, Daniel Kokotajlo (from AI Impacts, possibly not permanently), and Julian Stastny as full-time research staff, Maxime Riché as a research engineer and Jia Yuan Loke part-time.

Research

Althaus & Baumann’s Reducing long-term risks from malevolent actors analyses the dangers posed by very evil people (those who score highly on the ‘dark triad’ traits), and suggests some possible techniques to reduce the risk. This detailed report, on an area I hadn’t seen much work on before, includes the context of whole brain emulation, AGI, etc. #Politics

Clifton’s Equilibrium and prior selection problems in multipolar deployment describes the problem of ensuring desirable equilibria between multiple agents when they have different priors. The idea that different equilibria could be possible etc. is well known, but the contribution here is to point out that different priors between teams/agents could push you into a very bad equilibrium—for example, if your Saxons falsely believe the Vikings are bluffing. #AgentFoundations

Clifton & Riché’s Towards Cooperation in Learning Games discusses the meta-game-theoretic problem of how to get AI teams to cooperate on the task of building AIs that will cooperate with each other. They introduce the idea of Learning TFT and run some experiments around its performance. #AgentFoundations

Finances

They spent around $1,400,000 in 2019, around $1,100,000 in 2020, and plan to spend around $1,800,000 in 2021. They have around $950,000 in reserves, suggesting (on a very naïve calculation) around 0.6 years of runway. Their 2019 spending was somewhat higher than they expected a year ago, due to FX changes and some unexpected items, especially related to travel and their move to the UK. They have a collaboration with the Swiss-based Center for Emerging Risk Research, who have agreed to fund 15% of their costs. If you wanted to donate to CLR, you could do so here.
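For readers unfamiliar with the TFT that Clifton & Riché’s ‘Learning TFT’ generalises: here is plain tit-for-tat in the standard iterated prisoner’s dilemma, with textbook payoffs. This is background for the paper’s name, not the paper’s own algorithm or experiments:

```python
# Tit-for-tat in the iterated prisoner's dilemma (standard textbook payoffs).
# C = cooperate, D = defect; PAYOFFS maps (my move, your move) -> (my, your).

PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(my_history, their_history):
    # Cooperate first, then copy the opponent's last move.
    return "C" if not their_history else their_history[-1]

def always_defect(my_history, their_history):
    return "D"

def play(strat1, strat2, rounds=5):
    h1, h2, s1, s2 = [], [], 0, 0
    for _ in range(rounds):
        m1, m2 = strat1(h1, h2), strat2(h2, h1)
        p1, p2 = PAYOFFS[(m1, m2)]
        h1.append(m1); h2.append(m2); s1 += p1; s2 += p2
    return s1, s2

print(play(tit_for_tat, tit_for_tat))    # -> (15, 15): sustained cooperation
print(play(tit_for_tat, always_defect))  # -> (4, 9): exploited once, then retaliates
```

The paper’s twist is to lift this reciprocity from moves within a game to the choice of learning algorithms by the teams building the agents.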

CSET: The Center for Security and Emerging Technology

CSET is a Washington based Think Tank founded in 2019 by Jason Matheny (ex IARPA), affiliated with Georgetown University. They analyse new technologies for their security implications and provide advice to the US government. At the moment they are mainly focused on near-term AI issues. Their research can be found here.

Research

Hwang’s Shaping the Terrain of AI Competition discusses strategies for the US to compete with China in AI. In particular, these attempt to nullify the ‘natural’ advantages authoritarian or totalitarian states may have. #Politics

Imbrie et al.’s The Question of Comparative Advantage in Artificial Intelligence: Enduring Strengths and Emerging Challenges for the United States discusses the relative advantages of the US and China in AI development. #Politics

Finances

As they apparently launched with $55m from the Open Philanthropy Project, and subsequently raised money from the Hewlett Foundation, I am assuming they do not need more donations at this time.

AI Impacts

AI Impacts is a San Francisco (previously Berkeley) based AI Strategy organisation founded in 2014 by Katja Grace and Paul Christiano. They are affiliated with (a project of, with independent financing from) MIRI. They do various pieces of strategic background work, especially on AI Timelines, AI takeoff speed etc.—it seems their previous work on the relative rarity of discontinuous progress has been relatively influential. Their research can be found here. During the year Kokotajlo left (temporarily?) for CLR, and Asya may be leaving for FHI. edit 2020-12-25: They have now published an annual review here.

Research

A lot of the work on the website is essentially in the form of a continuously updated private wiki—see here. This makes it a little difficult for our typical technique, which relies on being able to evaluate specific publications released at specific times. As such it is a little unfortunate that in the below we generally concentrate on their timestamped blogposts. They suggested readers might be interested in posts like these ones.

They have produced a series of pieces on how long it has historically taken for AIs to cover the human range (from beginner to expert to superhuman) for different tasks. This seems relevant because people only seem to really pay attention to AI progress in a field when it starts beating humans. These pieces include Starcraft, ImageNet, Go, Chess and Draughts.

Grace’s Discontinuous progress in history: an update details their extensive research into examples of discontinuities in technological progress. They find 10 such examples, across construction, travel, weapons and compute. As well as being a very pleasant read, they had some interesting conclusions, for example that the discontinuities often occurred in non-optimised secondary features, and many occurred when something became just good enough to pass a threshold on another feature. Especially interesting to me is some of the things they found not to be discontinuities: AlexNet and Chess AI. Could this mean that future progress could ‘feel’ discontinuous in some important sense even if it doesn’t register as such on some objective benchmark? The individual trend writeups (e.g. penicillin here) are also interesting. See also here. #Forecasting

Kokotajlo’s Three kinds of competitiveness distinguishes between AI systems that will outperform, those that will be cheaper, and those that will arrive sooner. This is a very simple distinction that actually helped make things clearer; the post contains just enough to make the point and significance clear. #Strategy

Korzekwa’s Description vs simulated prediction describes the difference between modelling how steady technological progress was in the past, and thinking about how predictable it was in the past. For example, the speedup that aeroplanes offered for transatlantic travel (relative to ships) was presumably quite predictable to someone who knew about progress in aeronautics, even though it was very sudden. #Forecasting

Kokotajlo’s Relevant pre-AGI possibilities is a scenario simulator for different future developments. Basically you enter probabilities for a bunch of relevant things that could happen and it randomly generates a future. By clicking repeatedly, you can get a representative sense of the sort of futures your beliefs entail. #Forecasting

Korzekwa’s Preliminary survey of prescient actions attempts to find historical cases where humans have taken advance action to solve an unprecedented problem. It does not find any examples better than the classic Szilard case. This could be good news—that, in practice, there is always feedback, so the problem is not as easy as we thought—or it could be bad news—we have to solve a type of problem we have literally never solved before (or not very much news, to the extent it is only preliminary). #Forecasting

Grace’s Atari early notes that AI mastery of Atari games seems to have arrived significantly earlier than experts previously expected. #Forecasting

Finances

They spent $315,000 in 2019 and $300,000 in 2020, and plan to spend around $200,000 in 2021. They have around $190,000 in cash and pledged funding, suggesting (on a very naïve calculation) around 0.95 years of runway. In the past they have received support from EA organisations like OpenPhil and FHI. MIRI administers their finances on their behalf; donations can be made here.
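The core of Kokotajlo’s scenario generator can be imitated in a few lines: roll each possibility independently against the probability you assigned it, and each ‘click’ yields one sampled future. The events and probabilities below are placeholders of mine, not the tool’s:

```python
import random

# Each call samples one future by independently rolling each possibility
# against the user-assigned probability (placeholder events and numbers).

POSSIBILITIES = {
    "major AI-driven job displacement": 0.6,
    "international AI treaty": 0.2,
    "persuasion tools widespread": 0.5,
}

def sample_future(possibilities, seed=None):
    """Return the (sorted) list of possibilities that occur in one sampled future."""
    rng = random.Random(seed)
    return sorted(e for e, p in possibilities.items() if rng.random() < p)

# Repeated sampling shows the distribution of futures your probabilities entail.
for i in range(3):
    print(sample_future(POSSIBILITIES, seed=i))
```

The independence assumption is of course the weak point; the real tool lets correlations sneak in via the user’s judgment across clicks.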

Leverhulme Center for the Future of Intelligence

Leverhulme is a Cambridge based Research organisation founded in 2015 and currently led by Stephen Cave. They are affiliated with Cambridge University and closely linked to CSER. They do work on a variety of AI related causes, mainly on near-term issues but also some long-term. You can find their publications here. They have a document listing some of their achievements here.

Research

Leverhulme-affiliated researchers produced work on a variety of topics; I have only summarised here that which seemed the most relevant to AI safety.

Hernandez-Orallo et al.’s AI Paradigms and AI Safety: Mapping Artefacts and Techniques to Safety Issues performs algorithmic analysis of AI papers to determine trends. One interesting thing they pick up on (perhaps obvious in retrospect) is that (generally near-term) ‘safety’ related papers peak within any given paradigm after the paradigm itself. Researchers from CSER were also named authors on the paper. #Strategy

Whittlestone & Ovadya’s The tension between openness and prudence in responsible AI research discusses the conflict between traditional CS openness norms and the new ones we are trying to create. They decompose this conflict in various ways. The focus of the paper is on near-term issues, but the principle clearly matters for the big issue. Researchers from Leverhulme were also named authors on the paper. #Strategy

Crosby et al.’s The Animal-AI Testbed and Competition produces a series of tests of AI ability based on animal IQ tests. This is an alternative to traditional tests like Atari, with the appeal being their practical relevance and reduced overfitting (as some of the tests are not in the training data). Presumably the benefit here is to improve out-of-distribution performance. #Misc

Zerilli et al.’s Algorithmic Decision-Making and the Control Problem discusses the problem of humans growing complacent and overly deferential towards AI systems they are meant to be monitoring. If the system is ‘always right’, eventually you are just going to click ‘confirm’ without thinking. #NearAI

Peters et al.’s Responsible AI—Two Frameworks for Ethical Design Practice discusses some ethical principles for engineers. #NearAI

Hollanek’s AI transparency: a matter of reconciling design with critique attempts to apply literary criticism to AI transparency. #NearAI

Bhatt et al.’s Machine Learning Explainability for External Stakeholders gathered focus groups to discuss how to make AI transparent to outsiders (not just designers). #NearAI

Cave & Dihal’s The Whiteness of AI worries that too many AIs are depicted as being coloured white. It seems to me it would be roughly equally (im)plausible to say it would be problematic if robots (from the Slavic word for forced labour) were black. #NearAI

Leverhulme researchers contributed to the following research led by other organisations:

AI Safety camp

AISC is an internationally based independent residential research camp organisation founded in 2018 by Linda Linsefors and currently led by Remmelt Ellen. They bring together people who want to start doing technical AI research, hosting a 10-day camp aiming to produce publishable research. Their research can be found here. To the extent they can provide an on-ramp to get more technically proficient researchers into the field I think this is potentially very valuable. But I haven’t personally experienced the camps, or even spoken to anyone who has.

Research

Makiievskyi et al.’s Assessing Generalization in Reward Learning with Procedurally Generated Games tries to train RL algorithms on various games to generalise to new environments. They generally found this was difficult. #RL

Finances

They spent $23,085 in 2019 and $11,162 in 2020, and plan to spend around $53,000 in 2021. They have around $28,851 in cash and pledged funding, suggesting (on a very naïve calculation) around 0.5 years of runway. They are run by volunteers, and are considering professionalising, depending on the amount of donations they receive. If you want to donate, the web page is here.

FLI: The Future of Life Institute

FLI is a Boston-based independent existential risk organisation, focusing on outreach, founded in large part to help organise the regranting of $10m from Elon Musk. One of their major projects is trying to ban Lethal Autonomous Weapons. They wrote a letter to the EU calling for stricter regulation, with 120 signatories, here. They have a very good podcast on AI Alignment here.

Research

Aguirre’s Why those who care about catastrophic and existential risk should care about autonomous weapons argues that we should work towards a ban on Lethal Autonomous Weapons. This is not only because they might be destabilising WMDs, but also as a ‘practice run’ for future regulation of AI. #NearAI

Convergence

Convergence is a globally based independent existential risk research organisation, of which Justin Shovelain founded an earlier version in 2015 and David Kristoffersson joined as cofounder in 2018. They do strategic research about Xrisks in general as well as some AI-specific work. Their research can be found here. Their short summary can be found here. Justin Shovelain and David Kristoffersson are the two full-time members of Convergence, but they have had other people on part-time for periods of time, such as Michael Aird in the first half of 2020, and Alexandra Johnson.

Research

Shovelain & Aird’s Using vector fields to visualise preferences and make them consistent discusses the idea of using vector fields as a representation of local preferences, and then using curl as a measure of their (in)consistency. I liked this as a clear and less black-boxy-than-ML account of how preferences were being represented. It would be good to see some more on whether the Helmholtz theorem gives us the sorts of properties we want in addition to removing the curl. #ValueLearning

Aird’s Existential risks are not just about humanity argues that, despite its being technically excluded from the definition, we should take into account the possibility of positive-value alien-originating life when we consider existential risks. #Strategy

Aird et al.’s Memetic downside risks: How ideas can evolve and cause harm discusses the risk of ideas becoming distorted over time in the retelling. This includes predictions about the average direction in which memes will evolve: for example, towards simplicity. (They suggested this might be a more important article on a similar subject but I haven’t had time to read it.) #Strategy

They suggested readers might also be interested in this, this and this.

Finances

They spent $50,000 in 2019 and $13,000 in 2020, and plan to spend around $30,000 in 2021. They have around $37,000 in cash and pledged funding, suggesting (on a very naïve calculation) around 1.2 years of runway. Though they are not actively seeking donations at the moment, if you wanted to donate you could do so here.
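The curl test in the vector-fields paper can be illustrated with a minimal numerical sketch (the grid, the utility function, and the cyclic perturbation below are my own toy constructions, not from the post): a preference field that is the gradient of a utility function has zero curl, while adding a preference cycle makes the curl nonzero, which is the component a Helmholtz-style decomposition would remove.

```python
import numpy as np

n = 50
h = 1.0 / (n - 1)  # grid spacing
xs, ys = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n), indexing="ij")

# Consistent preferences: the gradient of a (toy) utility function.
u = -(xs - 0.5) ** 2 - (ys - 0.5) ** 2
fx, fy = np.gradient(u, h)  # local "direction of improvement" at each option

# Inconsistent preferences: the same field plus a cyclic (rotational) component.
gx, gy = fx - (ys - 0.5), fy + (xs - 0.5)

def mean_abs_curl(vx, vy, spacing):
    """Mean |scalar curl| = |d(vy)/dx - d(vx)/dy| over the grid."""
    dvy_dx = np.gradient(vy, spacing, axis=0)
    dvx_dy = np.gradient(vx, spacing, axis=1)
    return float(np.abs(dvy_dx - dvx_dy).mean())

print(mean_abs_curl(fx, fy, h))  # ≈ 0: gradient fields are curl-free (consistent)
print(mean_abs_curl(gx, gy, h))  # ≈ 2: the added preference cycle shows up as curl
```

For these polynomial fields the finite-difference curl is essentially exact, so the consistent field reports (numerically) zero curl and the perturbed field reports the size of the cycle that was added.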

Median Group

Median is a Berkeley-based independent AI strategy organisation founded in 2018 by Jessica Taylor, Bryce Hidysmith, Jack Gallagher, Ben Hoffman, Colleen McKenzie, and Baeo Maltinsky. They do research on various risks, including AI timelines. Their research can be found here. Their website does not list any relevant research for 2020. They did not reply when I asked them about their finances. Median doesn’t seem to be soliciting donations from the general public at this time.

AI Pulse

The Program on Understanding Law, Science, and Evidence (PULSE) is part of the UCLA School of Law, and contains a group working on AI policy. They were founded in 2017 with a $1.5m grant from OpenPhil. Their website does not list any research for 2020 that seemed relevant to existential safety.

Other Research

I would like to emphasize that there is a lot of research I didn’t have time to review, especially in this section, as I focused on reading organisation-donation-relevant pieces. So please do not consider it an insult that your work was overlooked!

Benadè et al.’s Preference Elicitation for Participatory Budgeting works on how to get people to share their preferences, and then combine this information. In particular they separate the preference-inferring step from the aggregation step, exploring multiple input and aggregation methodologies. Some of this paper was from 2016 but I missed it then and figured enough was new to warrant a mention here. #ValueLearning

Qian et al.’s AI Governance in 2019: A Year in Review is a collected volume of articles on governance from over 50 different authors. Both China and the West are well represented. (I have not read all the individual articles.) Researchers from OpenAI, CHAI and CSER were also named authors on the paper. #Politics

Krakovna’s Possible takeaways from the coronavirus pandemic for slow AI takeoff discusses the significance of our covid performance for AGI strategy. It discusses the ways in which, even though the pandemic was quite slow moving and clearly predictably disastrous, western governments failed to act, suggesting there might be similar failures in a slow AGI takeoff. I also recommend Wei’s comment, which points out that the disaster easily became politicised—it is truly impressive (-ly dire) that the partisan positions in the US managed to flip three times without ever producing an effective response. Indeed it seems plausible to me that on net government intervention made the pandemic worse. (The author works for FLI and Deepmind but this seems to be a separate ‘personal’ article.) See also the discussion here. #Strategy

Ngo’s AGI Safety from First Principles presents Richard’s account of the case for AI risk. This is basically the idea that, by creating AGI, humankind might end up as only the world’s second most powerful species. I think most readers will probably (unsurprisingly) agree with him here; it seems like a very good account of the core argument, which is nice to have newer versions of. #Overview

Ecoffet & Lehman’s Reinforcement Learning Under Moral Uncertainty is the first paper I’ve seen trying to implement and test different approaches to moral uncertainty in an RL setting. Obviously harkening to Will’s thesis, though they restrict to theories with cardinal utilities only—which seems, to my mind, to assume away the hardest part. They compare expectation maximisation to voting systems, and test on trolley problems. #RL

Hendrycks et al.’s Aligning AI with Shared Human Values showcases a data set of moral examples (e.g. property damage is wrong) and trains various transformer text algorithms on it. I like the way they use deliberately uncontroversial examples; I think we will do much better if we can get agents who get 99% of situations correct than by re-litigating the culture war by proxy. As a first pass we should consider their results as a sort of benchmark for future work using the database. Researchers from CHAI were also named authors on the paper. #

Benaich & Hogarth’s State of AI Report 2020 is an overview of the AI industry in 2020 by two investors. It is very detailed, but not that directly relevant. #Overview

Wilkinson’s In defence of fanaticism offers the first defence of EV-maximisation fanaticism that I have ever seen. It includes counterarguments against the common rejections (which, let’s face it, often resemble David Lewis’s incredulous stare), as well as two nice dilemmas for the non-fanatic. See also the discussion here. #DecisionTheory

Linsefors & Hepburn’s Announcing AI Safety Support describes a group they have created to try to support people entering the field. #Strategy

Aird’s Failures in technology forecasting? A reply to Ord and Yudkowsky discusses the examples that Eliezer and Toby use as evidence for the difficulty of predicting technological development, and argues that it is not clear these examples really show this. For example, the quote about Wilbur Wright doubting the possibility of flight looks more like a moment of depression than a forecast that would have been taken seriously by contemporaries. Overall I thought his "these examples seem somewhat cherry-picked" argument was the most convincing. #Forecasting

Scholl & Hanson’s Testing the Automation Revolution Hypothesis evaluates predictions of AI-driven unemployment. They find that these predictions have had low but positive explanatory value for predicting which jobs have been automated so far. Researchers from FHI were also named authors on the paper. #NearAI

Xu et al.’s Recipes for Safety in Open-domain Chatbots discusses various ways of preventing a chatbot from saying offensive things. #ValueLearning
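To make the expectation-maximisation vs voting comparison from the moral-uncertainty paper concrete, here is a toy sketch outside any RL setting (the theories, credences, and all utility numbers are invented for illustration): the two aggregation rules can pick different actions from identical inputs, because a voting rule discards utility magnitudes and keeps only each theory's ranking.

```python
# Toy moral-uncertainty aggregation: two theories with credences, and each
# theory's cardinal utility for three actions (all numbers invented).
credences = {"utilitarian": 0.6, "deontological": 0.4}
utilities = {
    "utilitarian":   {"pull_lever": 5.0,   "do_nothing": -5.0, "walk_away": 0.0},
    "deontological": {"pull_lever": -10.0, "do_nothing": 1.0,  "walk_away": 0.0},
}
actions = ["pull_lever", "do_nothing", "walk_away"]

# 1) Maximise expected choiceworthiness: credence-weighted utility.
def expected_choiceworthiness(action):
    return sum(credences[t] * utilities[t][action] for t in credences)

mec_choice = max(actions, key=expected_choiceworthiness)

# 2) A voting rule (Borda): each theory ranks the actions, and only the
#    rank positions count, weighted by credence.
def borda_score(action):
    score = 0.0
    for t in credences:
        ranking = sorted(actions, key=lambda a: utilities[t][a])  # worst..best
        score += credences[t] * ranking.index(action)
    return score

vote_choice = max(actions, key=borda_score)

print(mec_choice)   # walk_away: the -10 outcome drags pull_lever's expectation down
print(vote_choice)  # pull_lever: Borda only sees ranks, not how bad -10 is
```

This is the sense in which restricting to cardinal-utility theories matters: the voting rule needs only orderings, while expected choiceworthiness needs intertheoretically comparable magnitudes.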

Capital Allocators

One of my goals with this document is to help donors make an informed choice between the different organisations. However, it is quite possible that you regard this as too difficult, and wish instead to donate to someone else who will allocate on your behalf. This is of course much easier; now instead of having to solve the Organisation Evaluation Problem, all you need to do is solve the dramatically simpler Organisation Evaluator Organisation Evaluation Problem.

LTFF: Long-term future fund

LTFF is a globally based EA grantmaking organisation founded in 2017, currently led by Matt Wage and affiliated with CEA, but probably becoming independent (along with the other EA funds under Jonas Vollmer) in 2021. They are one of four funds set up by CEA to allow individual donors to benefit from specialised capital allocators; this one focuses on long-term future issues, including a large focus on AI Alignment. Their website is here. Write-ups for their three grant rounds in 2020 are here, here and here, and comments here, here and here. As the November 2019 round was not public when I wrote last year I have included it in some of the analysis below. They also did an AMA recently here. The fund is now run by five people, and the grants have gone to a wide variety of causes, many of which would simply not be accessible to individual donors. The fund managers are currently:

OpenPhil: The Open Philanthropy Project

The Open Philanthropy Project (separated from GiveWell in 2017) is an organisation dedicated to advising Cari and Dustin Moskovitz on how to give away over $15bn to a variety of causes, including existential risk. They have made extensive donations in this area and probably represent both the largest pool of EA-aligned capital and the largest team of EA capital allocators. They recently described their strategy for AI governance, at a very high level, here. It is possible that the partnership with Ben Delo we discussed last year may not occur.

Grants

You can see their grants for AI Risk here. It lists 21 AI Risk grants in 2020, plus 4 others for global catastrophic risks and several highly relevant ‘other’ grants. In total I estimate they spent about $19m on AI in 2020. The largest grants were:

SFF: The Survival and Flourishing Fund

SFF (website) is a donor-advised fund, advised by the people who make up BERI’s Board of Directors. SFF was initially funded in 2019 by a grant of approximately $2 million from BERI, which in turn was funded by donations from philanthropist Jaan Tallinn; it is now also distributing money from Jed McCaleb.

Grants

In its grantmaking SFF uses an innovative allocation process to combine the views of many grant evaluators (described here). SFF has published the results of one grantmaking round this year (described here), in which they donated around $1.8m, of which I estimate around $1.2m was AI-related; the largest donations in the round were to:

Other Organisations

80,000 Hours

80k provides career advice and guidance to people interested in improving the world, with a specific focus on AI safety. 80,000 Hours’ AI/ML safety research job board collects various jobs that could be valuable for people interested in AI safety. At the time of writing it listed 80 positions, all of which seemed like good options that it would be valuable to have sensible people fill. I suspect most people looking for AI jobs would find some here they hadn’t heard of otherwise, though of course for any given person many will not be appropriate. They also have job boards for other EA causes. #Careers

They also run a very good podcast; readers might be specifically interested in this or this.

Other News

NeurIPS rejected four papers this year for being ‘unethical’. Waymo is (finally) offering a true driverless Uber experience to the general public in Phoenix. The pope suggested we pray for AI alignment. There was a minor pandemic.

Methodological Thoughts

Inside View vs Outside View

This document is written mainly, but not exclusively, using publicly available information. In the tradition of active management, I hope to synthesise many individually well-known facts into a whole which provides new and useful insight to readers. The advantages of this are that 1) it is relatively unbiased, compared to inside information, which invariably favours those you are close to socially, and 2) most of it is legible and verifiable to readers. The disadvantage is that there are probably many pertinent facts that I am not a party to! Wei Dai has written about how much discussion now takes place in private google documents – for example this Drexler piece apparently; in most cases I do not have access to these. If you want the inside scoop I am not your guy; all I can supply is exterior scooping.

We focus on papers, rather than outreach or other activities. This is partly because they are much easier to measure: while there has been a large increase in interest in AI safety over the last year, it’s hard to work out whom to credit for this. And it is partly because I think progress has to come from persuading AI researchers, which happens through technical outreach and publishing good work, not popular/political work.

Organisations vs Individuals

Many capital allocators in the bay area seem to operate under a sort of Great Man theory of investment, whereby the most important thing is to identify a guy to invest in who is really clever and ‘gets it’. I think there is a lot of merit in this (as argued here for example); however, I think I believe in it less than they do. Perhaps as a result of my institutional investment background, I place a lot more weight on historical results. In particular, I worry that this approach leads to over-funding skilled rhetoricians and those the investor/donor is socially connected to. Also, as a practical matter, it is hard for individual donors to fund individual researchers. But as part of a concession to the individual-first view I’ve started asking organisations if anyone significant has joined or left recently, though in practice I think organisations are far more willing to highlight new people joining than old people leaving.

Judging organisations on their historical output is naturally going to favour more mature organisations. A new startup, whose value all lies in the future, will be disadvantaged. However, I think that this is the correct approach for donors who are not tightly connected to the organisations in question. The newer the organisation, the more funding should come from people with close knowledge. As organisations mature, and have more easily verifiable signals of quality, their funding sources can transition to larger pools of less expert money. This is how it works for startups turning into public companies and I think the same model applies here. (I actually think that even those with close personal knowledge should use historical results more, to help overcome their biases.)

This judgement involves analysing a large number of papers relating to Xrisk that were produced during 2020. Hopefully the year-to-year volatility of output is sufficiently low that this is a reasonable metric; I have tried to indicate cases where this doesn’t apply. I also attempted to include papers from December 2019, to take into account the fact that I’m missing the last month’s worth of output from 2020, but I can’t be sure I did this successfully.

Research Inclusion Criteria

In general I have tried to evaluate and summarise, at least briefly, the work organisations did that is primarily concerned with AI or general Xrisk strategy. But this has been a rather subjective and imperfectly applied criterion, implemented primarily through my sense of ‘does this seem relevant to the task at hand’.

Politics

My impression is that policy on most subjects, especially those that are more technical than emotional, is generally made by the government and civil servants in consultation with, and being lobbied by, outside experts and interests. Without expert (e.g. top ML researchers in academia and industry) consensus, no useful policy will be enacted. Pushing directly for policy seems if anything likely to hinder expert consensus. Attempts to directly influence the government to regulate AI research seem very adversarial, and risk being pattern-matched to ignorant technophobic opposition to GM foods or other kinds of progress. We don’t want the ‘us-vs-them’ situation that has occurred with climate change to happen here. AI researchers who are dismissive of safety law, regarding it as an imposition and encumbrance to be endured or evaded, will probably be harder to convince of the need to voluntarily be extra-safe—especially as the regulations may actually be totally ineffective. The only case I can think of where scientists are relatively happy about punitive safety regulation, nuclear power, is one where many of those initially concerned were scientists themselves; it also had the effect of basically ending any progress in nuclear power (at great cost to climate change). Given this, I actually think policy outreach to the general population is probably negative in expectation. If you’re interested in this, I’d recommend you read this blog post from a few years back.

Openness

I think there is a strong case to be made that openness in AGI capacity development is bad. As such I do not ascribe any positive value to programs to ‘democratize AI’ or similar.

One interesting question is how to evaluate non-public research. For a lot of safety research, openness is clearly the best strategy. But what about safety research that has, or potentially has, capabilities implications, or other infohazards? In this case it seems best if the researchers do not publish it. However, this leaves funders in a tough position: how can we judge researchers if we cannot read their work? Maybe instead of doing top-secret valuable research they are just slacking off. If we donate to people who say "trust me, it’s very important and has to be secret" we risk being taken advantage of by charlatans; but if we refuse to fund, we incentivize people to reveal possible infohazards for the sake of money. (Is it even a good idea to publicise that someone else is doing secret research?)

For similar reasons I prefer research not to be behind paywalls or inside expensive books, but this seems a significantly less important issue. More prosaically, organisations should make sure to upload the research they have published to their websites! Having gone to all the trouble of doing useful research, it is a constant shock to me how many organisations don’t take this simple step to significantly increase the reach of their work. Additionally, several times I have come across incorrect information on organisations’ websites.

Research Flywheel

My basic model for AI safety success is this:

Differential AI progress

There are many problems that need to be solved before we have safe general AI, one of which is not producing unsafe general AI in the meantime. If nobody was doing non-safety-conscious research there would be little risk or haste to AGI – though we would be missing out on the potential benefits of safe AI. There are several consequences of this:

Financial Reserves

Charities like having financial reserves to provide runway and guarantee that they will be able to keep the lights on for the immediate future. This could be justified if you thought that charities were expensive to create and destroy, and were worried about this occurring by accident due to the whims of donors. Unlike a company which sells a product, a charity cannot count on ongoing revenue, so it seems reasonable that charities should be more concerned about this.

Donors prefer charities not to have too much in reserves. Firstly, those reserves are cash that could be being spent on outcomes now, by either the specific charity or others. Valuable future activities by charities are supported by future donations; they do not need to be pre-funded. Additionally, having reserves increases the risk of organisations ‘going rogue’, because they are insulated from the need to convince donors of their value. As such, in general I do not give full credence to charities saying they need more funding when they want much more than 18 months or so of runway in the bank. If you have a year’s reserves now, after this December you will have that plus whatever you raise now, giving you a margin of safety before raising again next year.

I estimated reserves = (cash and grants) / (2021 budget). In general I think of this as something of a measure of urgency. However, despite being prima facie a very simple calculation, there are many issues with this data. As such these figures should be considered suggestive only.
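The reserves calculation above is simple enough to sketch directly; the figures reused below are the ones reported in the organisation sections earlier (AI Safety Camp and Convergence).

```python
def runway_years(cash_and_grants: float, next_year_budget: float) -> float:
    """Naive runway estimate: reserves divided by the coming year's planned spending."""
    return cash_and_grants / next_year_budget

# AI Safety Camp: ~$28,851 in cash and pledged funding vs a ~$53,000 planned 2021 budget
print(round(runway_years(28_851, 53_000), 1))  # 0.5

# Convergence: ~$37,000 vs a ~$30,000 planned 2021 budget
print(round(runway_years(37_000, 30_000), 1))  # 1.2
```

As noted above, this is a very naive measure: pledged funding may not arrive, and next year's budget is itself a plan rather than a commitment.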

Donation Matching

In general I believe that charity-specific donation matching schemes are somewhat dishonest, despite my having provided matching funding for at least one in the past. Ironically, despite this view being espoused by GiveWell (albeit back in 2011), it is essentially OpenPhil’s policy, at least in some cases, to artificially limit their funding to 50% or 60% of a charity’s need, which some charities have argued effectively provides a 1:1 match for outside donors. I think this is bad. In the best case this forces outside donors to step in, imposing marketing costs on the charity and research costs on the donors. In the worst case it leaves valuable projects unfunded. Obviously cause-neutral donation matching is different and should be exploited. Everyone should max out their corporate matching programs if possible, and things like the annual Facebook Match continue to be great opportunities.

Poor Quality Research

Partly thanks to the efforts of the community, the field of AI safety is considerably better respected and funded than was previously the case, which has attracted a lot of new researchers. While generally good, one side effect of this (perhaps combined with the fact that many low-hanging fruits of the insight tree have been plucked) is that a considerable amount of low-quality work has been produced. For example, there are a lot of papers which can be accurately summarized as asserting "just use ML to learn ethics". Furthermore, the conventional peer review system seems to be extremely bad at dealing with this issue.

The standard view here is just to ignore low-quality work. This has many advantages, for example 1) it requires little effort, and 2) it doesn’t annoy people. This conspiracy of silence seems to be the strategy adopted by most scientific fields, except in extreme cases like anti-vaxxers. However, I think there are some downsides to this strategy. A sufficiently large mass of low-quality work might degrade the reputation of the field, deterring potentially high-quality contributors. While low-quality contributions might help improve Concrete Problems’ citation count, they may use up scarce funding. Moreover, it is not clear to me that ‘just ignore it’ really generalizes as a community strategy. Perhaps you, enlightened reader, can judge that "How to solve AI Ethics: Just use RNNs" is not great. But is it really efficient to require everyone to independently work this out? Furthermore, I suspect that the idea that we can all just ignore the weak stuff is somewhat an example of the typical mind fallacy. Several times I have come across people I respect according respect to work I found clearly pointless. And several times I have come across people I respect arguing persuasively that work I had previously respected was very bad – but I only learnt they believed this by chance! So I think it is quite possible that many people will waste a lot of time as a result of this strategy, especially if they don’t happen to move in the right social circles.

Having said all that, I am not a fan of unilateral action, and am somewhat selfishly conflict-averse, so will largely continue to abide by this non-aggression convention. My only deviation here is to make it explicit. If you’re interested in this you might enjoy this piece by 80,000 Hours.

The Bay Area

Much of the AI and EA communities, and especially the EA community concerned with AI, is located in the Bay Area, especially Berkeley and San Francisco. It does have advantages—like proximity to good CS universities—but it is an extremely expensive place, and is dysfunctional both politically and socially. Aside from the lack of electricity and aggressive homelessness, it seems to attract people who are extremely weird in socially undesirable ways – and induces this in those who move there—though to be fair the people who are doing useful work in AI organisations seem to be drawn from a better distribution than the broader community. In general I think the centralization is bad, but if there must be centralization I would prefer it be almost anywhere other than Berkeley. Additionally, I think many funders are geographically myopic, and biased towards funding things in the Bay Area. As such, I have a mild preference towards funding non-Bay-Area projects.

Conclusions

The size of the field continues to grow, both in terms of funding and researchers. Both make it increasingly hard for individual donors. I’ve attempted to subjectively weigh the productivity of the different organisations against the resources they used to generate that output, and donate accordingly. My constant wish is to promote a lively intellect and independent decision-making among readers; hopefully my laying out the facts as I see them above will prove helpful to some readers. Here is my eventual decision, rot13′d so you can do come to your own conclusions first (which I strongly recommend): Na vapernfvatyl ynetr nzbhag bs gur orfg jbex vf orvat qbar va cynprf gung qb abg frrz yvxryl gb orarsvg sebz znetvany shaqvat: SUV, Qrrczvaq, BcraNV rgp. Juvyr n tbbq qrirybczrag birenyy—V nz pregnvayl irel cyrnfrq gung Qrrczvaq naq BcraNV unir fhpu cebqhpgvir grnzf—vg zrnaf jr pna’g ernyyl qb zhpu urer. ZVEV frrzf gb unir tbbq crbcyr naq n tbbq genpx erpbeq, naq gurl fznyy nzbhag gurl eryrnfr vf fgebat. Ohg V pna’g rasbepr shaqvat n ynetr betnavfngvba jvgubhg gnatvoyr rivqrapr sbe znal lrnef. Bs gur cynprf qbvat svefg-pynff grpuavpny erfrnepu, PUNV frrzf gb zr gb or gur bar gung pbhyq zbfg perqvoyl orarsvg sebz zber shaqvat. V nz n yvggyr pbaprearq Ebuva vf yrnivat, nf ur jnf n irel fgebat pbagevohgbe, naq gur evfx jvgu npnqrzvp vafgvghgvbaf vf gurl trg ‘qvfgenpgrq’. Ohg birenyy V guvax gurl erznva irel cebzvfvat fb V vagraq gb znxr n fvtavsvpnag qbangvba urer. Va gur cnfg V unir orra dhvgr unefu ba PFRE orpnhfr V sryg gung n ybg bs gurve jbex jnf abg irel eryrinag. Vg qbrf frrz fhowrpgviryl gb zr gung gurve cebqhpgvivgl naq sbphf unf fvtavsvpnagyl vzcebirq ubjrire. V guvax OREV ner irel vagrerfgvat. Gurve fgengrtl frrzf gb bssre gur punapr gb fvtavsvpnagyl obbfg npnqrzvp (naq guhf znvafgernz-pbaarpgrq naq fgnghf vzohvat) erfrnepu juvyr znvagnvavat n sbphf ba gur zvffvba gung zvtug or ybfg jvgu qverpg tenagf. 
Zl bar pbaprea urer vf gung gurl ner fbzrguvat bs n bar-zna bcrengvba, naq juvyr V jnf irel snzvyvne jvgu Pevgpu V xabj irel yvggyr nobhg Fnjlre. Ohg birenyy V guvax guvf vf irel cebzvfvat fb V jvyy cebonoyl or qbangvat. Abgr gung guvf vf vaqverpgyl fhccbegvat PFRE nf jryy nf bgure betf yvxr SUV, PUNV rgp. Svanyyl, V pbagvahr gb yvxr gur YGSS. V’z n yvggyr pbaprearq nobhg hcpbzvat cbffvoyr crefbaary punatrf jura gurl fcva bhg bs PRN, naq jbhyq cersre vs gurl qvqa’g tenag gb betnavfngvbaf ynetr rabhtu gb eha gurve bja shaqenvfvat pnzcnvtaf (naq urapr pna or rinyhngrq ol vaqvivqhny qbabef). Ohg birenyy V guvax vg vf irel nggenpgvir gb shaq fznyy cebwrpgf, naq V nz abg njner bs nal bgure nirahr sbe fznyy qbabef gb genpgnoyl qb guvf. Fb V jvyy or qbangvat gb gurz ntnva guvf lrne. However, I wish to emphasize that all the above organisations seem to be doing good work on the most important issue facing mankind. It is the nature of making decisions under scarcity that we must prioritize some over others, and I hope that all organisations will understand that this necessarily involves negative comparisons at times. Thanks for reading this far; hopefully you found it useful. Apologies to everyone who did valuable work that I excluded! If you found this post helpful, and especially if it helped inform your donations, please consider letting me and any organisations you donate to as a result know. If you are interested in helping out with next year’s article, please get in touch, and perhaps we can work something out.

Disclosures

I have not in general checked all the proofs in these papers, and similarly trust that researchers have honestly reported the results of their simulations. I was a Summer Fellow at MIRI back when it was SIAI and volunteered briefly at GWWC (part of CEA). My wife has done some contract work for OpenPhil. I have no financial ties beyond being a donor and have never been romantically involved with anyone else who has ever worked at any of the other organisations. I shared drafts of the individual organisation sections with representatives from LTFF, FHI, MIRI, CHAI, GCRI, CSER, Ought, AI Impacts, BERI, CLR, GPI, OpenPhil, Convergence. My eternal gratitude to my anonymous reviewers for their invaluable help, and especially Jess Riedel for the volume and insight of his comments. Any remaining mistakes are of course my own. I would also like to thank my wife and daughter for tolerating all the time I have spent/​invested/​wasted on this. Negative thanks goes to The Wuhan Institute of Virology and Paradox Interactive.

Sources

This is a list of all the cited articles that have their own individual paragraph above. It does not include articles that are only referenced in-line, typically with the word 'here'.

Aird, Michael—Existential risks are not just about humanity − 2020-04-27 - https://forum.effectivealtruism.org/posts/EfCCgpvQX359xuZ4g/are-existential-risks-just-about-humanity
Aird, Michael—Failures in technology forecasting? A reply to Ord and Yudkowsky − 2020-05-08 - https://www.lesswrong.com/posts/3qypPmmNHEmqegoFF/failures-in-technology-forecasting-a-reply-to-ord-and
Aird, Michael; Shovelain, Justin—Using vector fields to visualise preferences and make them consistent − 2020-01-28 - https://www.lesswrong.com/posts/ky988ePJvCRhmCwGo/using-vector-fields-to-visualise-preferences-and-make-them#comments
Aird, Michael; Shovelain, Justin; Kristoffersson, David—Memetic downside risks: How ideas can evolve and cause harm − 2020-02-25 - https://www.lesswrong.com/posts/EdAHNdbkGR6ndAPJD/memetic-downside-risks-how-ideas-can-evolve-and-cause-harm
AlphaFold Team—AlphaFold: a solution to a 50-year-old grand challenge in biology − 2020-11-30 - https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology
Althaus, David; Baumann, Tobias—Reducing long-term risks from malevolent actors − 2020-04-29 - https://forum.effectivealtruism.org/posts/LpkXtFXdsRd4rG8Kb/reducing-long-term-risks-from-malevolent-actors#comments
Aguirre, Anthony—Why those who care about catastrophic and existential risk should care about autonomous weapons − 2020-11-11 - https://www.lesswrong.com/posts/Btrmh6T62tB4g9RMc/why-those-who-care-about-catastrophic-and-existential-risk#comments
Armstrong, Stuart; Leike, Jan; Orseau, Laurent; Legg, Shane—Pitfalls of Learning a Reward Function Online − 2020-04-28 - https://arxiv.org/abs/2004.13654
Ashurst, Carolyn; Anderljung, Markus; Prunkl, Carina; Leike, Jan; Gal, Yarin; Shevlane, Toby; Dafoe, Allan—A Guide to Writing the NeurIPS Impact Statement − 2020-05-13 - https://medium.com/@GovAI/a-guide-to-writing-the-neurips-impact-statement-4293b723f832
Avin, Shahar; Gruetzemacher, Ross; Fox, James—Exploring AI Futures Through Role Play − 2020-02-26 - https://arxiv.org/abs/1912.08964
Barnes, Beth; Christiano, Paul—Writeup: Progress on AI Safety via Debate − 2020-02-05 - https://www.alignmentforum.org/posts/Br4xDbYu4Frwrb64a/writeup-progress-on-ai-safety-via-debate-1
Baum, Seth—Accounting for violent conflict risk in planetary defense decisions − 2020-09-09 - http://gcrinstitute.org/accounting-for-violent-conflict-risk-in-planetary-defense-decisions/
Baum, Seth—Artificial Interdisciplinarity: Artificial Intelligence for Research on Complex Societal Problems − 2020-07-14 - http://gcrinstitute.org/artificial-interdisciplinarity-artificial-intelligence-for-research-on-complex-societal-problems/
Baum, Seth—Medium-Term Artificial Intelligence and Society − 2020-02-16 - http://gcrinstitute.org/medium-term-artificial-intelligence-and-society/
Baum, Seth—Quantifying the Probability of Existential Catastrophe: A Reply to Beard et al. − 2020-08-10 - http://gcrinstitute.org/quantifying-the-probability-of-existential-catastrophe-a-reply-to-beard-et-al/
Beard, Simon; Kaczmarek, Patrick—On the Wrongness of Human Extinction − 2020-02-21 - https://www.cser.ac.uk/resources/wrongness-human-extinction/
Beard, Simon; Rowe, Thomas; Fox, James—An Analysis and Evaluation of Methods Currently Used to Quantify the Likelihood of Existential Hazards − 2019-12-03 - https://www.sciencedirect.com/science/article/pii/S0016328719303313
Beard, Simon; Rowe, Thomas; Fox, James—Existential risk assessment: A reply to Baum − 2020-07-15 - https://sci-hub.do/10.1016/j.futures.2020.102606
Belfield, Haydn—Activism by the AI Community: Analysing Recent Achievements and Future Prospects − 2020-02-26 - https://www.cser.ac.uk/resources/activism-ai-community-analysing-recent-achievements-and-future-prospects/
Belfield, Haydn; Hernández-Orallo, José; hÉigeartaigh, Seán Ó; Maas, Matthijs M.; Hagerty, Alexa; Whittlestone, Jess—Response to the European Commission's consultation on AI − 2020-02-19 - https://www.cser.ac.uk/resources/response-european-commissions-consultation-ai/
Benadè, Gerdus; Nath, Swaprava; Procaccia, Ariel D.; Shah, Nisarg—Preference Elicitation for Participatory Budgeting − 2020-10-27 - https://pubsonline.informs.org/doi/10.1287/mnsc.2020.3666
Benaich, Nathan; Hogarth, Ian—State of AI Report 2020 − 2020-09-01 - https://docs.google.com/presentation/d/1ZUimafgXCBSLsgbacd6-a-dqO7yLyzIl1ZJbiCBUUT4/edit#slide=id.g9348791e5b_1_7
Bhatt, Umang; Andrus, McKane; Weller, Adrian; Xiang, Alice—Machine Learning Explainability for External Stakeholders − 2020-07-10 - https://arxiv.org/abs/2007.05408v1
Bobu, Andreea; Scobee, Dexter R.R.; Fisac, Jaime F.; Sastry, S. Shankar; Dragan, Anca D.—LESS is More: Rethinking Probabilistic Models of Human Behavior − 2020-01-13 - https://arxiv.org/abs/2001.04465
Bostrom, Nick; Shulman, Carl—Sharing the World with Digital Minds − 2020-10-01 - http://www.nickbostrom.com/papers/monster.pdf
Bostrom, Nick; Belfield, Haydn; Hilton, Sam—Written Evidence to the UK Parliament Science & Technology Committee's Inquiry on A new UK research funding agency − 2020-09-16 - https://www.cser.ac.uk/resources/written-evidence-uk-arpa-key-recommendations/
Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon; Ramesh, Aditya; Ziegler, Daniel M.; Wu, Jeffrey; Winter, Clemens; Hesse, Christopher; Chen, Mark; Sigler, Eric; Litwin, Mateusz; Gray, Scott; Chess, Benjamin; Clark, Jack; Berner, Christopher; McCandlish, Sam; Radford, Alec; Sutskever, Ilya; Amodei, Dario—Language Models are Few-Shot Learners − 2020-05-28 - https://arxiv.org/abs/2005.14165
Brundage, Miles; Avin, Shahar; Wang, Jasmine; Belfield, Haydn; Krueger, Gretchen; Hadfield, Gillian; Khlaaf, Heidy; Yang, Jingying; Toner, Helen; Fong, Ruth; Maharaj, Tegan; Koh, Pang Wei; Hooker, Sara; Leung, Jade; Trask, Andrew; Bluemke, Emma; Lebensold, Jonathan; O'Keefe, Cullen; Koren, Mark; Ryffel, Théo; Rubinovitz, JB; Besiroglu, Tamay; Carugati, Federica; Clark, Jack; Eckersley, Peter; Haas, Sarah de; Johnson, Maritza; Laurie, Ben; Ingerman, Alex; Krawczuk, Igor; Askell, Amanda; Cammarota, Rosario; Lohn, Andrew; Krueger, David; Stix, Charlotte; Henderson, Peter; Graham, Logan; Prunkl, Carina; Martin, Bianca; Seger, Elizabeth; Zilberman, Noa; hÉigeartaigh, Seán Ó; Kroeger, Frens; Sastry, Girish; Kagan, Rebecca; Weller, Adrian; Tse, Brian; Barnes, Elizabeth; Dafoe, Allan; Scharre, Paul; Herbert-Voss, Ariel; Rasser, Martijn; Sodhani, Shagun; Flynn, Carrick; Gilbert, Thomas Krendl; Dyer, Lisa; Khan, Saif; Bengio, Yoshua; Anderljung, Markus—Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims − 2020-04-15 - https://arxiv.org/abs/2004.07213
Burden, John; Hernandez-Orallo, Jose—Exploring AI Safety in Degrees: Generality, Capability and Control − 2020-08-10 - https://www.cser.ac.uk/resources/exploring-ai-safety-degrees-generality-capability-and-control/
Byun, Jungwon; Stuhlmüller, Andreas—Automating reasoning about the future at Ought − 2020-11-09 - https://ought.org/updates/2020-11-09-forecasting
Carey, Ryan; Langlois, Eric; Everitt, Tom; Legg, Shane—The Incentives that Shape Behaviour − 2020-01-20 - https://arxiv.org/abs/2001.07118
Carlsmith, Joseph—How Much Computational Power Does It Take to Match the Human Brain? − 2020-09-11 - https://www.openphilanthropy.org/brain-computation-report
Cave, Stephen; Dihal, Kanta—The Whiteness of AI − 2020-08-06 - http://lcfi.ac.uk/resources/whiteness-ai/
Christian, Brian—The Alignment Problem: Machine Learning and Human Values − 2020-09-06 - https://www.amazon.com/Alignment-Problem-Machine-Learning-Values-ebook/dp/B085T55LGK/ref=tmm_kin_swatch_0?encoding=UTF8&qid=&sr=
Christiano, Paul—"Unsupervised" translation as an (intent) alignment problem − 2020-09-29 - https://ai-alignment.com/unsupervised-translation-as-a-safety-problem-99ae1f9b6b68
Cihon, Peter; Maas, Matthijs M.; Kemp, Luke—Should Artificial Intelligence Governance be Centralised? Design Lessons from History − 2020-01-10 - https://arxiv.org/abs/2001.03573
Clarke, Sam—Clarifying "What failure looks like" (part 1) − 2020-09-20 - https://www.alignmentforum.org/posts/v6Q7T335KCMxujhZu/clarifying-what-failure-looks-like-part-1
Clifton, Jesse—Equilibrium and prior selection problems in multipolar deployment − 2020-04-02 - https://www.alignmentforum.org/posts/Tdu3tGT4i24qcLESh/equilibrium-and-prior-selection-problems-in-multipolar-1#comments
Clifton, Jesse; Riche, Maxime—Towards Cooperation in Learning Games − 2020-11-15 - https://longtermrisk.org/files/toward_cooperation_learning_games_oct_2020.pdf
Cohen, Michael; Hutter, Marcus—Curiosity Killed the Cat and the Asymptotically Optimal Agent − 2020-06-05 - https://arxiv.org/abs/2006.03357
Cohen, Michael; Hutter, Marcus—Pessimism About Unknown Unknowns Inspires Conservatism − 2020-06-15 - https://arxiv.org/abs/2006.08753
Cotra, Ajeya—Report on AI Timelines − 2020-10-18 - https://www.alignmentforum.org/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines
Cotton‐Barratt, Owen; Daniel, Max; Sandberg, Anders—Defence in Depth Against Human Extinction: Prevention, Response, Resilience, and Why They All Matter − 2020-01-24 - https://onlinelibrary.wiley.com/doi/full/10.1111/1758-5899.12786
Cremer, Carla; Whittlestone, Jess—Canaries in Technology Mines: Warning Signs of Transformative Progress in AI − 2020-09-24 - https://www.fhi.ox.ac.uk/publications/canaries-in-technology-mines-warning-signs-of-transformative-progress-in-ai-cremer-and-whittlestone/
Critch, Andrew—Some AI research areas and their relevance to existential safety − 2020-11-18 - https://www.alignmentforum.org/posts/hvGoYXi2kgnS3vxqb/some-ai-research-areas-and-their-relevance-to-existential-1
Critch, Andrew; Krueger, David—AI Research Considerations for Human Existential Safety (ARCHES) − 2020-05-30 - https://arxiv.org/abs/2006.04948
Crosby, Matthew; Beyret, Benjamin; Shanahan, Murray; Hernández-Orallo, José; Cheke, Lucy; Halina, Marta—The Animal-AI Testbed and Competition − 2020-09-22 - http://lcfi.ac.uk/resources/animal-ai-testbed-and-competition-paper-purblished/
Demski, Abram—Radical Probabilism − 2020-08-18 - https://www.lesswrong.com/s/HmANELvkhAZ9eDxFS/p/xJyY5QkQvNJpZLJRo
Ding, Jeffrey; Dafoe, Allan—The Logic of Strategic Assets: From Oil to AI − 2020-01-09 - https://arxiv.org/ftp/arxiv/papers/2001/2001.03246.pdf
Freedman, Rachel; Shah, Rohin; Dragan, Anca—Choice Set Misspecification in Reward Inference − 2020-09-10 - http://ceur-ws.org/Vol-2640/paper_14.pdf
Gabriel, Iason—Artificial Intelligence, Values and Alignment − 2020-01-13 - https://arxiv.org/abs/2001.09768
Garfinkel, Ben—Does Economic History Point Towards a Singularity? − 2020-09-02 - https://forum.effectivealtruism.org/posts/CWFn9qAKsRibpCGq8/does-economic-history-point-toward-a-singularity
Garrabrant, Scott—Cartesian Frames − 2020-10-22 - https://www.alignmentforum.org/s/2A7rrZ4ySx6R8mfoT
Gleave, Adam; Dennis, Michael; Legg, Shane; Russell, Stuart; Leike, Jan—Quantifying Differences in Reward Functions − 2020-10-08 - https://arxiv.org/abs/2006.13900
Grace, Katja—Atari early − 2020-04-01 - https://aiimpacts.org/atari-early/
Grace, Katja—Discontinuous progress in history: an update − 2020-04-13 - https://aiimpacts.org/discontinuous-progress-in-history-an-update/
Halpern, Joseph; Piermont, Evan—Dynamic Awareness − 2020-07-06 - https://arxiv.org/abs/2007.02823
hÉigeartaigh, Seán Ó; Whittlestone, Jess; Liu, Yang; Zeng, Yi; Liu, Zhe—Overcoming Barriers to Cross-cultural Cooperation in AI Ethics and Governance − 2020-05-15 - https://link.springer.com/article/10.1007/s13347-020-00402-x
Hendrycks, Dan; Burns, Collin; Basart, Steven; Critch, Andrew; Li, Jerry; Song, Dawn; Steinhardt, Jacob—Aligning AI with Shared Human Values − 2020-08-05 - https://arxiv.org/abs/2008.02275
Henighan, Tom; Kaplan, Jared; Katz, Mor; Chen, Mark; Hesse, Christopher; Jackson, Jacob; Jun, Heewoo; Brown, Tom B.; Dhariwal, Prafulla; Gray, Scott; Hallacy, Chris; Mann, Benjamin; Radford, Alec; Ramesh, Aditya; Ryder, Nick; Ziegler, Daniel M.; Schulman, John; Amodei, Dario; McCandlish, Sam—Scaling Laws for Autoregressive Generative Modeling − 2020-11-06 - https://arxiv.org/abs/2010.14701
Hernandez-Orallo, Jose; Martinez-Plumed, Fernando; Avin, Shahar; Whittlestone, Jess; hÉigeartaigh, Seán Ó—AI Paradigms and AI Safety: Mapping Artefacts and Techniques to Safety Issues − 2020-08-10 - https://www.cser.ac.uk/resources/ai-paradigms-and-ai-safety-mapping-artefacts-and-techniques-safety-issues/
Hollanek, Tomasz—AI transparency: a matter of reconciling design with critique − 2020-11-17 - https://link.springer.com/article/10.1007%2Fs00146-020-01110-y#author-information
Hubinger, Evan—An overview of 11 proposals for building safe advanced AI − 2020-05-29 - https://www.alignmentforum.org/posts/fRsjBseRuvRhMPPE5/an-overview-of-11-proposals-for-building-safe-advanced-ai
Hwang, Tim—Shaping the Terrain of AI Competition − 2020-06-15 - https://cset.georgetown.edu/research/shaping-the-terrain-of-ai-competition/
Imbrie, Andrew; Kania, Elsa; Laskai, Lorand—The Question of Comparative Advantage in Artificial Intelligence: Enduring Strengths and Emerging Challenges for the United States − 2020-01-15 - https://cset.georgetown.edu/research/the-question-of-comparative-advantage-in-artificial-intelligence-enduring-strengths-and-emerging-challenges-for-the-united-states/
John, Tyler; MacAskill, William—Longtermist institutional reform − 2020-07-30 - https://philpapers.org/rec/JOHLIR
Kemp, Luke; Rhodes, Catherine—The Cartography of Global Catastrophic Risks − 2020-01-06 - https://www.cser.ac.uk/resources/cartography-global-catastrophic-governance/
Kokotajlo, Daniel—Relevant pre-AGI possibilities − 2020-06-18 - https://aiimpacts.org/relevant-pre-agi-possibilities/
Kokotajlo, Daniel—Three kinds of competitiveness − 2020-03-30 - https://aiimpacts.org/three-kinds-of-competitiveness/
Korzekwa, Rick—Description vs simulated prediction − 2020-04-22 - https://aiimpacts.org/description-vs-simulated-prediction/
Korzekwa, Rick—Preliminary survey of prescient actions − 2020-04-08 - https://aiimpacts.org/survey-of-prescient-actions/
Kovařík, Vojtěch; Carey, Ryan—(When) Is Truth-telling Favored in AI Debate? − 2019-12-15 - https://arxiv.org/abs/1911.04266
Krakovna, Victoria—Possible takeaways from the coronavirus pandemic for slow AI takeoff − 2020-05-31 - https://vkrakovna.wordpress.com/2020/05/31/possible-takeaways-from-the-coronavirus-pandemic-for-slow-ai-takeoff/
Krakovna, Victoria; Orseau, Laurent; Ngo, Richard; Martic, Miljan; Legg, Shane—Avoiding Side Effects By Considering Future Tasks − 2020-10-15 - https://arxiv.org/abs/2010.07877v1
Krakovna, Victoria; Uesato, Jonathan; Mikulik, Vladimir; Rahtz, Matthew; Everitt, Tom; Kumar, Ramana; Kenton, Zac; Leike, Jan; Legg, Shane—Specification gaming: the flip side of AI ingenuity − 2020-04-21 - https://deepmind.com/blog/article/Specification-gaming-the-flip-side-of-AI-ingenuity
Lehman, Joel—Reinforcement Learning Under Moral Uncertainty − 2020-06-15 - https://arxiv.org/abs/2006.04734
Linsefors, Linda; Hepburn, JJ—Announcing AI Safety Support − 2020-11-19 - https://forum.effectivealtruism.org/posts/wpQ2qhF8Z6oonsaPX/announcing-ai-safety-support
MacAskill, Will—Are we living at the hinge of history? − 2020-09-01 - https://globalprioritiesinstitute.org/wp-content/uploads/William-MacAskill_Are-we-living-at-the-hinge-of-history.pdf
Makiievskyi, Anton; Zhou, Liang; Chiswick, Max—Assessing Generalization in Reward Learning with Procedurally Generated Games − 2020-08-30 - https://towardsdatascience.com/assessing-generalization-in-reward-learning-intro-and-background-da6c99d9e48
Mogensen, Andreas—Moral demands and the far future − 2020-06-01 - https://globalprioritiesinstitute.org/wp-content/uploads/Working-Paper-1-2020-Andreas-Mogensen.pdf
Mogensen, Andreas; Thorstad, David—Tough enough? Robust satisficing as a decision norm for long-term policy analysis − 2020-11-01 - https://globalprioritiesinstitute.org/wp-content/uploads/Tough-Enough_Andreas-Mogensen-and-David-Thorstad.pdf
Ngo, Richard—AGI Safety from First Principles − 2020-09-28 - https://www.alignmentforum.org/s/mzgtmmTKKn5MuCzFJ
Nguyen, Chi; Christiano, Paul—My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda − 2020-08-15 - https://www.lesswrong.com/posts/PT8vSxsusqWuN7JXp#comments
O'Keefe, Cullen; Cihon, Peter; Garfinkel, Ben; Flynn, Carrick; Leung, Jade; Dafoe, Allan—The Windfall Clause: Distributing the Benefits of AI for the Common Good − 2020-01-30 - https://www.fhi.ox.ac.uk/windfallclause/
O'Brien, John; Nelson, Cassidy—Assessing the Risks Posed by the Convergence of Artificial Intelligence and Biotechnology − 2020-06-17 - https://www.liebertpub.com/doi/full/10.1089/hs.2019.0122
O'Keefe, Cullen—How will National Security Considerations affect Antitrust Decisions in AI? An Examination of Historical Precedents − 2020-07-28 - https://forum.effectivealtruism.org/out?url=https%3A%2F%2Fwww.fhi.ox.ac.uk%2Fwp-content%2Fuploads%2FHow-Will-National-Security-Considerations-Affect-Antitrust-Decisions-in-AI-Cullen-OKeefe.pdf
Ord, Toby—The Precipice − 2020-03-24 - https://www.amazon.com/Precipice-Existential-Risk-Future-Humanity-ebook/dp/B07V9GHKYP/ref=tmm_kin_swatch_0?encoding=UTF8&qid=&sr=
Peters, Dorian; Vold, Karina; Robinson, Diana; Calvo, Rafael—Responsible AI—Two Frameworks for Ethical Design Practice − 2020-02-15 - https://ieeexplore.ieee.org/document/9001063/authors#authors
Prunkl, Carina; Whittlestone, Jess—Beyond Near- and Long-Term: Towards a Clearer Account of Research Priorities in AI Ethics and Society − 2020-01-13 - https://arxiv.org/abs/2001.04335
Qian, Shi; Hui, Li; Tse, Brian; Hopcroft, John; Russell, Stuart; Jeanmaire, Caroline; Qiang, Yang; Fung, Pascale; Yampolskiy, Roman; Dafoe, Allan; Anderljung, Markus; Hadfield, Gillian; Wright, Don; Brundage, Miles; Clark, Jack; Solaiman, Irene; Krueger, Gretchen; O' hEigeartaigh, Sean; Toner, Helen; Liu, Millie; Hoffman, Steve; Beridze, Irakli; Wallach, Wendell; Hodes, Cyrus; Miailhe, Nicolas; Newman, Jessica; Dingding, Chen; Kaili, Eva; Jun, Su; Hagendorff, Thilo; Ahrweiler, Petra; Williams, Robin; Allen, Colin; Wang, Poon; Carbonell, Ferran; Ziaohong, Wang; Qingfend, Yang; Qi, Yin; Rossi, Francesca; Stix, Charlotte; Daly, Angela; Gal, Danit; Ema, Arisa; Yihan, Goh; Remolina, Nydia; Aneja, Urvashi; Ying, Fu; Zhiyun, Zhao; Xiuquan, Li; Weiwen, Duan; Qun, Luan; Rui, Guo; Yingchun, Wang—AI Governance in 2019: A Year in Review − 2020-04-15 - https://www.aigovernancereview.com/
Reddy, Siddharth; Dragan, Anca D.; Levine, Sergey; Legg, Shane; Leike, Jan—Learning Human Objectives by Evaluating Hypothetical Behavior − 2019-12-05 - https://arxiv.org/abs/1912.05652
Russell, Stuart; Norvig, Peter—Artificial Intelligence: A Modern Approach, 4th Edition − 2020-01-01 - https://www.pearson.com/us/higher-education/program/Russell-Artificial-Intelligence-A-Modern-Approach-4th-Edition/PGM1263338.html
Saunders, William; Rachbach, Ben; Evans, Owain; Byun, Jungwon; Stuhlmüller, Andreas—Evaluating Arguments One Step at a Time − 2020-01-11 - https://ought.org/updates/2020-01-11-arguments
Scholl, Keller; Hanson, Robin—Testing the Automation Revolution Hypothesis − 2019-12-10 - https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3496364
Shah, Rohin—AI Alignment 2018-19 Review − 2020-01-27 - https://www.alignmentforum.org/posts/dKxX76SCfCvceJXHv/ai-alignment-2018-19-review#Short_version___1_6k_words
Shevlane, Toby; Dafoe, Allan—The Offense-Defense Balance of Scientific Knowledge: Does Publishing AI Research Reduce Misuse? − 2020-12-27 - https://arxiv.org/abs/2001.00463
Snyder-Beattie, Andrew; Sandberg, Anders; Drexler, Eric; Bonsall, Michael—The Timing of Evolutionary Transitions Suggests Intelligent Life Is Rare − 2020-11-19 - https://www.liebertpub.com/doi/full/10.1089/ast.2019.2149
Stiennon, Nisan; Ouyang, Long; Wu, Jeff; Ziegler, Daniel M.; Lowe, Ryan; Voss, Chelsea; Radford, Alec; Amodei, Dario; Christiano, Paul—Learning to Summarize with Human Feedback − 2020-09-04 - https://openai.com/blog/learning-to-summarize-with-human-feedback/
Tarsney, Christian—Exceeding Expectations: Stochastic Dominance as a General Decision Theory − 2020-08-08 - https://globalprioritiesinstitute.org/wp-content/uploads/Christian-Tarsney_Exceeding-Expectations_Stochastic-Dominance-as-a-General-Decision-Theory.pdf
Tarsney, Christian; Thomas, Teruji—Non-Additive Axiologies in Large Worlds − 2020-09-01 - https://globalprioritiesinstitute.org/wp-content/uploads/Christian-Tarsney-and-Teruji-Thomas_Non-Additive-Axiologies-in-Large-Worlds.pdf
Thorstad, David; Mogensen, Andreas—Heuristics for clueless agents: how to get away with ignoring what matters most in ordinary decision-making − 2020-06-01 - https://globalprioritiesinstitute.org/wp-content/uploads/David-Thorstad-Andreas-Mogensen-Heuristics-for-clueless-agents.pdf
Trammell, Philip; Korinek, Anton—Economic growth under transformative AI − 2020-10-08 - https://globalprioritiesinstitute.org/wp-content/uploads/Philip-Trammell-and-Anton-Korinek_Economic-Growth-under-Transformative-AI.pdf
Tucker, Aaron; Anderljung, Markus; Dafoe, Allan—Social and Governance Implications of Improved Data Efficiency − 2020-01-14 - https://arxiv.org/pdf/2001.05068.pdf
Tzachor, Asaf; Whittlestone, Jess; Sundaram, Lalitha; hÉigeartaigh, Seán Ó—Artificial intelligence in a crisis needs ethics with urgency − 2020-12-02 - https://www.nature.com/articles/s42256-020-0195-0
Uesato, Jonathan; Kumar, Ramana; Krakovna, Victoria; Everitt, Tom; Ngo, Richard; Legg, Shane—Avoiding Tampering Incentives in Deep RL via Decoupled Approval − 2020-11-17 - https://arxiv.org/abs/2011.08827
Wilkinson, Hayden—In defence of fanaticism − 2020-08-01 - https://globalprioritiesinstitute.org/wp-content/uploads/Hayden-Wilkinson_In-defence-of-fanaticism.pdf
Xu, Jing; Ju, Da; Li, Margaret; Boureau, Y-Lan; Weston, Jason; Dinan, Emily—Recipes for Safety in Open-domain Chatbots − 2020-09-14 - https://arxiv.org/abs/2010.07079
Zerilli, John; Knott, Alistair; Maclaurin, James; Gavaghan, Colin—Algorithmic Decision-Making and the Control Problem − 2020-12-11 - https://link.springer.com/article/10.1007%2Fs11023-019-09513-7

Comment

https://www.lesswrong.com/posts/pTYDdcag9pTzFQ7vw/2020-ai-alignment-literature-review-and-charity-comparison?commentId=gM2frCaW4LWrqZhTK

OpenAI was initially funded with money from Elon Musk as a not-for-profit.

This is commonly said on the basis of his $1b pledge, but AFAICT Musk wound up contributing little or nothing before he resigned ~2018. If you look at the OA Form 990s, Musk is never listed as a donor, only a board member; the only entities that are listed as contributing money or loans are Sam Altman, Y Combinator Research, and OpenAI LP.

Comment

https://www.lesswrong.com/posts/pTYDdcag9pTzFQ7vw/2020-ai-alignment-literature-review-and-charity-comparison?commentId=9ahujsKmfWvKHofBS

"This is commonly said on the basis of his $1b pledge"

Wasn't it supposed to be a total of $1b pledged, from a variety of sources, including Reid Hoffman and Peter Thiel, rather than $1b just from Musk?

EDIT: yes, it was:

"Sam, Greg, Elon, Reid Hoffman, Jessica Livingston, Peter Thiel, Amazon Web Services (AWS), Infosys, and YC Research are donating to support OpenAI. In total, these funders have committed $1 billion, although we expect to only spend a tiny fraction of this in the next few years."

https://openai.com/blog/introducing-openai/

https://www.lesswrong.com/posts/pTYDdcag9pTzFQ7vw/2020-ai-alignment-literature-review-and-charity-comparison?commentId=ZGv9wBwjeazdobrgy

He’s definitely given some money, and I don’t think the 990 absence means much. From here:

in 2016, the IRS was still processing OpenAI’s non-profit status, making it impossible for the organization to receive charitable donations. Instead, the Musk Foundation gave $10m to another young charity, YC.org. [...] The Musk Foundation’s grant accounted for the majority of YC.org’s revenue, and almost all of its own funding, when it passed along $10m to OpenAI later that year.

Also, when he quit in 2018, OpenAI wrote "Elon Musk will depart the OpenAI Board but will continue to donate and advise the organization". The same blog post lists multiple other donors than Sam Altman, so donating to OpenAI without showing up on the 990s must be the default, for some reason.

Comment

That’s interesting. I did see YC listed as a major funding source, but given Sam Altman’s listed loans/​donations, I assumed, because YC has little or nothing to do with Musk, that YC’s interest was Altman, Paul Graham, or just YC collectively. I hadn’t seen anything at all about YC being used as a cutout for Musk. So assuming the Guardian didn’t screw up its understanding of the finances there completely (the media is constantly making mistakes in reporting on finances and charities in particular, but this seems pretty detailed and specific and hard to get wrong), I agree that that confirms Musk did donate money to get OA started and it was a meaningful sum.

But it still does not seem that Musk donated the majority or even plurality of OA donations, much less the $1b constantly quoted (or any large fraction of the $1b collective pledge, per ESRogs).

https://www.lesswrong.com/posts/pTYDdcag9pTzFQ7vw/2020-ai-alignment-literature-review-and-charity-comparison?commentId=CPPkfLnRc4yyfru7p

"the only entities that are listed as contributing money or loans are Sam Altman, Y Combinator Research, and OpenAI LP"

Possible that he funded OpenAI LP? Or was that only created later, and funded by Microsoft and other non-founding investors?

https://www.lesswrong.com/posts/pTYDdcag9pTzFQ7vw/2020-ai-alignment-literature-review-and-charity-comparison?commentId=tzPKS8LyncDuq7pm4

AI Impacts now has a 2020 review page so it’s easier to tell what we’ve done this year—this should be more complete /​ representative than the posts listed above. (I appreciate how annoying the continuously updating wiki model is.)

Comment

https://www.lesswrong.com/posts/pTYDdcag9pTzFQ7vw/2020-ai-alignment-literature-review-and-charity-comparison?commentId=KnNa7uuRrGxDKNFGG

Thanks, added.

https://www.lesswrong.com/posts/pTYDdcag9pTzFQ7vw/2020-ai-alignment-literature-review-and-charity-comparison?commentId=N3W6RGf4MEBahxCNf

Promoted to curated: Even if God and Santa Claus are not real, we do experience a Christmas miracle every year in the form of these amazingly thorough reviews by Larks. Thank you for your amazing work, as this continues to be an invaluable resource to anyone trying to navigate the AI Alignment landscape, whether as a researcher, grantmaker or independent thinker.

Comment

https://www.lesswrong.com/posts/pTYDdcag9pTzFQ7vw/2020-ai-alignment-literature-review-and-charity-comparison?commentId=Y59XJai7Pbb82EwC5

Even if God and Santa Claus are not real, we do experience a Christmas miracle every year in the form of these amazingly thorough reviews by Larks. What an amazing sentence.

https://www.lesswrong.com/posts/pTYDdcag9pTzFQ7vw/2020-ai-alignment-literature-review-and-charity-comparison?commentId=GknsgbwrLBRTP44Ks

hurrah! victory for larks, with yet another comprehensive review! how long can he keep it up? another decade? i hope so! (Also I had 3 laugh-out-loud moments. I will let the studious reader find all your hidden jokes.)

https://www.lesswrong.com/posts/pTYDdcag9pTzFQ7vw/2020-ai-alignment-literature-review-and-charity-comparison?commentId=FLLRRxegAbJeXk7TN

"no AI safety relevant publications in 2019 or 2020, and only one is a coauthor on what I would consider a highly relevant paper."

Context: I'm an OpenPhil fellow who is doing work on robustness, machine ethics, and forecasting. I published several papers on the research called for in Concrete Problems in AI Safety and OpenPhil's/Steinhardt's AI Alignment Research Overview. The work helped build a trustworthy ML community and aimed at reducing accident risks given very short AI timelines. Save for the first paper I helped with (when I was trying to learn the ropes), the motivation for the other dozen or so papers was always safety. These papers have nothing to do with RL and are about DL, and they do not fit in with the type of technical research shared on this forum, which might be why these are not considered "highly relevant." Some (not all) of the OpenPhil fellows are working on safety, though with OpenPhil's broader research agenda.

Comment

https://www.lesswrong.com/posts/pTYDdcag9pTzFQ7vw/2020-ai-alignment-literature-review-and-charity-comparison?commentId=QAka7q8Y6HKaPmcP3

Hey Daniel, thanks very much for the comment. In my database I have you down as class of 2020, hence out of scope for that analysis, which covered the class of 2018 only. I didn't include the 2019 or 2020 classes because I figured it takes time to find your footing, do research, write it up etc., so absence of evidence would not be very strong evidence of absence. So please don't consider this as any reflection on you. Ironically I actually did review one of your papers in the above—this one—which I did indeed think was pretty relevant! (Ctrl+F 'Hendrycks' to find the paragraph in the article.) Sorry if this was not clear from the text.

https://www.lesswrong.com/posts/pTYDdcag9pTzFQ7vw/2020-ai-alignment-literature-review-and-charity-comparison?commentId=ZYdaho2mfN6k5PKZ6

Planned summary for the Alignment Newsletter:

<@The tradition continues@>(@2019 AI Alignment Literature Review and Charity Comparison@)! I'll say nearly the same thing as I did last year:

This mammoth post goes through the work done within AI alignment from December 2019 to November 2020, from the perspective of someone trying to decide which of several AI alignment organizations to donate to. As part of this endeavor, Larks summarizes several papers that were published at various organizations, and compares them to their budget and room for more funding.

Planned opinion: I look forward to this post every year. It continues to be a stark demonstration of how much work doesn't get covered in this newsletter—while I tend to focus on the technical alignment problem, with some focus on AI governance and AI capabilities, this literature review spans many organizations working on existential risk, and as such has many papers that were never covered in this newsletter. Anyone who wants to donate to an organization working on AI alignment and/or x-risk should read this post. Last year I mentioned I might write an overview for the sake of building inside view models (rather than donation decisions); this went out shortly afterward (AN #84). I don't expect to write a similar post this year, partly because I think last year's post is still quite good as an overview of the discussion that's been happening.

https://www.lesswrong.com/posts/pTYDdcag9pTzFQ7vw/2020-ai-alignment-literature-review-and-charity-comparison?commentId=9X4CvhAxudbPJkS4K

Re: rot13 and donation decisions: I'm not worried about CHAI as a whole getting 'distracted', at least in the next couple of years. There are several people still at CHAI who are very definitively focused on existential risk—Andrew Critch, Adam Gleave, and Daniel Filan, for example. It's also led by Stuart Russell, who is probably the figure most associated with AI x-risk research amongst the AI research community, with perhaps the exception of Nick Bostrom.

If you think that I'm better at reducing x-risk than they are (e.g. because my beliefs are better), then that seems like a plausible reason to worry. I do expect that the research that CHAI does will be somewhat different now that I am not there; I have many disagreements with many people at CHAI. But I wouldn't worry about them being distracted.

I personally donated to CHAI this year; part of my reasoning was that CHAI could use unrestricted funding for research engineering roles. (Note that I am now at DeepMind, so I was less worried about conflicts of interest, though obviously I still have many friends at CHAI.)

"Poor Quality Research"

I wanted to note I agree with most of this section, and have adopted a similar policy. I do think there are more downsides to calling out poor research—the main one I'd note is that, because communication is hard, people (or at least I) tend to be systematically too negative about other people's research. (See previous link for more details.)