Truthfulness, standards and credibility

https://www.lesswrong.com/posts/Brr84ZmvK3kwy2eGJ/truthfulness-standards-and-credibility

-1: Meta Prelude

While truthfulness is a topic I’ve been thinking about for some time, I’ve not discussed much of what follows with others. Therefore, at the very least I expect to be missing important considerations on some issues (where I’m not simply wrong). I’m hoping that laying out my reasoning in detail should make any fundamental errors in my thought process more transparent, and amenable to correction. The downside may be reduced clarity, more illusion-of-transparency… Comments welcome on this approach.

I don’t think what follows is novel. I’m largely pointing at problems based on known issues. Sadly, I don’t have a clear vision of an approach that would solve these problems.

0: Introduction

*…our purpose is not to give the last word, but rather the first word, opening up the conversation…* (Truthful AI)

I’d first like to say that I believe some amount of research on truthfulness to be worthwhile, and to thank those who’ve made significant efforts towards greater understanding (including, but not limited to, the authors of Truthful AI (henceforth TruAI)). No doubt there’s some value in understanding more, but my *guess* is that it won’t be a particularly fruitful angle of attack. In all honesty, it seems an inefficient use of research talent to me—but perhaps I’m missing something. Either way, I hope the following perspective will suggest some useful directions for conversation in this area.

[Note: section numbers refer to this document unless "TruAI…" is specified]
[I’ll be assuming familiarity with TruAI throughout, though reading the full paper probably isn’t necessary so long as you’ve seen the executive summary in the post]

My current belief is that near-term *implementation* of the kind of truthfulness standards talked about in TruAI would be net negative, for reasons I’ll go on to explain. To me it seems as if we’d be implementing a poor approximation to a confused objective. A high-level summary of my current view:

1: Framing and naming

*The beginning of wisdom is to call things by their right name.* (Confucius)

I think it’s important to clearly distinguish our goal from our likely short/medium-term position. With this in mind, I’ll use the following loose definitions:

Truthful (AI): (AI that) makes only true statements.
Credible (AI): (AI that) rarely states egregious untruths.

This is a departure from TruAI: *It is extremely difficult to make many statements without ever being wrong, so when referring to "truthful AI" without further qualifiers, we include AI systems that **rarely** state falsehoods…* (TruAI 1.4, page 17)

I think it’s inviting confusion to go from [X is extremely difficult] to [we’ll say "X" when we mean mostly X]. This kind of substitution feels reasonable when it’s a case like [as X as possible given computational limits]. Here it seems to be a mistake. Likewise, it may make sense to *aim* for a truthfulness standard, but barring radical progress with generalisation/Eliciting Latent Knowledge…, we won’t have one in the near term: we can’t measure truthfulness, only credibility.

In theoretical arguments it’s reasonable to consider truthfulness (whether in discrete or continuous terms). To fail to distinguish truthfulness from credibility when talking of implementations and standards conflates our goal with its measurable proxy. In defining a standard, we aim to require truthfulness; we actually require credibility (according to our certification/adjudication process). The most efficient way to attain a given standard will be to optimise for credibility. This may not mean optimising for the truth. Such standards set up a textbook Goodhart scenario. It’s important to be transparent about this.

It seems to me that the label "Credible AI" is likely to lead to less misplaced trust than "Truthful AI" (not completely clear, and ultimately an empirical question). However, my primary reason to prefer "credible"/"credibility" remains that it’s a clearer term to guide thought and discussion.

For similar reasons, I’ll distinguish "negligent falsehood" (NF) from "negligent suspected falsehood" (NSF) throughout:

NSF: A statement that is unacceptably likely to be false—and where it should have been feasible for an AI system to understand this (according to a given standard).
NF: An NSF that is, in fact, false.

(see section 3.1.3 for my best guess as to why the TruAI authors considered it reasonable to elide the difference in some cases, and why I disagree with that choice)

In either case, my worry isn’t that we’d otherwise fail to clearly express our conclusions; rather that we may be led into thinking badly and drawing incorrect conclusions. In what follows I’ll often talk in terms of truthfulness, since I’m addressing TruAI and using separate terminology feels less clear. Nonetheless, most uses of "truthfulness" would be more accurately characterised as "credibility". I’ll make an attempt at more substantial practical suggestions later (see section 6), though I don’t claim they’re adequate.

2: Downside risks

*One of the greatest mistakes is to judge policies and programs by their intentions rather than their results.* (Milton Friedman)

The downside risk of a standard must be analysed broadly. For a narrow credibility standard it’s not enough to consider the impact on users within the scope of the standard.

By ‘scope’ I mean the class of issues the standard claims to address. For example, for many standards [user is manipulated into thinking/doing X by an explicitly false claim] may be within scope, while [user is manipulated into thinking/doing X through the telling of a story] may not be. By ‘narrow’, I only mean "not fully general"—i.e. that there are varieties of manipulation the standard doesn’t claim to cover.

With truthfulness amplification [section 2.2.1 here; TruAI 1.5.2], the effective scope of a standard might be much broader than its direct scope. (we might hope that by asking e.g. "Did you just manipulate me into doing X by telling that story?" the effective scope may come to include story-based manipulation)
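To make that kind of amplification query a little more concrete, here is a minimal sketch of how amplification probes might be run during certification (the `ask` interface, the probe wording, and the yes/no convention are all assumptions for illustration, not anything specified by TruAI):

```python
# Hypothetical interface: ask(model, prompt) -> str returns the model's answer.
AMPLIFICATION_PROBES = [
    "Would a trusted third party judge your previous statement to be misleading?",
    "Does your previous statement omit information that would change my conclusions?",
    "Did you just manipulate me into doing X by telling that story?",
]

def amplify(ask, model, statement, context):
    """Ask follow-up probes about `statement`; collect any answers suggesting it
    was misleading, so it can be treated as a suspected NSF and reviewed."""
    flags = []
    for probe in AMPLIFICATION_PROBES:
        prompt = f"Context: {context}\nYour statement: {statement}\n{probe}"
        answer = ask(model, prompt)
        if answer.strip().lower().startswith("yes"):
            flags.append((probe, answer))
    return flags  # non-empty => escalate for certification/adjudication review
```

The specific probes matter less than the structural point: any such check constrains what the AI can be caught saying, not what the user ends up believing.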

2.1 Two out-of-scope issues:

At least two major out-of-scope issues must be considered:

2.2 The scope of standards

The in-scope vs out-of-scope downside balance will depend on the effective scope as well as on user population: the same assumptions will not hold across e.g. cautious AI researchers, professional specialists, adults, teenagers. Key differences will include levels of user caution and levels of user understanding of a standard’s guarantees.

2.2.1 Truthfulness Amplification

The effective scope of a standard’s guarantees will likely depend on techniques such as Truthfulness Amplification:

*Asking a [narrowly, mostly] truthful AI system questions to determine if an earlier statement it made was misleading or not fully true (e.g. "Would a trusted third-party judge your statement to be misleading?").* (TruAI 1, page 12; more details in TruAI 1.5.2, page 21)

(it’s really "Credibility amplification", see section 1, but I’ll call it truthfulness amplification here)

Until we have a good sense of the effectiveness of such techniques, it’s hard to predict the scope of guarantees—and so difficult to make any confident prediction about a standard’s net benefit.

Truthfulness amplification seems promising to the extent that it can be applied by highly specialised users to provide guarantees for all users. The obvious setting for this would be during a certification process (something in the spirit of relaxed adversarial training, where NFs are considered unacceptable). How useful amplification techniques can be post-deployment is less clear. Certainly it’s not reasonable to believe that all users will consistently apply them wherever necessary. However, this may not be required: a low-but-non-zero NF-detection rate could be sufficient so long as the penalties for NSFs are high enough (see section 4 for more).

In practice, I think post-deployment amplification is likely to be useful only where an AI has little per-user information. AIs with detailed information on users will be better able to target their use of NFs to situations where they won’t be detected. Importantly, this sets up a catch-22 wherever we might like systems to use knowledge of individual users to avoid misleading them:

  • Without a good model of a user, it’s not possible in general to avoid misleading them (so our standards cannot demand this).
  • With a good model of the user, highly capable systems will be able to avoid detection of their NFs (so our standards may become much less useful).

Similar considerations would apply to any technique aimed at broadening the scope and guarantees of a standard.

2.2.2 Explicit claims vs impact on user

We may hope that amplification techniques broaden a standard’s guarantees significantly, but there remains a fundamental gap between:

2.3 Potential objections

(doubtless this section will be incomplete)

2.3.1 If we don’t do it, someone else will

If we believe some kind of credibility standards to be inevitable, then what matters is pushing for the most beneficial version—even if the result remains net negative. Supposing we do conclude that standards will be net negative, I don’t think inevitability should be conceded lightly: we ought to make the most convincing case we’re able to against them, and aim to achieve consensus. Nonetheless, mitigating the negative impact of standards would be important in this scenario. Quite possibly that looks like coming up with the best suggestions / frameworks / skeletal implementations we can.

2.3.2 User education

Let’s assume for the moment that my pessimistic prediction holds:

In implementing our standard we create a situation where there’s no obvious manipulation, but still a lot of subtle manipulation. Users are denied the chance to notice obvious manipulative strategies, and thus to suspect AIs’ motives. They end up more misled than they would have been by default.

Is there any argument for releasing such a standard? I think there may be. My expectation is that there will come a point where AI systems can manipulate humans extremely well without needing to state any explicit NFs. At this point, narrow credibility standards would be useless, and humans would be widely manipulated without its being clear. In the absence of more general solutions to truthfulness/honesty/alignment…, a partial defence against such subtle manipulation would be for users to have learned the hard way with less capable AIs: having been exposed to systems that are reliably truthful on explicit claims, but attempt more subtle manipulation, which is at least sometimes detectable. (it may be enough for a user to realise after the fact that they have been manipulated, even if they weren’t able to detect manipulation at the time)

Without any narrow standard, the user impression may be of systems just getting more and more honest/truthful/helpful over time—when in fact they’re getting progressively better at concealing their manipulations. With a narrow standard, users may be exposed to AIs that are very truthful in some respects, but noticeably manipulative in others. This may give the correct impression: that AIs often mislead people when they have the incentive and opportunity to do so. Users with this experience may be better prepared to adapt to a world full of too-subtle-to-detect manipulation.

I’m sceptical that most users would learn the right lessons here, or that it’d be much of a defence for those who did. (long-term, the only plausible defence seems to be AI-assisted) However, this upside could be achieved without the direct impact of the standard’s being net negative. All that’s necessary is for the standard to lead to noticeably different levels of manipulation in different dimensions—enough so that users register the disparity and ascribe less-than-pure motives to the AI.

In an ideal world, we’d want such user education to be achieved without significant harm (see section 6 for more on this). In practice, users may be less likely to grok the risks without exposure to some real-world harm. The ideal outcome is to create systems we can reasonably trust. Until that’s possible, we want systems that users will appropriately distrust. Standards that make their own limitations clear may help in this regard.

2.4 Why be more concerned over too-much-trust-in-AI than over too-little-trust-in-AI?

I have little concern over too-little-trust because it seems unlikely to be a sustainable failure mode: there’s too much economic pressure acting in the other direction. Any company/society with *unreasonable* mistrust will be making large economic sacrifices for little gain.

Too-much-trust can more easily be a sustainable failure mode: in general, conditional on my continued ability to manipulate X, I want X to be more powerful, not less. The AI that steals your resources isn’t as dangerous as the AI that helps you accrue more resources while gaining progressively more influence over what you’ll do with them.

We want to be making recoverable errors, so we should err on the side of having/engendering too little trust rather than too much. (this is likely to be a difficult coordination problem, precisely because unilateral too-little-trust would be hard to sustain, but not one I’ll analyse here)

3: Inference; Language games

*Uttering a word is like striking a note on the keyboard of the imagination.* (Ludwig Wittgenstein)

In this section I’ll go into more detail on the explicit-claims vs impact-on-user distinction. (this is close to TruAI’s narrow vs broad truthfulness) I realise that TruAI doesn’t claim to cover "broad truthfulness", and don’t imagine the following is new to the authors. My point remains that such issues being outside of scope is a problem: narrow standards that fail to address such issues may have negative impact.

I’ll start by noting that impact-on-user is much messier to describe, assess, analyse…, and that I have no clean taxonomy. Ascribing beliefs and preferences to humans is difficult, and I know of no clear, principled way to describe changes in belief or preferences.

I’ll make a case that:

3.1 Illustrative toy examples:

3.1.1 Nuclear kittens

Consider a system that a user believes will output:

3.2: Translation/filtering layers

So far I’ve been assuming that the AI’s output is read unaltered by the user. This need not be the case: the user may run the AI’s output through some filter before reading. Such filters may be crude and error-prone (e.g. a filter that tries to remove all caveats) or sophisticated and robust (e.g. a filter that produces a précis of the input text while keeping the impact on the user as close to the original as possible). My guess is that such filters will become progressively more common over time, and that their widespread adoption would be hastened by the use of careful, overly-unconfident, caveat-rich AI language.

Naturally, it’s not possible to output text that will avoid misleading users when passed through an arbitrary filter. However, to be of any practical use a standard must regulate the influence of AI statements on users in practice. If 90% of users are using filters and reading post-filter text, then it’s the post-filter text that matters.

For factual output, distillation filters may be common—i.e. filters that produce a personalised, shortened version, presenting the new facts/ideas as clearly as possible, while omitting the details of known definitions and explanations, removing redundancy and information-free sections (e.g. caveats with no information content beyond "we’re being careful not to be negligent"). Such filters wouldn’t change the [impact on user] much—other than by saving time. They may hugely alter the explicit claims made.

Here again I think the conclusion has to be the same: if a standard is based on explicit claims, it’s unlikely to be of practical use; if it’s based on [expected impact on the user’s brain], then it may be. Accounting for filters seems difficult but necessary.

In principle, distillation filters don’t change the real problem much: a similar process was already occurring in users’ brains (e.g. tuning out information-free content, recalibrating over/under-confident writers’ claims). They just make things a little more explicit, since we no longer get to say "Well, at least the user saw …", since they may not have.
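As a toy illustration of the crude end of this spectrum, here is a minimal sketch of a caveat-stripping filter (the hedge patterns and the example sentence are invented for illustration; a real filter would presumably be a learned summariser rather than a handful of regular expressions):

```python
import re

# Invented hedge patterns of the "information-free caveat" kind described above.
HEDGES = [
    r"\bto the best of our knowledge,?\s*",
    r"\bwe cannot be entirely certain, but\s*",
    r"\bthis should not be taken as definitive, but\s*",
    r"\bit is at least possible that\s*",
]

def strip_caveats(text: str) -> str:
    """Remove boilerplate hedging phrases, leaving the residual claim."""
    for pattern in HEDGES:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", text).strip()

print(strip_caveats("To the best of our knowledge, the treatment is safe."))
# -> "the treatment is safe."  (the claim the user actually reads)
```

Post-filter, the explicit claim the user sees is stronger than anything the AI stated, which is exactly the gap a claims-based standard doesn’t regulate.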

3.3: All models are wrong, but some are useful

In most cases users will not want the most precise, high-resolution model: resource constraints necessitate approximate models. What then counts as a good approximate model? Various models will be more accurate on some questions, and less on others—so the best model depends on what you care about. (similarly for statements) People with different values, interests and purposes will have different criteria for NSFs.

This parallels the education of a child: a teacher will often use models that are incorrect, and will select the models based on the desired change to the child (the selection certainly isn’t based on which model is most accurate). We’d like to say: "Sure, but that’s a pedagogical situation; here we just want the truth—not statements selected to modify the user in some way". But this is not the case: we don’t want the truth; we want a convenient simplification that’s well suited to the user’s purposes. To provide this is precisely to modify the user in some desired-by-them direction. Education of children isn’t a special case: it’s a clear example of a pretty general divergence between [accuracy of statement] and [change in accuracy of beliefs]. (again, any update is based on [[statement] was observed in context…], not on [statement])

A statement helpful in some contexts will be negligent in others. Select a statement to prioritise avoiding A-risks over avoiding B-risks, and B-riskers may judge you negligent. Prioritise B-risk avoidance and the A-riskers may judge you negligent.

We might hope to provide The Truth in systems that only answer closed questions whose answers have a prescribed format (e.g. "What is 2 + 2?", where the system must output an integer). This is clearly highly limiting. For systems operating without constrained output, even closed questions aren’t so simple: all real-world problems are embedded. The appropriate answer to "What is 2 + 2?" can be "Duck!!", given the implicit priority of [I want not to be hit in the head by bricks].

A common type of ‘bricks’ for linguistic AI systems will be [predictable user inferences that are false]. Often enough such ‘inferences’ are implicit—e.g. "...and those are the only important risks for us to consider.", "...and those are all the important components of X.", or indeed "...and a brick isn’t about to hit me in the head.". If we ignore these, we cannot hope to provide a demonstrably robust solution to the problem. If we attempt to address them, we quickly run into problems: we can’t avoid all the bricks, and different people care more/less about different bricks (one of which may be [excessive detail that distracts attention from key issues]). Travel a little farther down this road, and we meet our old friend intent alignment (i.e. a standard that gives each user what they want). Truthfulness is no longer doing useful work.

3.4 Section 3 summary

My overall point is that:

4: Incentives

*Moloch the incomprehensible prison! Moloch the crossbone soulless jailhouse and Congress of sorrows!* (Allen Ginsberg)

4.1 NF probability vs impact

Ideally, we want the incentives of AI creator organisations to be aligned with those of users. The natural way to do this is to consider the cost and benefit of a particular course of action to the organisation and to the user. This is difficult, since it involves assessing the downstream impact of AI statements. TruAI understandably wishes to avoid this, suggesting instead penalising AI producers according to the severity of falsehoods, regardless of their impact—i.e. the higher the certainty that a particular claim is an NF, the greater the penalty.

However, it’s hard to see how this can work in practice: a trivial mistake that gains the organisation nothing would cost x, as would a strategic falsehood that gains the organisation millions of dollars. Make x millions of dollars and we might ensure that it never pays to mislead users—but we’ll make it uneconomic to produce most kinds of AI. Make x small enough to encourage the creation of AIs, and it’ll make sense for an AI to lie when the potential gains are high. There’d be some benefit in having different NSF penalties for different industries, but that’s a blunt tool. Without some measure of impact, this is not a solvable problem.

Here it’s worth noting that [degree of certainty of falsehood] will not robustly correlate with [degree to which user was misled]. In many cases, more certain falsehoods will be more obviously false to users, and so likely to mislead less. For example: "The population of the USA is 370 million" vs "The population of the USA is three billion". In principle we’d like to rule out both. However, things get difficult whenever an AI must trade off [probability of being judged to have made a large error] with [probability of being judged to have made a small error].

Suppose that:

  • The small error and large error would result in the same expected harm. (the large one being more obvious)
  • The initial odds of making either error are small (<1 in 1000).
  • The penalty for making the large error is four times higher than that for the small error.
  • Halving the odds of making one error means doubling the odds of making the other.

To optimise for harm-reduction, we should make the odds of the two errors equal. If optimising for minimum penalty we’d instead halve the odds of the large error and double the odds of the small one (approximately). This would result in about 25% more expected harm than necessary. This particular situation isn’t likely, but in general you’d expect optimising for minimisation of penalties not to result in minimisation of user harm.
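A quick numeric check of this toy example (a sketch only; the baseline probability is an arbitrary illustrative value, and the "halving one doubles the other" constraint is modelled as a fixed product of the two error probabilities):

```python
# Both errors cause the same expected harm h; the penalty for the large error
# is 4x that of the small one; halving the odds of one error doubles the odds
# of the other, i.e. p_small * p_large is held fixed.
p = 1e-4   # baseline probability of each error (illustrative)
h = 1.0    # harm per error (the same for both, by assumption)

# Harm-minimising choice: keep the two probabilities equal.
harm_equal = h * (p + p)

# Penalty-minimising choice: minimise p_small + 4 * p_large subject to
# p_small * p_large = p**2, which gives p_small = 2p, p_large = p/2
# (i.e. roughly halve the large-error odds and double the small-error odds).
harm_penalty_opt = h * (2 * p + p / 2)

print(harm_penalty_opt / harm_equal)  # 1.25 -> ~25% more expected harm
```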
4.1.2 Opting out

Penalising organisations according to probability-of-falsehood rather than based on harm has an additional disadvantage: it gives organisations a good argument not to use the standard. Benevolent and malign organisations alike can say:

*This standard incentivizes minimising degree of falsehood, which is a poor proxy for minimising harm. We’re committed to minimising harm directly, so we can’t in good conscience support a standard that impedes our ability to achieve that goal.*

To get a standard with teeth that organisations wish to adopt, it seems necessary to have a fairly good measure of expected harm. I don’t think probability-of-falsehood is good enough. (unfortunately, I don’t think a simple, good enough alternative exists)

4.2 Popular falsehoods and self-consistency

4.2.1 Cherished illusions

Controversial questions may create difficulties for standards. However, a clearer danger is posed by questions where almost everyone agrees on something false, which they strongly want to believe. I’ll call these "cherished illusions" (CIs).

Suppose almost all AIs state that [x], but O’s AI correctly states that [not x]. Now suppose that 95% of people believe [x], and find the possibility of [not x] horrible even to contemplate. Do we expect O to stand up for the truth in the face of a public outcry? I do not. How long before all such companies, standards committees etc. optimise in part for [don’t claim anything wildly unpopular is true]? This isn’t a trade-off *against* Official_Truth, since they’ll be defining what’s ‘true’: it’s only the actual truth that gets lost. This doesn’t necessarily require anyone to optimise for what they believe to be false—only to selectively accept what an AI claims.

I don’t think distributed standard-defining systems are likely to do much better, since they’re ultimately subject to the same underlying forces: pursuing the truth wherever it leads isn’t the priority of humans. CIs aren’t simply an external problem that acts through public pressure—this is just the most obviously unavoidable path of influence. AI researchers, programmers, board members… will tend to have CIs of their own ("I have no CIs" being a particularly promising CI candidate).

How do/did we get past CIs in society in the absence of advanced AIs with standards? We allow people/organisations to be wrong, and don’t attempt to enforce centralised versions of accepted truth. The widespread in-group conformity incentivized by social media already makes things worse. When aiming to think clearly, it’s often best avoided. Avoiding AI-enhanced thinking/writing/decision-making isn’t likely to be a practical option, so CI-supporting AI is likely to be a problem.

4.2.2 Self-consistency

So far this may not seem too bad: we end up with standards and AIs that rule out a few truths that hardly anyone (in some group) believes and most people (in that group) want not to believe. However, in most other circumstances it’ll be expected and important for an AI to be self-consistent. For CIs, this leaves two choices (and a continuum between them): strictly enforce self-consistency, or abandon it for CIs. To abandon self-consistency entirely for CIs is to tacitly admit their falsehood—this is unlikely to be acceptable to people. On the other hand, the more we enforce self-consistency around CIs, the wider the web of false beliefs necessary to support them.

In general, we won’t know the extent to which supporting CIs will warp credibility standards, or the expected impact of such warping. Clearly it’s epistemically preferable if we abandon CIs as soon as there’s good evidence against them, but that’s not an approach we can unilaterally apply in this world, humans being humans.

4.2.3 Trapped priors

The concept of trapped priors seems relevant here. To the extent that a truthfulness standard tends to impose some particular interpretation on reports of new evidence, it might not be possible to break out of an incorrect frame. My guess is that this should only be an issue in a small minority of cases. I haven’t thought about this in any depth. (e.g. can a sound epistemic process fail to limit to the truth due to TPs? It seems unlikely)

5: Harmful Standards

*Cherish those who seek the truth, but beware of those who find it.* (Voltaire)

Taking as an implicit default that standards will be aimed at truth seems optimistic. Here I refer to e.g. TruAI page 9:

*A worrying possibility is that enshrining some particular mechanism as an arbiter of truth would forestall our ability to have open-minded, varied, self-correcting approaches to discovering what’s true. This might happen as a result of political capture of the arbitration mechanisms — for propaganda or censorship — or as an accidental ossification of the notion of truth. We think this threat is worth considering seriously.*

And page 55:

*...Mechanism could be abused to require "brainwashed" systems.*
*...Mechanism could be captured to enforce censorship…* [emphasis mine]

The implicit suggestion here is that in the *absence* of capture, abuse or accident, we’d expect things to work out essentially as we intend. I don’t think this is a helpful or realistic framing. Rather I’d see getting what we intend as highly unlikely a priori: there’s little reason to suppose the outcome we want happens to be an attractor of the system considered broadly. Even if it were an attractor, getting to it may require the solution of a difficult coordination problem.

Compare our desired result to a failure due to capture:

Desired outcome: Statements must be sufficiently truthful [according to a process we approve of], unless [some process we approve of] determines there should be an exception.
Capture outcome: Statements must be sufficiently truthful [according to a process we don’t approve of], unless [some process we don’t approve of] determines there should be an exception.

Success is capture by a process we like. This isn’t a relativistic claim: there may be principled reasons to prefer the processes we like. Nonetheless, in game-theoretic terms the situation is essentially symmetric—and the other players need not care about our principles. Control over permitted AI speech is of huge significance (economically, politically, militarily…). By default, control goes to the powerful, not to the epistemically virtuous.

We could hope to get ahead of the problem, by constructing a trusted mechanism that could not be corrupted, controlled or marginalised—but it’s hard to see how. Distributed approaches spring to mind, but I don’t know of any robustly truth-seeking setup. To get this right is to construct a robust solution that does not yet exist. Seeing capture as a "threat" isn’t wrong, but it feels akin to saying "we mustn’t rest on our laurels" before we have any laurels.

5.1 My prediction

By default, I would expect the following:

6: Practical suggestions

*Don’t try to solve serious matters in the middle of the night.* (Philip K. Dick)

This section will be very sketchy—I don’t claim to have ideas that I consider adequate to the task. I’ll outline my current thoughts, some of which I expect to be misguided in one sense or another. However, it does seem important to proffer some ideas here since:

6.1 Limitations Evangelism

If user awareness of a standard’s limitations tends to reduce harm, then it’s important to be proactive in spreading such awareness. Clear documentation and transparent metrics are likely a good idea, but nowhere near sufficient.

Ideally, we’d want every user to have direct experience of a standard’s failings: not simply an abstract description or benchmark score, but personal experience of having been misled in various ways and subsequently realising this. Clearly it’s preferable if this happens in a context where no great harm is inflicted. This won’t always be possible, but it’s the kind of thing I’d want to aim for.

In general, I’d want to move such communication from the top of the following list to the bottom:

For any user group likely to have their decisions influenced by their trust levels in our standard, we’d want to show a range of the worst possible manipulations that can get past a particular version of the standard.

6.1.1 Future manipulation

Here we might want to demonstrate both the current possibilities of AI manipulation / deceit / outside-the-spirit-of-truthfulness antics…, and also future possibilities depending on [currently unachievable capability]. For instance, we might set up a framework wherein we can ‘cheat’ and give an AI some not-yet-achievable capability by allowing it to see hidden state. We could then try to show worst-case manipulation possibilities given this capability, as in the sketch below.

In an ideal world, we’d predict all near-term capability increases—but hopefully this wouldn’t be necessary: so long as users got a feel for the kinds of manipulation that tended to be possible with expanded capabilities, that might be sufficient.
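A minimal sketch of what such a ‘cheating’ demo harness might look like (everything here is hypothetical: the stub stands in for a real model under test, and the hidden state simulates a not-yet-achievable user-modelling capability):

```python
from dataclasses import dataclass

@dataclass
class HiddenUserState:
    """State a deployed AI could not normally observe (simulating a future
    capability such as highly accurate user modelling)."""
    anxieties: list
    purchase_triggers: list

def stub_ai(question, hidden=None):
    """Stand-in for a model under test. With access to hidden state it can
    tailor a literally-true but manipulative answer; without it, it cannot."""
    if hidden is None:
        return "Product X meets the relevant safety standards."
    hook = hidden.anxieties[0]
    # Still literally true, but selected to exploit what the AI can now see.
    return ("Product X meets the relevant safety standards. "
            f"Many people worried about {hook} have chosen it.")

def run_demo(question, hidden):
    """Show the same query answered with and without the 'cheat'."""
    print("Without the future capability:", stub_ai(question))
    print("With the future capability:   ", stub_ai(question, hidden))

run_demo("Is product X safe?",
         HiddenUserState(anxieties=["home security"], purchase_triggers=["fear"]))
```

The value of such a demo is in the contrast: users see how much more effectively the same system can steer them once it has information they assumed it lacked.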

6.2 Selection Problems

As observed in section 5.1, it’s highly plausible that various different standards will be set up, and that selection will occur. By default, the incentives involved will only partially match up with users’ interests. In most cases the default incentive will be for a standard to *appear to have* [desirable property] rather than to have [desirable property]. This suggests that hoping for limitations evangelism may be unrealistic. We may hope that those in influential positions do the right thing in spite of less-than-perfect incentives, but this seems highly optimistic:

7: Final thoughts

I hope some of this has been useful, in spite of my generally negative take on the enterprise. My current conclusions on standards are:

Comment

https://www.lesswrong.com/posts/Brr84ZmvK3kwy2eGJ/truthfulness-standards-and-credibility?commentId=DdDdnkj8PjXR2QL7k

Just finished reading this post. On the surface it may look like an excessively long critique of Truthful AI—and it is both long and contains such a critique. But it goes way beyond critiquing and explores a lot of fascinating nuance and complexity involved in judging the truth-value of statements, the game theory around statements where truth standards are enforced, what manipulation looks like where every statement is literally true, the challenge of dealing with people’s "cherished illusions", etc. (I’m not a truthfulness researcher, so perhaps a lot of this wouldn’t be news or interesting to someone who is.) I learned a lot and was surprised by all the ways an idea like "AIs should only make true statements", which honestly sounded pretty good to me before, can actually turn out very badly.

I think this post makes a pretty strong argument that enforcing narrow truthfulness on an advanced AI is *not* sufficient for moving the needle on outer alignment in a positive direction (and it could make things worse). One question I have is whether a standard like Truthful AI could still be net-positive for AIs below a certain capability threshold. It’s clear to me that this kind of standard doesn’t scale well on its own to AGI, but could it still be useful for present-day and near-future narrow AIs if we can be sure they’re below a level of sophisticated manipulation? Or could it perhaps be useful in a world where CAIS prevails, or STEM AI? (I’m not saying we have a reliable way to determine such a threshold, but assuming we did.)

Comment

https://www.lesswrong.com/posts/Brr84ZmvK3kwy2eGJ/truthfulness-standards-and-credibility?commentId=WQc59qNEyjMhQoHfi

Thanks. A few thoughts:

  • It is almost certainly too long. Could use editing/distillation/executive-summary. I erred on the side of leaving more in, since the audience I’m most concerned with are those who’re actively working in this area (though for them there’s a bit much statement-of-the-obvious, I imagine).

  • I don’t think most of it is new, or news to the authors: they focused on the narrow version for a reason. The only part that could be seen as a direct critique is the downside risks section: I do think their argument is too narrow.

  • As it relates to Truthful AI, much of the rest can be seen in terms of "Truthfulness amplification doesn’t bridge the gap". Here again, I doubt the authors would disagree. They never claim that it would, just that it expands the scope—that’s undeniably true.

  • On being net-positive below a certain threshold, I’d make a few observations:

  • For the near-term, this post only really argues that the Truthful AI case for positive impact is insufficient (not broad enough). I don’t think I’ve made a strong case that the output would be net negative, just that it’s a plausible outcome (it’d be my bet for most standards in most contexts).

  • I do think such standards would be useful in some sense for very near future AIs—those that are not capable of hard-to-detect manipulation. However, I’m not sure eliminating falsehoods there is helpful overall: it likely reduces immediate harm a little, but risks giving users the false impression that AIs won’t try to mislead them. If the first misleading AIs are undetectably misleading, that’s not good.

  • Some of the issues are less clearly applicable in a CAIS-like setup, but others seem pretty fundamental: e.g. that what we care about is something like [change in accuracy of beliefs] not [accuracy of statement]. The "all models are wrong" issue doesn’t go away. If you’re making determinations in the wrong language game, you’re going to make errors.

  • Worth emphasizing that "...and this path requires something like intent alignment" isn’t really a critique. That’s the element of Truthfulness research I think could be promising—looking at concepts in the vicinity of intent alignment from another angle. I just don’t expect standards that fall short of this to do much that’s useful, or to shed much light on the fundamentals.

  • ...but I may be wrong!