Title: Beyond Preferences in AI Alignment
Authors: Tan Zhi-Xuan, Micah Carroll, Matija Franklin, Hal Ashton
Journal: Philosophical Studies
DOI: 10.1007/s11098-024-02249-w (https://doi.org/10.1007/s11098-024-02249-w)
Publication date: 2024-11-09
Abstract
The dominant practice of AI alignment assumes (1) that preferences are an adequate representation of human values, (2) that human rationality can be understood in terms of maximizing the satisfaction of preferences, and (3) that AI systems should be aligned with the preferences of one or more humans to ensure that they behave safely and in accordance with our values. Whether implicitly followed or explicitly endorsed, these commitments constitute what we term a preferentist approach to AI alignment. In this paper, we characterize and challenge the preferentist approach, describing conceptual and technical alternatives that are ripe for further research. We first survey the limits of rational choice theory as a descriptive model, explaining how preferences fail to capture the thick semantic content of human values, and how utility representations neglect the possible incommensurability of those values. We then critique the normativity of expected utility theory (EUT) for humans and AI, drawing upon arguments showing how rational agents need not comply with EUT, while highlighting how EUT is silent on which preferences are normatively acceptable. Finally, we argue that these limitations motivate a reframing of the targets of AI alignment: Instead of alignment with the preferences of a human user, developer, or humanity-writ-large, AI systems should be aligned with normative standards appropriate to their social roles, such as the role of a general-purpose assistant. Furthermore, these standards should be negotiated and agreed upon by all relevant stakeholders. On this alternative conception of alignment, a multiplicity of AI systems will be able to serve diverse ends, aligned with normative standards that promote mutual benefit and limit harm despite our plural and divergent values.
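Background note (not part of the article): the "utility representations" of preferences referenced in the abstract are standardly given by the von Neumann–Morgenstern theorem; the notation below is the textbook form, not the paper's own formalism. If a preference relation \(\succsim\) over lotteries satisfies completeness, transitivity, continuity, and independence, then there exists a utility function \(u\), unique up to positive affine transformation, such that

\[
  p \succsim q
  \quad\Longleftrightarrow\quad
  \sum_{x} p(x)\,u(x) \;\ge\; \sum_{x} q(x)\,u(x).
\]

The paper's critique concerns both the descriptive adequacy of such representations (e.g., their neglect of incommensurable values) and the normative force of the axioms behind them.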
Journal description
Philosophical Studies was founded in 1950 by Herbert Feigl and Wilfrid Sellars to provide a periodical dedicated to work in analytic philosophy. The journal remains devoted to the publication of papers in exclusively analytic philosophy. Papers applying formal techniques to philosophical problems are welcome. The principal aim is to publish articles that are models of clarity and precision in dealing with significant philosophical issues. It is intended that readers of the journal will be kept abreast of the central issues and problems of contemporary analytic philosophy.
Double-blind review procedure
The journal follows a double-blind reviewing procedure. Authors are therefore requested to place their name and affiliation on a separate page. Self-identifying citations and references in the article text should either be avoided or left blank when manuscripts are first submitted. Authors are responsible for reinserting self-identifying citations and references when manuscripts are prepared for final submission.