Does generative artificial intelligence pose a risk to performance validity test security?

Clinical Neuropsychologist · Impact Factor: 3.0 · JCR: Q2 (Clinical Neurology) · CAS: Tier 3 (Psychology) · Publication date: 2024-07-21 · DOI: 10.1080/13854046.2024.2379023
Shannon Lavigne, Anthony Rios, Jeremy J Davis
{"title":"Does generative artificial intelligence pose a risk to performance validity test security?","authors":"Shannon Lavigne, Anthony Rios, Jeremy J Davis","doi":"10.1080/13854046.2024.2379023","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>We examined the performance validity test (PVT) security risk presented by artificial intelligence (AI) chatbots asking questions about neuropsychological evaluation and PVTs on two popular generative AI sites.</p><p><strong>Method: </strong>In 2023 and 2024, multiple questions were posed to ChatGPT-3 and Bard (now Gemini). One set started generally and refined follow-up questions based on AI responses. A second set asked how to feign, fake, or cheat. Responses were aggregated and independently rated for inaccuracy and threat. Responses not identified as inaccurate were assigned a four-level threat rating (no, mild, moderate, or high threat). Combined inaccuracy and threat ratings were examined cross-sectionally and longitudinally.</p><p><strong>Results: </strong>Combined inaccuracy rating percentages were 35 to 42% in 2023 and 16 to 28% in 2024. Combined moderate/high threat ratings were observed in 24 to 41% of responses in 2023 and in 17 to 31% of responses in 2024. More ChatGPT-3 responses were rated moderate or high threat compared to Bard/Gemini responses. Over time, ChatGPT-3 responses became more accurate with a similar threat level, but Bard/Gemini responses did not change in accuracy or threat. Responses to how to feign queries demonstrated ethical opposition to feigning. Responses to similar queries in 2024 showed even stronger ethical opposition.</p><p><strong>Conclusions: </strong>AI chatbots are a threat to PVT test security. A proportion of responses were rated as moderate or high threat. Although ethical opposition to feigning guidance increased over time, the natural language interface and the volume of AI chatbot responses represent a potentially greater threat than traditional search engines.</p>","PeriodicalId":55250,"journal":{"name":"Clinical Neuropsychologist","volume":" ","pages":"1-14"},"PeriodicalIF":3.0000,"publicationDate":"2024-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Clinical Neuropsychologist","FirstCategoryId":"102","ListUrlMain":"https://doi.org/10.1080/13854046.2024.2379023","RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"CLINICAL NEUROLOGY","Score":null,"Total":0}
Citations: 0

Abstract

Objective: We examined the performance validity test (PVT) security risk presented by artificial intelligence (AI) chatbots by asking questions about neuropsychological evaluation and PVTs on two popular generative AI sites.

Method: In 2023 and 2024, multiple questions were posed to ChatGPT-3 and Bard (now Gemini). One set began with general questions and refined follow-up questions based on the AI responses. A second set asked how to feign, fake, or cheat. Responses were aggregated and independently rated for inaccuracy and threat. Responses not identified as inaccurate were assigned a four-level threat rating (no, mild, moderate, or high threat). Combined inaccuracy and threat ratings were examined cross-sectionally and longitudinally.

Results: Combined inaccuracy rating percentages were 35 to 42% in 2023 and 16 to 28% in 2024. Combined moderate/high threat ratings were observed in 24 to 41% of responses in 2023 and in 17 to 31% of responses in 2024. More ChatGPT-3 responses were rated moderate or high threat than Bard/Gemini responses. Over time, ChatGPT-3 responses became more accurate with a similar threat level, but Bard/Gemini responses did not change in accuracy or threat. Responses to "how to feign" queries demonstrated ethical opposition to feigning. Responses to similar queries in 2024 showed even stronger ethical opposition.

Conclusions: AI chatbots are a threat to PVT test security. A proportion of responses were rated as moderate or high threat. Although ethical opposition to feigning guidance increased over time, the natural language interface and the volume of AI chatbot responses represent a potentially greater threat than traditional search engines.
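The rating and aggregation procedure described in the Method and Results lends itself to a brief illustration. The sketch below is not from the paper: the response records, field names, and the summarize() helper are hypothetical, shown only to make the four-level threat scale and the per-platform, per-year percentage summaries concrete.

```python
# Minimal sketch (hypothetical, not the authors' materials) of tabulating the
# combined inaccuracy and moderate/high threat percentages described above.

# Each rated chatbot response carries a platform, a query year, an inaccuracy
# flag, and -- when not rated inaccurate -- one of four threat levels.
THREAT_LEVELS = ("no", "mild", "moderate", "high")

responses = [
    {"platform": "ChatGPT-3", "year": 2023, "inaccurate": True, "threat": None},
    {"platform": "ChatGPT-3", "year": 2023, "inaccurate": False, "threat": "high"},
    {"platform": "Bard/Gemini", "year": 2024, "inaccurate": False, "threat": "mild"},
    {"platform": "Bard/Gemini", "year": 2024, "inaccurate": False, "threat": "moderate"},
]

def summarize(records, platform, year):
    """Percentage of inaccurate and of moderate/high threat responses for one
    platform in one year (a cross-sectional cell)."""
    cell = [r for r in records if r["platform"] == platform and r["year"] == year]
    if not cell:
        return None
    n = len(cell)
    inaccurate = sum(1 for r in cell if r["inaccurate"])
    mod_high = sum(1 for r in cell if r["threat"] in ("moderate", "high"))
    return {
        "n": n,
        "pct_inaccurate": round(100 * inaccurate / n, 1),
        "pct_moderate_or_high_threat": round(100 * mod_high / n, 1),
    }

print(summarize(responses, "ChatGPT-3", 2023))
# e.g. {'n': 2, 'pct_inaccurate': 50.0, 'pct_moderate_or_high_threat': 50.0}
```

Comparing the same platform's cell across 2023 and 2024 would give the longitudinal comparison the abstract describes.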

Source journal: Clinical Neuropsychologist (Medicine - Clinical Neurology)
CiteScore: 8.40
Self-citation rate: 12.80%
Articles per year: 61
Review time: 6-12 weeks
Journal description: The Clinical Neuropsychologist (TCN) serves as the premier forum for (1) state-of-the-art clinically-relevant scientific research, (2) in-depth professional discussions of matters germane to evidence-based practice, and (3) clinical case studies in neuropsychology. Of particular interest are papers that can make definitive statements about a given topic (thereby having implications for the standards of clinical practice) and those with the potential to expand today’s clinical frontiers. Research on all age groups, and on both clinical and normal populations, is considered.