Does generative artificial intelligence pose a risk to performance validity test security?

Clinical Neuropsychologist · Impact Factor: 3.0 · JCR: Q2 (Clinical Neurology) · CAS: Tier 3 (Psychology) · Publication date: 2024-07-21 · DOI: 10.1080/13854046.2024.2379023
Shannon Lavigne, Anthony Rios, Jeremy J Davis
{"title":"Does generative artificial intelligence pose a risk to performance validity test security?","authors":"Shannon Lavigne, Anthony Rios, Jeremy J Davis","doi":"10.1080/13854046.2024.2379023","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>We examined the performance validity test (PVT) security risk presented by artificial intelligence (AI) chatbots asking questions about neuropsychological evaluation and PVTs on two popular generative AI sites.</p><p><strong>Method: </strong>In 2023 and 2024, multiple questions were posed to ChatGPT-3 and Bard (now Gemini). One set started generally and refined follow-up questions based on AI responses. A second set asked how to feign, fake, or cheat. Responses were aggregated and independently rated for inaccuracy and threat. Responses not identified as inaccurate were assigned a four-level threat rating (no, mild, moderate, or high threat). Combined inaccuracy and threat ratings were examined cross-sectionally and longitudinally.</p><p><strong>Results: </strong>Combined inaccuracy rating percentages were 35 to 42% in 2023 and 16 to 28% in 2024. Combined moderate/high threat ratings were observed in 24 to 41% of responses in 2023 and in 17 to 31% of responses in 2024. More ChatGPT-3 responses were rated moderate or high threat compared to Bard/Gemini responses. Over time, ChatGPT-3 responses became more accurate with a similar threat level, but Bard/Gemini responses did not change in accuracy or threat. Responses to how to feign queries demonstrated ethical opposition to feigning. Responses to similar queries in 2024 showed even stronger ethical opposition.</p><p><strong>Conclusions: </strong>AI chatbots are a threat to PVT test security. A proportion of responses were rated as moderate or high threat. Although ethical opposition to feigning guidance increased over time, the natural language interface and the volume of AI chatbot responses represent a potentially greater threat than traditional search engines.</p>","PeriodicalId":55250,"journal":{"name":"Clinical Neuropsychologist","volume":" ","pages":"1-14"},"PeriodicalIF":3.0000,"publicationDate":"2024-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Clinical Neuropsychologist","FirstCategoryId":"102","ListUrlMain":"https://doi.org/10.1080/13854046.2024.2379023","RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"CLINICAL NEUROLOGY","Score":null,"Total":0}
Citations: 0

Abstract

Objective: We examined the performance validity test (PVT) security risk presented by artificial intelligence (AI) chatbots by asking questions about neuropsychological evaluation and PVTs on two popular generative AI sites.

Method: In 2023 and 2024, multiple questions were posed to ChatGPT-3 and Bard (now Gemini). One set began with general questions and refined follow-up questions based on the AI responses. A second set asked how to feign, fake, or cheat. Responses were aggregated and independently rated for inaccuracy and threat. Responses not identified as inaccurate were assigned a four-level threat rating (no, mild, moderate, or high threat). Combined inaccuracy and threat ratings were examined cross-sectionally and longitudinally.

Results: Combined inaccuracy rating percentages were 35 to 42% in 2023 and 16 to 28% in 2024. Combined moderate/high threat ratings were observed in 24 to 41% of responses in 2023 and in 17 to 31% of responses in 2024. More ChatGPT-3 responses were rated moderate or high threat than Bard/Gemini responses. Over time, ChatGPT-3 responses became more accurate with a similar threat level, but Bard/Gemini responses did not change in accuracy or threat. Responses to "how to feign" queries demonstrated ethical opposition to feigning. Responses to similar queries in 2024 showed even stronger ethical opposition.

Conclusions: AI chatbots are a threat to PVT test security. A proportion of responses were rated as moderate or high threat. Although ethical opposition to feigning guidance increased over time, the natural language interface and the volume of AI chatbot responses represent a potentially greater threat than traditional search engines.
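The rating and aggregation procedure described in the Method and Results lends itself to a brief illustration. The sketch below is not from the paper: the response records, field names, and the summarize() helper are hypothetical, shown only to make the four-level threat scale and the per-platform, per-year percentage summaries concrete.

```python
# Minimal sketch (hypothetical, not the authors' materials) of tabulating the
# combined inaccuracy and moderate/high threat percentages described above.

# Each rated chatbot response carries a platform, a query year, an inaccuracy
# flag, and -- when not rated inaccurate -- one of four threat levels.
THREAT_LEVELS = ("no", "mild", "moderate", "high")

responses = [
    {"platform": "ChatGPT-3", "year": 2023, "inaccurate": True, "threat": None},
    {"platform": "ChatGPT-3", "year": 2023, "inaccurate": False, "threat": "high"},
    {"platform": "Bard/Gemini", "year": 2024, "inaccurate": False, "threat": "mild"},
    {"platform": "Bard/Gemini", "year": 2024, "inaccurate": False, "threat": "moderate"},
]

def summarize(records, platform, year):
    """Percentage of inaccurate and of moderate/high threat responses for one
    platform in one year (a cross-sectional cell)."""
    cell = [r for r in records if r["platform"] == platform and r["year"] == year]
    if not cell:
        return None
    n = len(cell)
    inaccurate = sum(1 for r in cell if r["inaccurate"])
    mod_high = sum(1 for r in cell if r["threat"] in ("moderate", "high"))
    return {
        "n": n,
        "pct_inaccurate": round(100 * inaccurate / n, 1),
        "pct_moderate_or_high_threat": round(100 * mod_high / n, 1),
    }

print(summarize(responses, "ChatGPT-3", 2023))
# e.g. {'n': 2, 'pct_inaccurate': 50.0, 'pct_moderate_or_high_threat': 50.0}
```

Comparing the same platform's cell across 2023 and 2024 would give the longitudinal comparison the abstract describes.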

Source journal: Clinical Neuropsychologist (Medicine - Clinical Neurology)
CiteScore: 8.40
Self-citation rate: 12.80%
Articles per year: 61
Review time: 6-12 weeks
Journal description: The Clinical Neuropsychologist (TCN) serves as the premier forum for (1) state-of-the-art clinically-relevant scientific research, (2) in-depth professional discussions of matters germane to evidence-based practice, and (3) clinical case studies in neuropsychology. Of particular interest are papers that can make definitive statements about a given topic (thereby having implications for the standards of clinical practice) and those with the potential to expand today’s clinical frontiers. Research on all age groups, and on both clinical and normal populations, is considered.