{"title":"All Your Base Are Belong to Us: The Urgent Reality of Unproctored Testing in the Age of LLMs","authors":"Louis Hickman","doi":"10.1111/ijsa.70005","DOIUrl":null,"url":null,"abstract":"<p>The release of new generative artificial intelligence (AI) tools, including new large language models (LLMs), continues at a rapid pace. Upon the release of OpenAI's new o1 models, I reconducted Hickman et al.'s (2024) analyses examining how well LLMs perform on a quantitative ability (number series) test. GPT-4 scored below the 20th percentile (compared to thousands of human test takers), but o1 scored at the 95th percentile. In response to these updated findings and Lievens and Dunlop's (2025) article about the effects of LLMs on the validity of pre-employment assessments, I make an urgent call to action for selection and assessment researchers and practitioners. A recent survey suggests that a large proportion of applicants are already using generative AI tools to complete high-stakes assessments, and it seems that no current assessments will be safe for long. Thus, I offer possibilities for the future of testing, detail their benefits and drawbacks, and provide recommendations. These possibilities are: increased use of proctoring, adding strict time limits, using LLM detection software, using think-aloud (or similar) protocols, collecting and analyzing trace data, emphasizing samples over signs, and redesigning assessments to allow LLM use during completion. Several of these possibilities inspire future research to modernize assessment. Future research should seek to improve our understanding of how to design valid assessments that allow LLM use, how to effectively use trace test-taker data, and whether think-aloud protocols can help differentiate experts and novices.</p>","PeriodicalId":51465,"journal":{"name":"International Journal of Selection and Assessment","volume":"33 2","pages":""},"PeriodicalIF":2.6000,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/ijsa.70005","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Selection and Assessment","FirstCategoryId":"91","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/ijsa.70005","RegionNum":4,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"MANAGEMENT","Score":null,"Total":0}
引用次数: 0
Abstract
The release of new generative artificial intelligence (AI) tools, including new large language models (LLMs), continues at a rapid pace. Upon the release of OpenAI's new o1 models, I reconducted Hickman et al.'s (2024) analyses examining how well LLMs perform on a quantitative ability (number series) test. GPT-4 scored below the 20th percentile (compared to thousands of human test takers), but o1 scored at the 95th percentile. In response to these updated findings and Lievens and Dunlop's (2025) article about the effects of LLMs on the validity of pre-employment assessments, I make an urgent call to action for selection and assessment researchers and practitioners. A recent survey suggests that a large proportion of applicants are already using generative AI tools to complete high-stakes assessments, and it seems that no current assessments will be safe for long. Thus, I offer possibilities for the future of testing, detail their benefits and drawbacks, and provide recommendations. These possibilities are: increased use of proctoring, adding strict time limits, using LLM detection software, using think-aloud (or similar) protocols, collecting and analyzing trace data, emphasizing samples over signs, and redesigning assessments to allow LLM use during completion. Several of these possibilities inspire future research to modernize assessment. Future research should seek to improve our understanding of how to design valid assessments that allow LLM use, how to effectively use trace test-taker data, and whether think-aloud protocols can help differentiate experts and novices.
期刊介绍:
The International Journal of Selection and Assessment publishes original articles related to all aspects of personnel selection, staffing, and assessment in organizations. Using an effective combination of academic research with professional-led best practice, IJSA aims to develop new knowledge and understanding in these important areas of work psychology and contemporary workforce management.