ChatGPT for generating multiple-choice questions: Evidence on the use of artificial intelligence in automatic item generation for a rational pharmacotherapy exam.
Yavuz Selim Kıyak, Özlem Coşkun, Işıl İrem Budakoğlu, Canan Uluoğlu
{"title":"ChatGPT for generating multiple-choice questions: Evidence on the use of artificial intelligence in automatic item generation for a rational pharmacotherapy exam.","authors":"Yavuz Selim Kıyak, Özlem Coşkun, Işıl İrem Budakoğlu, Canan Uluoğlu","doi":"10.1007/s00228-024-03649-x","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Artificial intelligence, specifically large language models such as ChatGPT, offers valuable potential benefits in question (item) writing. This study aimed to determine the feasibility of generating case-based multiple-choice questions using ChatGPT in terms of item difficulty and discrimination levels.</p><p><strong>Methods: </strong>This study involved 99 fourth-year medical students who participated in a rational pharmacotherapy clerkship carried out based-on the WHO 6-Step Model. In response to a prompt that we provided, ChatGPT generated ten case-based multiple-choice questions on hypertension. Following an expert panel, two of these multiple-choice questions were incorporated into a medical school exam without making any changes in the questions. Based on the administration of the test, we evaluated their psychometric properties, including item difficulty, item discrimination (point-biserial correlation), and functionality of the options.</p><p><strong>Results: </strong>Both questions exhibited acceptable levels of point-biserial correlation, which is higher than the threshold of 0.30 (0.41 and 0.39). However, one question had three non-functional options (options chosen by fewer than 5% of the exam participants) while the other question had none.</p><p><strong>Conclusions: </strong>The findings showed that the questions can effectively differentiate between students who perform at high and low levels, which also point out the potential of ChatGPT as an artificial intelligence tool in test development. 
Future studies may use the prompt to generate items in order for enhancing the external validity of the results by gathering data from diverse institutions and settings.</p>","PeriodicalId":11857,"journal":{"name":"European Journal of Clinical Pharmacology","volume":" ","pages":"729-735"},"PeriodicalIF":2.4000,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Journal of Clinical Pharmacology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s00228-024-03649-x","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/2/14 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"PHARMACOLOGY & PHARMACY","Score":null,"Total":0}
Citations: 0
Abstract
Purpose: Artificial intelligence, specifically large language models such as ChatGPT, offers potentially valuable benefits in question (item) writing. This study aimed to determine the feasibility of generating case-based multiple-choice questions using ChatGPT in terms of item difficulty and discrimination levels.
Methods: This study involved 99 fourth-year medical students who participated in a rational pharmacotherapy clerkship carried out based on the WHO 6-Step Model. In response to a prompt that we provided, ChatGPT generated ten case-based multiple-choice questions on hypertension. Following an expert panel, two of these multiple-choice questions were incorporated into a medical school exam without any changes to the questions. Based on the administration of the test, we evaluated their psychometric properties, including item difficulty, item discrimination (point-biserial correlation), and the functionality of the options.
Results: Both questions exhibited acceptable point-biserial correlations (0.41 and 0.39), each above the 0.30 threshold. However, one question had three non-functional options (options chosen by fewer than 5% of the exam participants), while the other question had none.
Conclusions: The findings showed that the questions can effectively differentiate between students who perform at high and low levels, which also points to the potential of ChatGPT as an artificial intelligence tool in test development. Future studies may use the prompt to generate items and gather data from diverse institutions and settings, thereby enhancing the external validity of the results.
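The psychometric checks described in the abstract rest on standard formulas: item discrimination is the point-biserial correlation between a 0/1 item score and the total exam score, and an option is non-functional when fewer than 5% of examinees choose it. A minimal sketch of both computations, using hypothetical response data rather than the study's, might look like:

```python
from statistics import mean, pstdev

def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and the total exam score."""
    mx, my = mean(item_scores), mean(total_scores)
    cov = mean((x - mx) * (y - my) for x, y in zip(item_scores, total_scores))
    return cov / (pstdev(item_scores) * pstdev(total_scores))

def non_functional_options(choices, options="ABCDE", threshold=0.05):
    """Return options chosen by fewer than `threshold` of examinees."""
    n = len(choices)
    return [o for o in options if choices.count(o) / n < threshold]

# Hypothetical data: 4 examinees' scores on one item and on the whole exam.
r = point_biserial([1, 1, 0, 0], [10, 8, 6, 4])

# Hypothetical data: answer choices of 20 examinees on a five-option item.
dead = non_functional_options(["A"] * 12 + ["B"] * 5 + ["C"] * 2 + ["D"] * 1)
```

Under the study's criteria, an item would be retained when `r` exceeds 0.30, and each option returned by `non_functional_options` would be flagged for revision.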
Journal overview
The European Journal of Clinical Pharmacology publishes original papers on all aspects of clinical pharmacology and drug therapy in humans. Manuscripts are welcomed on the following topics: therapeutic trials, pharmacokinetics/pharmacodynamics, pharmacogenetics, drug metabolism, adverse drug reactions, drug interactions, all aspects of drug development, development relating to teaching in clinical pharmacology, pharmacoepidemiology, and matters relating to the rational prescribing and safe use of drugs. Methodological contributions relevant to these topics are also welcomed.
Data from animal experiments are accepted only in the context of original data in man reported in the same paper. EJCP will only consider manuscripts describing the frequency of allelic variants in different populations if this information is linked to functional data or new interesting variants. Highly relevant differences in frequency with a major impact in drug therapy for the respective population may be submitted as a letter to the editor.
Straightforward phase I pharmacokinetic or pharmacodynamic studies as parts of new drug development will only be considered for publication if the paper involves:
- a compound that is interesting and new in some basic or fundamental way, or
- methods that are original in some basic sense, or
- a highly unexpected outcome, or
- conclusions that are scientifically novel in some basic or fundamental sense.