Evaluation of a Computer-Based Morphological Analysis Method for Free-Text Responses in the General Medicine In-Training Examination: Algorithm Validation Study

Daiki Yokokawa, Kiyoshi Shikino, Yuji Nishizaki, Sho Fukui, Yasuharu Tokuda
JMIR Medical Education, Vol 10, e52068. Published 2024-12-05. DOI: 10.2196/52068
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11637224/pdf/
Background: The General Medicine In-Training Examination (GM-ITE) tests clinical knowledge during the 2-year postgraduate residency program in Japan. In the 2021 academic year, the GM-ITE's medical safety domain included questions that assessed diagnosis from the medical history and physical findings presented in a video, as well as case presentation skills. Examinees watched a video or audio recording of a patient examination and provided free-text responses. However, the human cost of scoring free-text answers may limit the implementation of the GM-ITE. A simple morphological analysis and word-matching model could therefore be used to score free-text responses.
Objective: This study aimed to compare human versus computer scoring of free-text responses and qualitatively evaluate the discrepancies between human- and machine-generated scores to assess the efficacy of machine scoring.
Methods: After obtaining consent for participation in the study, the authors used text data from residents who voluntarily answered the GM-ITE video-based questions, which reproduced an encounter with a simulated patient. The questions simulated a consultation in the emergency room with a patient ultimately diagnosed with pulmonary embolism following a fracture. Residents provided statements for the case presentation. In 2022, we obtained human-generated scores by collating the results of 2 independent scorers, and machine-generated scores by segmenting the free-text responses into word sequences through morphological analysis and matching them against a prepared list of correct answers.
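The scoring pipeline described above can be sketched in a few lines. This is a minimal illustration, not the study's implementation: the actual work segmented Japanese responses with a morphological analyzer, whereas this sketch substitutes simple regex tokenization, and the `answer_key` contents are hypothetical examples for the pulmonary-embolism case.

```python
import re

def tokenize(text):
    # Simplified stand-in for morphological analysis: lowercase the
    # response and split on non-word characters. A real Japanese
    # pipeline would use a morphological analyzer here instead.
    return [t for t in re.split(r"\W+", text.lower()) if t]

def machine_score(response, answer_key):
    # answer_key: one set of acceptable words per scoring item.
    # Award one point for each item whose word list overlaps
    # the tokens extracted from the free-text response.
    tokens = set(tokenize(response))
    return sum(1 for accepted_words in answer_key if tokens & accepted_words)

# Hypothetical correct-answer list for the case presentation.
answer_key = [
    {"pulmonary"},
    {"embolism", "embolus"},
    {"fracture"},
]

print(machine_score("Suspected pulmonary embolism after a leg fracture.", answer_key))
```

Calibration, as the Conclusions note, amounts to maintaining the word lists: adding accepted synonyms and spelling variants to each item's set when human-machine discrepancies are reviewed.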
Results: Of the 104 responses collected (63 from postgraduate year 1 residents and 41 from postgraduate year 2 residents), 39 cases remained for the final analysis after excluding invalid responses. The authors found discrepancies between human and machine scoring in 14 questions (7.2%); some were due to shortcomings in machine scoring that could be resolved by maintaining the list of correct words and dictionaries, whereas others were due to human error.
Conclusions: Machine scoring is comparable to human scoring. It requires only a simple program and some calibration, and it can potentially reduce the cost of scoring free-text responses.