This study evaluates the effectiveness of ChatGPT-4o in scoring essays and short-form constructed responses relative to human raters and traditional machine learning models. Drawing on data from the Automated Student Assessment Prize (ASAP), the study assessed ChatGPT's performance across multiple predictive models, including linear regression, random forest, gradient boosting, and XGBoost. Results indicate that although ChatGPT's gradient boosting model achieved quadratic weighted kappa (QWK) scores close to those of human raters on some datasets, overall performance remained inconsistent, particularly for short-form responses. The study highlights key challenges, including variability in scoring accuracy, potential biases, and difficulty aligning ChatGPT's predictions with human scoring standards. While ChatGPT demonstrated efficiency and scalability, its leniency and variability suggest that it should not yet replace human raters in high-stakes assessments; instead, a hybrid approach that combines AI with empirical scoring models may improve reliability. Future research should focus on refining AI-driven scoring through enhanced fine-tuning, bias mitigation, and validation on broader datasets. Ethical considerations, including fairness in automated scoring and data security, must also be addressed. The study concludes that ChatGPT holds promise as a supplementary tool in educational assessment but requires further development to ensure validity and fairness.
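For readers unfamiliar with the agreement metric named above, quadratic weighted kappa (QWK) penalizes disagreements between two raters in proportion to the squared distance between their scores. A minimal sketch of the standard computation follows; the score lists are illustrative placeholders, not values from the ASAP dataset or this study.

```python
def quadratic_weighted_kappa(a, b, min_rating=0, max_rating=4):
    """QWK between two lists of integer scores on the same rating scale."""
    n = max_rating - min_rating + 1
    # Observed agreement matrix: O[i][j] counts pairs scored (i, j).
    O = [[0.0] * n for _ in range(n)]
    for x, y in zip(a, b):
        O[x - min_rating][y - min_rating] += 1
    total = len(a)
    # Marginal histograms for each rater.
    hist_a = [sum(row) for row in O]
    hist_b = [sum(O[i][j] for i in range(n)) for j in range(n)]
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            # Quadratic disagreement weight, 0 on the diagonal.
            w = ((i - j) ** 2) / ((n - 1) ** 2)
            # Expected count under chance agreement (outer product of marginals).
            E = hist_a[i] * hist_b[j] / total
            num += w * O[i][j]
            den += w * E
    return 1.0 - num / den

# Illustrative example: a model that mostly agrees with a human rater.
human = [2, 3, 4, 2, 1, 3, 4, 0]
model = [2, 3, 3, 2, 2, 3, 4, 1]
qwk = quadratic_weighted_kappa(human, model)
```

Perfect agreement yields a QWK of 1.0, chance-level agreement yields 0.0, and adjacent-score disagreements are penalized far less than distant ones, which is why QWK is the conventional metric for ordinal essay scores.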