This study investigates the reliability of Large Language Models (LLMs) in evaluating the quality of written lesson plans from pre-service teachers. A total of 32 lesson plans, each ranging from 60 to 100 pages, were collected during a teacher internship program for civic education pre-service teachers. Using the ChatGPT-o1 reasoning model, we compared a human expert standard with LLM coding outcomes in a two-phase explanatory sequential mixed-methods design that combined quantitative reliability testing with a qualitative follow-up analysis to interpret inter-dimensional patterns of agreement. Quantitatively, across the six qualitative components of written lesson plans (Content Transformation, Task Creation, Adaptation, Goal Clarification, Contextualization, and Sequencing), the LLM showed moderate alignment with the human expert standard in identifying explicit instructional features (α = .689; 73.8% exact agreement). Qualitative analyses further revealed that the LLM struggled with high-inferential criteria, such as the depth of pedagogical reasoning and the coherence of instructional decisions, as it often relied on surface-level textual cues rather than deeper contextual understanding. These findings indicate that LLMs can support teacher educators and educational researchers as a design-stage screening tool, but that human judgment remains essential for interpreting complex pedagogical constructs in written lesson plans and for ensuring the ethical and pedagogical integrity of evaluation processes. We outline implications for integrating LLM-based analysis into teacher education and emphasize improved prompt design and systematic human oversight to ensure reliable use in qualitative analysis.
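For readers who want to see how agreement statistics of this kind are computed, the sketch below derives exact percent agreement and an intercoder reliability coefficient from paired human-expert and LLM codes. It is a minimal illustration, not the study's analysis code: the code vectors and category labels are invented, and it assumes nominal codes scored with Krippendorff's alpha for two coders; the paper's actual coding scheme and α variant may differ.

```python
from collections import Counter

def exact_agreement(human, llm):
    """Share of units on which both coders assigned the same category."""
    matches = sum(h == l for h, l in zip(human, llm))
    return matches / len(human)

def krippendorff_alpha_nominal(human, llm):
    """Krippendorff's alpha for two coders, nominal codes, no missing data."""
    n = 2 * len(human)  # total number of pairable values across both coders
    # Observed disagreement: each disagreeing unit contributes two ordered pairs.
    d_o = 2 * sum(h != l for h, l in zip(human, llm)) / n
    # Expected disagreement from the pooled category distribution.
    counts = Counter(human) + Counter(llm)
    d_e = sum(n_c * n_k
              for c, n_c in counts.items()
              for k, n_k in counts.items() if c != k) / (n * (n - 1))
    return 1.0 if d_e == 0 else 1 - d_o / d_e

# Hypothetical codes for one dimension (e.g., 0 = absent, 1 = partial, 2 = fully present).
human_codes = [2, 1, 2, 0, 1, 2, 1, 0, 2, 1]
llm_codes   = [2, 1, 1, 0, 1, 2, 2, 0, 2, 1]
print(f"exact agreement      = {exact_agreement(human_codes, llm_codes):.3f}")
print(f"Krippendorff's alpha = {krippendorff_alpha_nominal(human_codes, llm_codes):.3f}")
```

In practice, an established implementation such as the krippendorff package on PyPI is preferable to hand-rolled code, particularly when codes are ordinal or when some units are missing values.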