The widespread availability of corpora and corpus tools has led to considerable advances in our understanding of phraseology and how it can be investigated. An increasingly influential approach to investigating phraseology is the retrieval and analysis of phrase frames (p-frames), recurrent word sequences with a variable slot. Although the p-frame approach is well established, pedagogically oriented research in this area is still new, in particular the application of phrase frames to generate useful lists of phrases. This paper aims to contribute to methodological efforts in this direction by presenting a case study that illustrates decision-making throughout the process, from corpus compilation to the generation of a final pedagogical list of phrases. Using a corpus of research article (RA) introductions in the Health Sciences, the study shows how key decisions, including p-frame length and frequency and range thresholds, were arrived at. It also discusses exclusion criteria based on the variability and predictability of p-frame fillers, and the fine-tuning of the resulting list for semantic coherence and pedagogical usefulness. Finally, we present the results of an initial evaluation of the list by stakeholders. The main contributions of the study to p-frame research methodology lie in the clarification of threshold settings, the potential contribution of variability and predictability to decision-making, and the proposal of more in-depth concordance analysis of co-text to aid the identification of phrases.
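To make the p-frame retrieval step concrete, the toy function below extracts n-grams with one variable slot from a small corpus and applies illustrative frequency and range thresholds. This is a minimal sketch, not the study's actual procedure: the function name, the threshold values, and the example corpus are all invented for illustration.

```python
from collections import defaultdict

def extract_pframes(texts, n=4, min_freq=2, min_range=2):
    """Extract n-gram phrase frames (one variable slot, shown as '*').

    Illustrative thresholds: min_freq = total occurrences of the frame,
    min_range = number of texts the frame occurs in. Frames whose slot
    attracts only one filler are dropped, since they are fixed n-grams.
    """
    freq = defaultdict(int)     # frame -> total count
    docs = defaultdict(set)     # frame -> indices of texts it occurs in
    fillers = defaultdict(set)  # frame -> attested slot fillers
    for i, text in enumerate(texts):
        tokens = text.lower().split()
        for start in range(len(tokens) - n + 1):
            gram = tokens[start:start + n]
            for slot in range(n):  # open one position at a time
                frame = tuple(gram[:slot] + ['*'] + gram[slot + 1:])
                freq[frame] += 1
                docs[frame].add(i)
                fillers[frame].add(gram[slot])
    return {
        ' '.join(f): sorted(fillers[f])
        for f in freq
        if freq[f] >= min_freq and len(docs[f]) >= min_range
        and len(fillers[f]) > 1
    }

frames = extract_pframes(["the aim of this study", "the aim of the study"])
# frames maps e.g. "the aim of *" to its attested fillers
```

On a real corpus of RA introductions, frames such as "the aim of *" would surface together with their attested fillers; the number of distinct fillers per slot gives a rough measure of the kind of slot variability that exclusion criteria can draw on.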
This tutorial paper describes the process of developing a custom natural language processing (NLP) model, with a particular focus on a discourse annotation task. After an overview of recent developments in NLP, the paper discusses the development of the Engagement Analyzer (Eguchi & Kyle, 2023), focusing on corpus annotation, the machine-learning model, model training, evaluation, and dissemination. A step-by-step tutorial of this process using the spaCy Python package is provided. The paper highlights the feasibility of developing custom NLP tools to enhance the scalability and replicability of the annotation of context-sensitive linguistic features in L2 writing research.
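As a small illustration of the evaluation step in such a pipeline, the snippet below computes precision, recall, and F1 for predicted annotation spans against gold-standard spans, using exact match on (start, end, label) triples. This is a library-free sketch of the metric only (spaCy ships its own scoring utilities); the span format and the engagement-style labels in the example are assumptions for illustration, not the Engagement Analyzer's actual output.

```python
def span_f1(gold, predicted):
    """Precision, recall, and F1 for predicted annotation spans against
    gold spans, with exact match on (start, end, label) triples."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # true positives: exact-match spans
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical token-offset spans with engagement-style labels.
gold = {(0, 2, "ENTERTAIN"), (5, 7, "ATTRIBUTE")}
pred = {(0, 2, "ENTERTAIN"), (5, 7, "COUNTER")}
print(span_f1(gold, pred))  # one of two spans matches exactly
```

Reporting span-level F1 against a held-out annotated set is what lets claims about the scalability and replicability of automated annotation be checked.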
Globalization and migration continue to shape our societies, including educational contexts such as school classrooms. In response to young learners’ linguistic needs, particularly in the context of foreign language learning in Germany, educational approaches need to be adapted for multilingual students. Current binary approaches to accounting for students’ diverse linguistic backgrounds in research assume a high degree of homogeneity among multilingual students. Linguistic distance measures may provide alternative, more fine-grained, continuous tools to account for linguistic diversity. This study employs lexical linguistic distance to account for young language learners’ linguistic diversity in a reanalysis of Jaekel et al. (2017). Additionally, mixed-effects modeling was employed to factor in within-class effects, in contrast to the structural equation modeling used previously. The results show that linguistic distance provides additional information beyond binary language status. Mixed-effects modeling yields comparable results with the same tendencies, but offers a more nuanced perspective on the data.
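As a minimal sketch of how a continuous lexical distance measure can be computed, the snippet below averages normalized edit distances over word pairs from two languages, yielding a score between 0.0 (identical word forms) and 1.0 (maximally distant). The word pairs and the specific formula are illustrative assumptions, not the measure used in the study.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions,
    substitutions), computed row by row with O(len(b)) memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def lexical_distance(pairs):
    """Mean normalized edit distance over translation-equivalent word
    pairs: 0.0 = identical word forms, 1.0 = maximally distant."""
    return sum(levenshtein(a, b) / max(len(a), len(b))
               for a, b in pairs) / len(pairs)

# Hypothetical German-English pairs for illustration.
pairs = [("hand", "hand"), ("wasser", "water"), ("haus", "house")]
print(lexical_distance(pairs))
```

Unlike a binary multilingual/monolingual status variable, such a score can enter a mixed-effects model as a continuous predictor, which is what allows more fine-grained gradations of linguistic diversity to be captured.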