Title: Estimating lexical diversity using the moving average type-token ratio (MATTR): Pros and cons
Author: Yves Bestgen
Journal: Research Methods in Applied Linguistics, 4(1), Article 100168
Published: 2024-11-21
DOI: 10.1016/j.rmal.2024.100168
URL: https://www.sciencedirect.com/science/article/pii/S2772766124000740
Citations: 0
Abstract
Several recent studies have strongly recommended the use of the moving average type-token ratio (MATTR) to estimate the lexical diversity (LD) of a text because it is the only length-insensitive index that can compare texts of different sizes. After pointing out that a length-insensitive index was proposed in the 1960s and is still being used, I analyse the properties of the MATTR computational procedure that enable it to control for the effects of length. This index is an excellent choice for evaluating the fluctuation of the LD throughout a relatively long text. However, its use for evaluating the overall LD of a text is questionable because the impact of tokens on the score varies according to their position in the text. I illustrate this problem using pseudo-texts and show that this impact is likely to affect a significant proportion of texts by analysing the distribution of hapaxes in texts by learners of Italian, Czech, German and English as a second language (L2).
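The position effect described in the abstract can be illustrated with a minimal sketch. It assumes the usual definition of MATTR as the mean type-token ratio over all overlapping windows of a fixed length. The function names `mattr` and `window_coverage`, the toy token lists, the window size of 3, and the fallback to plain TTR for texts shorter than the window are illustrative assumptions, not taken from the article.

```python
from collections import Counter


def mattr(tokens, window=50):
    """Mean TTR over every overlapping window of `window` tokens."""
    if window <= 0:
        raise ValueError("window must be a positive integer")
    n = len(tokens)
    if n == 0:
        return 0.0
    if n < window:
        # Assumption for this sketch: fall back to plain TTR on short texts.
        return len(set(tokens)) / n

    # Type counts for the first window, then slide one token at a time.
    counts = Counter(tokens[:window])
    total = len(counts) / window
    for start in range(1, n - window + 1):
        leaving = tokens[start - 1]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
        counts[tokens[start + window - 1]] += 1
        total += len(counts) / window
    return total / (n - window + 1)


def window_coverage(n_tokens, window):
    """Number of windows that contain each token position (0-indexed)."""
    last_start = n_tokens - window
    return [
        min(i, last_start) - max(0, i - window + 1) + 1
        for i in range(n_tokens)
    ]


# Two pseudo-texts with identical vocabulary: one hapax ("x") and five "a".
hapax_at_edge = ["x", "a", "a", "a", "a", "a"]
hapax_in_middle = ["a", "a", "x", "a", "a", "a"]

print(window_coverage(6, 3))      # edge tokens fall in fewer windows
print(mattr(hapax_at_edge, 3))    # hapax counted in a single window
print(mattr(hapax_in_middle, 3))  # hapax counted in three windows
```

Both pseudo-texts have the same plain TTR (2 types over 6 tokens), yet the hapax contributes to only one window when it opens the text and to three windows when it sits in the middle, so the two MATTR scores differ. This is a toy demonstration of the mechanism only; it does not reproduce the article's pseudo-texts or its learner-corpus analysis.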