Pub Date : 2023-10-02 DOI: 10.1080/09296174.2023.2263249
Yue Li, Yuan Gao, Xiaofei Lu
ABSTRACT Several studies have sought to characterize the syntactic features of research articles (RAs) and their part-genres. However, no study has examined the interrelation between different syntactic components (e.g. sentences and clauses) in the RA genre as a function of interacting internal and external factors (e.g. word limit) from a synergetic linguistic perspective. This study contributes to this line of research by investigating the effects of word limit (i.e. the restriction on the number of words used) on the length of sentences and clauses in RA abstracts. Our results show that RA abstracts contain significantly more longer sentences and clauses than the main body of RAs, but longer sentences in RA abstracts tend to have shorter constituting clauses, indicating that the Menzerath-Altmann Law is at play. Such an interrelation between sentence and clause length helps ensure a cognitively balanced system. Our findings have implications for the need to explore the interrelation between syntactic components emergent from the synergetic interactions of internal and external factors.
KEYWORDS: Academic journal article abstract; Menzerath-Altmann Law; sentence-clause interrelation; synergetic linguistics; word limit
Acknowledgments: We appreciate the editors and anonymous reviewers for their constructive comments and suggestions.
Disclosure statement: No potential conflict of interest was reported by the author(s).
Notes
1. We balanced AJAA and AJAB in terms of word tokens in this study. One reviewer recommended calculating the ratio of mean sentence (and clause) length for each abstract-body pair for the 26 RAs represented in the AJAB corpus and subsequently computing a mean ratio along with its 95% confidence interval. The results of this analysis are summarized in Appendix C. These results reveal patterns of differences similar to those reported in Table 2, with RA abstracts containing slightly longer sentences and slightly shorter clauses than RA bodies along with less variation, although the results appear inconclusive, possibly in part because of the relatively small number of pairs analysed and the smaller number of sentences in each abstract than in each body.
2. We balanced AJAA and AJAB in terms of word tokens in this study. One reviewer recommended running the MAL fitting analysis on the 26 abstracts and bodies of the RAs represented in AJAB for comparison purposes. Appendix D presents the mean clause length (measured in words) for sentences with different lengths in the 26 abstracts and bodies of the RAs represented in AJAB, and Appendix E presents the MAL fitting results on these abstracts and bodies. Similar to the results presented in Table 5, the coefficients of determination were larger than 0.9 for both corpora, with the RA abstracts showing a larger coefficient (0.9637 vs. 0.9380). Different from the results in Table 5, the F value for the RA abstracts did not reach statistical significance, and the b value for the RA abstracts was larger tha
{"title":"Effects of Word Limit on Sentence Length and Clause Length in Academic Journal Article Abstracts: A Synergetic Linguistic Perspective","authors":"Yue Li, Yuan Gao, Xiaofei Lu","doi":"10.1080/09296174.2023.2263249","DOIUrl":"https://doi.org/10.1080/09296174.2023.2263249","url":null,"abstract":"ABSTRACTSeveral studies have sought to characterize the syntactic features of research articles (RAs) and their part-genres. However, no study has examined the interrelation between different syntactic components (e.g. sentences and clauses) in the RA genre as a function of interacting internal and external factors (e.g. word limit) from a synergetic linguistic perspective. This study contributes to this line of research by investigating the effects of word limit (i.e. the restriction on the number of words used) on the length of sentences and clauses in RA abstracts. Our results show that RA abstracts contain significantly more longer sentences and clauses than the main body of RAs, but longer sentences in RA abstracts tend to have shorter constituting clauses, indicating that the Menzerath-Altmann Law is at play. Such an interrelation between sentence and clause length helps ensure a cognitively balanced system. Our findings have implications for the need to explore the interrelation between syntactic components emergent from the synergetic interactions of internal and external factors.KEYWORDS: Academic journal article abstractMenzerath-Altmann Lawsentence-clause interrelationsynergetic linguisticsword limit AcknowledgmentsWe appreciate the editors and anonymous reviewers for their constructive comments and suggestions.Disclosure statementNo potential conflict of interest was reported by the author(s).Notes1. We balanced AJAA and AJAB in terms of word tokens in this study. 
One reviewer recommended calculating the ratio of mean sentence (and clause) length for each abstract-body pair for the 26 RAs represented in the AJAB corpus and subsequently computing a mean ratio along with its 95% confidence interval. The results of this analysis are summarized in Appendix C. These results reveal similar patterns of differences as those reported in Table 2, with RA abstracts containing slightly longer sentences and slightly shorter clauses than RA bodies along with less variation, although the results appear inconclusive, possibly partially due to the relatively small number of pairs analysed and the smaller number of sentences in each abstract than in each body.2. We balanced AJAA and AJAB in terms of word tokens in this study. One reviewer recommended running the MAL fitting analysis on the 26 abstracts and bodies of the RAs represented in AJAB for comparison purposes. Appendix D presents the mean clause length (measured in words) for sentences with different lengths in the 26 abstracts and bodies of the RAs represented in AJAB, and Appendix E presents the MAL fitting results on these abstracts and bodies. Similar to the results presented in Table 5, the coefficients of determination were larger than 0.9 for both corpora, with the RA abstracts showing a larger coefficient (0.9637 vs. 0.9380). 
Different from the results in Table 5, the F value for the RA abstracts did not reach statistical significance, and the b value for the RA abstracts was larger tha","PeriodicalId":45514,"journal":{"name":"Journal of Quantitative Linguistics","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135829946","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
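The notes above mention fitting the Menzerath-Altmann Law (MAL) and reporting a b value and a coefficient of determination. The following is a minimal sketch of such a fit, assuming the common power-law form y = a * x**b (the abstract does not say which form of the law the authors fitted) and ordinary least squares on log-transformed data. The function name and the data are hypothetical and serve only as an illustration.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on log-log data.

    Returns (a, b, r2), where r2 is the coefficient of determination
    computed in log space (an assumption of this sketch).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx                      # slope in log space = exponent b
    a = math.exp(my - b * mx)          # intercept in log space = log(a)
    ss_res = sum((v - (math.log(a) + b * u)) ** 2 for u, v in zip(lx, ly))
    ss_tot = sum((v - my) ** 2 for v in ly)
    r2 = 1.0 - ss_res / ss_tot
    return a, b, r2

# Hypothetical data: sentence length (in clauses) and mean clause length (in words),
# generated from a = 12, b = -0.2 so the fit can be checked by eye.
sentence_len = [1, 2, 3, 4, 5]
mean_clause_len = [12.0 * x ** -0.2 for x in sentence_len]
a, b, r2 = fit_power_law(sentence_len, mean_clause_len)
```

A negative b corresponds to the pattern described in the abstract: the longer the sentence, the shorter its constituent clauses. A nonlinear fit on the untransformed data may give somewhat different parameter estimates.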
Pub Date : 2023-07-06 DOI: 10.1080/09296174.2023.2231743
Xinpei Hong, Wei Huang, Haitao Liu
{"title":"The Structural Complexity of Chinese Words and Its Relationship with Word Frequency","authors":"Xinpei Hong, Wei Huang, Haitao Liu","doi":"10.1080/09296174.2023.2231743","DOIUrl":"https://doi.org/10.1080/09296174.2023.2231743","url":null,"abstract":"","PeriodicalId":45514,"journal":{"name":"Journal of Quantitative Linguistics","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2023-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47269921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-04-18 DOI: 10.1080/09296174.2023.2202470
Dang Qi, Hua Wang
{"title":"Zipf’s Law for Speech Acts in Spoken English","authors":"Dang Qi, Hua Wang","doi":"10.1080/09296174.2023.2202470","DOIUrl":"https://doi.org/10.1080/09296174.2023.2202470","url":null,"abstract":"","PeriodicalId":45514,"journal":{"name":"Journal of Quantitative Linguistics","volume":"1 1","pages":""},"PeriodicalIF":1.4,"publicationDate":"2023-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45872910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-04-03 DOI: 10.1080/09296174.2023.2202061
Peter Zörnig, T. Berg
ABSTRACT Word length studies have been one of the central issues in Quantitative Linguistics for a long time. Most models were constructed for very specific purposes, i.e. the individual models apply only to a specific language, only to token counts or only to type counts. The present paper takes up the challenge of developing unifying models which account for both type and token frequencies of a moderately large sample of languages (eight Indo-European and two non-Indo-European languages). We introduce three models which can be well fitted to all our data: the exponentiated Hyper-Poisson distribution, the generalized gamma and the Sichel distribution. We also discuss the possibility of interpreting the model parameters linguistically.
{"title":"Unifying Models for Word Length Distributions Based on Types and Tokens","authors":"Peter Zörnig, T. Berg","doi":"10.1080/09296174.2023.2202061","DOIUrl":"https://doi.org/10.1080/09296174.2023.2202061","url":null,"abstract":"ABSTRACT Word length studies have been one of the central issues in Quantitative Linguistics for a long time. Most models were constructed for very specific purposes, i.e. the individual models apply only to a specific language, only to token counts or only to type counts. The present paper takes up the challenge of developing unifying models which account for both type and token frequencies of a moderately large sample of languages (eight Indo-European and two non-Indo-European languages). We introduce three models which can be well fitted to all our data: the exponentiated Hyper-Poisson distribution, the generalized gamma and the Sichel distribution. We also discuss the possibility of interpreting the model parameters linguistically.","PeriodicalId":45514,"journal":{"name":"Journal of Quantitative Linguistics","volume":"30 1","pages":"167 - 182"},"PeriodicalIF":1.4,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47161954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
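The distinction between type counts and token counts in the abstract above can be made concrete with a small sketch. It assumes word length is measured in letters (the paper may use another unit, such as syllables) and uses a made-up sentence. The function name is hypothetical.

```python
from collections import Counter

def word_length_distributions(tokens):
    """Return (token_dist, type_dist): counts of word lengths over all
    running words (tokens) and over distinct word forms (types)."""
    token_dist = Counter(len(w) for w in tokens)
    type_dist = Counter(len(w) for w in set(tokens))
    return token_dist, type_dist

# Made-up sample text; length is measured in letters here (an assumption).
tokens = "a cat saw a small dog and a cat ran away".split()
token_dist, type_dist = word_length_distributions(tokens)
```

In this sample the one-letter word "a" contributes three tokens but only one type, which is why the two distributions differ and why a model that covers both is harder to find than one that covers either alone.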
Pub Date : 2023-04-03 DOI: 10.1080/09296174.2023.2213107
Jieqiang Zhu, Jingyang Jiang
ABSTRACT The synergetic lexical model provides a unique framework for exploring the interrelationships between the lexical properties of languages. Previous studies concerning several properties of this lexical model have yielded many successful fitting results, but very few studies have investigated synonymy, a major property of this model. The present study uses 825 Chinese and 848 English tokens retrieved from Chinese and English corpora, dictionaries, and thesauri to conduct a contrastive study on the interrelations between four major properties of this lexical model: word length, word frequency, polysemy, and synonymy. The successful fittings for both languages demonstrate the cross-linguistic validity of the synergetic lexical model, even though English belongs to the Germanic language family, while Chinese, a highly analytic language, belongs to the Sino-Tibetan language family. Moreover, our analysis of the parameters of the fitting results shows that, compared to English, Chinese possesses a greater resistance to shortening word length and a quicker response to semantic change.
{"title":"Synergetic Properties of Lexical Structures in Chinese and English","authors":"Jieqiang Zhu, Jingyang Jiang","doi":"10.1080/09296174.2023.2213107","DOIUrl":"https://doi.org/10.1080/09296174.2023.2213107","url":null,"abstract":"ABSTRACT The synergetic lexical model provides a unique framework for exploration of the interrelationships between the lexical properties of languages. Previous studies concerning several properties of this lexical model have yielded many successful fittings results, but very few studies have investigated synonymy, a major property of this model. The present study uses 825 Chinese and 848 English tokens retrieved from Chinese and English corpora, dictionaries, and thesaurus to conduct a contrastive study on the interrelations between four major properties of this lexical model: word length, word frequency, polysemy, and synonymy. The successful fittings of both languages demonstrate the cross-linguistic validity of the synergetic lexical model, though English belongs to the Germanic language family, while Chinese, a highly analytical language, is of the Sino-Tibetan language family. Moreover, our analysis of the parameters of the fitting results shows that, compared to English, Chinese possesses a greater resistance to shortening word length and a quicker response to semantic change.","PeriodicalId":45514,"journal":{"name":"Journal of Quantitative Linguistics","volume":"30 1","pages":"204 - 230"},"PeriodicalIF":1.4,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48436195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-04-03 DOI: 10.1080/09296174.2023.2209487
Yiyang Hu, Qingshun He
ABSTRACT Adnominals are an important resource of noun modification in written registers, especially in academic writing. This study compares the frequencies of adjectival adnominals and nominal adnominals across two registers (Fiction and Academic writing) by calculating T-values and conducting Welch’s t-tests on the adnominal subtypes. It is found that the preference for nominal adnominals exists in both registers and that the mean frequencies of adjectival adnominals, premodifying nouns and postmodifying nouns increase as the register moves from Fiction to Academic writing. We further investigate the frequencies of adnominals in research article abstracts across three disciplinary groups by conducting Welch’s ANOVA test. No significant difference is revealed in T-values in the research article abstracts across disciplines. The differences in adjectival adnominals, nouns as postmodifiers and appositive nouns lack practical applications, while the effects of disciplines on the frequency of premodifying nouns cannot be rejected. It is the mean frequencies of premodifying nouns that show a significant difference in the research article abstracts across disciplines. Premodifying nouns are more prevalent in hard science texts than in soft science texts.
{"title":"A Corpus-Based Study of the Distributions of Adnominals Across Registers and Disciplines","authors":"Yiyang Hu, Qingshun He","doi":"10.1080/09296174.2023.2209487","DOIUrl":"https://doi.org/10.1080/09296174.2023.2209487","url":null,"abstract":"ABSTRACT Adnominals are an important resource of noun modification in written registers, especially in academic writing. This study compares the frequencies of adjectival adnominals and nominal adnominals across two registers (Fiction and Academic writing) by calculating T-values and conducting Welch’s t-tests on the adnominal subtypes. It is found that the preference for nominal adnominals exists in both the two registers and the mean frequencies of adjectival adnominals, premodifying nouns and postmodifying nouns increase as the register moves from Fiction to Academic writing. We further investigate the frequencies of adnominals in the research article abstracts across three disciplinary groups by conducting Welch’s ANOVA test. No significant difference is revealed in T-values in the research article abstracts across disciplines. The difference of adjectival adnominals, nouns as postmodifiers and appositive nouns lacks practical applications, while the effects of disciplines on the frequency of premodifying nouns cannot be rejected. It is the mean frequencies of premodifying nouns that show the significant difference in the research article abstracts across disciplines. 
Premodifying nouns are more prevalent in hard science texts than in soft science texts.","PeriodicalId":45514,"journal":{"name":"Journal of Quantitative Linguistics","volume":"30 1","pages":"183 - 203"},"PeriodicalIF":1.4,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45089987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
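The abstract above relies on Welch’s t-test, which compares two means without assuming equal variances. The sketch below computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The per-text frequencies are invented for illustration and are not the study's data.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)   # sample variances
    se2 = va / na + vb / nb                           # squared standard error
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical adnominal frequencies per text in two registers
fiction = [10.0, 12.0, 14.0]
academic = [20.0, 22.0, 24.0, 26.0]
t, df = welch_t(fiction, academic)
```

A p-value would then be read from the t distribution with df degrees of freedom, for example with a statistics package. That step is left out to keep the sketch dependency-free.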
Pub Date : 2023-02-02 DOI: 10.1080/09296174.2023.2172711
Robert N. Nelson
ABSTRACT Gries (2008, 2021) defined two dispersion measures able to alert corpus analysts to words that have a problematically limited distribution. Gries (2010, 2022) posited that these measures may additionally be relevant to language development research, as the learnability of a pattern may be predicted by the evenness of its distribution in corpora. However, both measures work by comparing vectors of observed and expected frequencies in partitioned corpora and this method cannot determine that a word is evenly distributed because it cannot distinguish the random noise inherent to an unbiased process from substantial non-random bias. An additional concern with the 2008 measure is raised: the 2008 measure is Manhattan distance scaled to the unit interval and, as such, it is extremely sensitive to the number of corpus parts because this choice sets the dimensionality of the measure space. In sum, this short analysis presents evidence that these measures should not be used to declare a pattern evenly distributed as neither can tell the difference between statistical noise and systematic bias.
{"title":"Too Noisy at the Bottom: Why Gries’ (2008, 2020) Dispersion Measures Cannot Identify Unbiased Distributions of Words","authors":"Robert N. Nelson","doi":"10.1080/09296174.2023.2172711","DOIUrl":"https://doi.org/10.1080/09296174.2023.2172711","url":null,"abstract":"ABSTRACT Gries (2008, 2021) defined two dispersion measures able to alert corpus analysts to words that have a problematically limited distribution. Gries (2010, 2022) posited that these measures may additionally be relevant to language development research, as the learnability of a pattern may be predicted by the evenness of its distribution in corpora. However, both measures work by comparing vectors of observed and expected frequencies in partitioned corpora and this method cannot determine that a word is evenly distributed because it cannot distinguish the random noise inherent to an unbiased process from substantial non-random bias. An additional concern with the 2008 measure is raised: the 2008 measure is Manhattan distance scaled to the unit interval and, as such, it is extremely sensitive to the number of corpus parts because this choice sets the dimensionality of the measure space. In sum, this short analysis presents evidence that these measures should not be used to declare a pattern evenly distributed as neither can tell the difference between statistical noise and systematic bias.","PeriodicalId":45514,"journal":{"name":"Journal of Quantitative Linguistics","volume":"30 1","pages":"153 - 166"},"PeriodicalIF":1.4,"publicationDate":"2023-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45407703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
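The abstract above describes the 2008 measure as a Manhattan distance between observed and expected proportions, scaled to the unit interval. A minimal sketch of that computation follows, assuming the usual formulation of Gries' DP (half the sum of absolute differences between a word's share per corpus part and the part's share of the corpus). The function name and the counts are hypothetical.

```python
def deviation_of_proportions(word_counts, part_sizes):
    """Half the Manhattan distance between observed and expected proportions.

    word_counts[i]: occurrences of the word in corpus part i
    part_sizes[i]:  size of corpus part i in words
    """
    total_words = sum(word_counts)
    total_size = sum(part_sizes)
    observed = [c / total_words for c in word_counts]
    expected = [s / total_size for s in part_sizes]
    return 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))

# Hypothetical corpus split into four equal parts
even = deviation_of_proportions([10, 10, 10, 10], [1000, 1000, 1000, 1000])
clumped = deviation_of_proportions([40, 0, 0, 0], [1000, 1000, 1000, 1000])
```

Here `even` is 0 and `clumped` is 0.75. The abstract's point is that counts drawn from an unbiased random process will rarely match the expected proportions exactly, so a small non-zero value cannot by itself separate sampling noise from systematic bias.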
Pub Date : 2022-10-13 DOI: 10.1080/09296174.2022.2129377
Brent D. Burch, Jesse Egbert
ABSTRACT A ranked word list provides information about the position of each word in the list. However, retaining and employing the measure used to generate the ranked list can yield additional information about the words. If a numerical measure denotes the prevalence of a word in a corpus, then not only can its values be ordered, they can also be compared to one another, and words having similar values can be grouped together into equivalence classes. Measures of word prevalence include mean text frequency, the dispersion of words across texts in a corpus, or a measure that combines frequency and dispersion. In this paper, we examine the concepts of word equivalence classes and hierarchical word tiers and apply these concepts to the words in the British National Corpus (BNC). Hierarchical word tiers can be constructed without the knowledge of all pairwise comparisons of the words under study. By grouping words that have similar values of prevalence, the rank-ordered list reduces to an informative set of hierarchical word tiers where each tier contains words that are similar to one another in terms of their use in the corpus.
{"title":"Word Use Equivalence and Hierarchical Word Tiers","authors":"Brent D. Burch, Jesse Egbert","doi":"10.1080/09296174.2022.2129377","DOIUrl":"https://doi.org/10.1080/09296174.2022.2129377","url":null,"abstract":"ABSTRACT A ranked word list provides information about the position of each word in the list. However, retaining and employing the measure used to generate the ranked list can yield additional information about the words. If denotes the prevalence of a word in a corpus, then not only can the values of be ordered, their values can be compared to one another, and words having similar values can be grouped together into equivalence classes. Measures of word prevalence include mean text frequency, the dispersion of words across texts in a corpus, or a measure that combines frequency and dispersion. In this paper, we examine the concepts of word equivalence classes and hierarchical word tiers and apply these concepts to the words in the British National Corpus (BNC). Hierarchical word tiers can be constructed without the knowledge of all pairwise comparisons of the words under study. By grouping words that have similar values of prevalence, the ranked ordered list reduces to an informative set of hierarchical word tiers where each tier contains words that are similar to one another in terms of their use in the corpus.","PeriodicalId":45514,"journal":{"name":"Journal of Quantitative Linguistics","volume":"30 1","pages":"104 - 124"},"PeriodicalIF":1.4,"publicationDate":"2022-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47466683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2022-07-25 DOI: 10.1080/09296174.2022.2095751
M. Vakulenko
ABSTRACT A metric method for numerically measuring phonetic and phonemic distances, or contrasts, between speech sounds is put forward. The feature values of the compared phones, taken from the standard IPA charts, are treated as independent parameters that give rise to corresponding Euclidean distances. As an illustration, the general phone set is mapped to Ukrainian phonemes. The proposed model agrees well with the historical linguistic facts and experimental phonetic data. The described approach may find applications in various fields of linguistics and speech technologies, including historical and typological linguistics, language acquisition, phonetic studies, computational phonology, machine translation, information retrieval, and text-to-speech conversion.
{"title":"Unified Parametrization of Phonetic Features and Numerical Calculation of Phonetic Distances between Speech Sounds","authors":"M. Vakulenko","doi":"10.1080/09296174.2022.2095751","DOIUrl":"https://doi.org/10.1080/09296174.2022.2095751","url":null,"abstract":"ABSTRACT A metric method to numerically measure phonetic and phonemic distances or contrasts, between speech sounds, is put forward. The feature values of the compared phones taken from the standard IPA charts are treated as independent parameters that give rise to corresponding Euclidean distances. As an illustration, the general phone set is mapped to Ukrainian phonemes. The proposed model agrees well with the historical linguistic facts and experimental phonetic data. The described approach may find its due applications in various fields of linguistics and speech technologies, including historical and typological linguistics, language acquisition, phonetic studies, computational phonology, machine translation, information retrieval, and text-to-speech conversion.","PeriodicalId":45514,"journal":{"name":"Journal of Quantitative Linguistics","volume":"30 1","pages":"67 - 85"},"PeriodicalIF":1.4,"publicationDate":"2022-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43046455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
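The idea of treating feature values as independent parameters that yield Euclidean distances can be sketched as follows. The feature names and numeric codings below are invented for illustration. The abstract does not give the paper's actual parametrization, so this shows only the distance computation itself.

```python
import math

def feature_distance(phone_a, phone_b):
    """Euclidean distance between two phones given as feature dictionaries."""
    if phone_a.keys() != phone_b.keys():
        raise ValueError("phones must be described by the same features")
    return math.sqrt(sum((phone_a[k] - phone_b[k]) ** 2 for k in phone_a))

# Hypothetical numeric codings of three consonants (not taken from the paper)
p = {"voicing": 0.0, "place": 1.0, "manner": 1.0}
b = {"voicing": 1.0, "place": 1.0, "manner": 1.0}
m = {"voicing": 1.0, "place": 1.0, "manner": 3.0}

d_pb = feature_distance(p, b)   # differ in voicing only
d_pm = feature_distance(p, m)   # differ in voicing and manner
```

Under this toy coding the pair differing in one feature is closer than the pair differing in two. How features are scaled against each other is the substantive modelling choice, and it is not specified here.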
Pub Date : 2022-06-05 DOI: 10.1080/09296174.2022.2122751
Maciej Eder, Rafal L. Górski
ABSTRACT In stylometric investigations, frequencies of the most frequent words (MFWs) and character n-grams outperform other style-markers, even if their performance varies significantly across languages. In inflected languages, word endings play a prominent role, and hence different word forms cannot be recognized using generic text tokenization. Countless inflected word forms make frequencies sparse, making most statistical procedures complicated. Presumably, applying one of the NLP techniques, such as lemmatization and/or parsing, might increase the performance of classification. The aim of this paper is to examine the usefulness of grammatical features (as assessed via POS-tag n-grams) and lemmatized forms in recognizing authorial profiles, in order to address the underlying issue of the degree of freedom of choice within lexis and grammar. Using a corpus of Polish novels, we performed a series of supervised authorship attribution benchmarks, in order to compare the classification accuracy for different types of lexical and syntactic style-markers. Even if the performance of POS-tags as well as lemmatized forms was notoriously worse than that of lexical markers, the difference was not substantial and never exceeded ca. 15%.
{"title":"Stylistic Fingerprints, POS-tags, and Inflected Languages: A Case Study in Polish","authors":"Maciej Eder, Rafal L. Górski","doi":"10.1080/09296174.2022.2122751","DOIUrl":"https://doi.org/10.1080/09296174.2022.2122751","url":null,"abstract":"ABSTRACT In stylometric investigations, frequencies of the most frequent words (MFWs) and character n-grams outperform other style-markers, even if their performance varies significantly across languages. In inflected languages, word endings play a prominent role, and hence different word forms cannot be recognized using generic text tokenization. Countless inflected word forms make frequencies sparse, making most statistical procedures complicated. Presumably, applying one of the NLP techniques, such as lemmatization and/or parsing, might increase the performance of classification. The aim of this paper is to examine the usefulness of grammatical features (as assessed via POS-tag n-grams) and lemmatized forms in recognizing authorial profiles, in order to address the underlying issue of the degree of freedom of choice within lexis and grammar. Using a corpus of Polish novels, we performed a series of supervised authorship attribution benchmarks, in order to compare the classification accuracy for different types of lexical and syntactic style-markers. Even if the performance of POS-tags as well as lemmatized forms was notoriously worse than that of lexical markers, the difference was not substantial and never exceeded ca. 
15%.","PeriodicalId":45514,"journal":{"name":"Journal of Quantitative Linguistics","volume":"30 1","pages":"86 - 103"},"PeriodicalIF":1.4,"publicationDate":"2022-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47075290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
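The POS-tag n-grams mentioned in the abstract above are sequences of n consecutive part-of-speech tags whose relative frequencies serve as style-markers. A minimal sketch of extracting them is given below. The tag sequence is hypothetical and stands in for the output of a tagger, which is not shown.

```python
from collections import Counter

def pos_ngram_frequencies(tags, n):
    """Return relative frequencies of POS-tag n-grams in a tag sequence."""
    grams = [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}

# Hypothetical tagger output for a short text
tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "ADV"]
freqs = pos_ngram_frequencies(tags, 2)
```

In an attribution benchmark, one such frequency vector per text would be passed to a classifier. The same function applied to lemmas instead of tags would give the lemmatized counterpart of a word-frequency profile.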