Journal of Quantitative Linguistics: Latest Articles

Effects of Word Limit on Sentence Length and Clause Length in Academic Journal Article Abstracts: A Synergetic Linguistic Perspective

Zone 2 (Literature) 0 LANGUAGE & LINGUISTICS

Journal of Quantitative Linguistics

Pub Date : 2023-10-02 DOI: 10.1080/09296174.2023.2263249

Yue Li, Yuan Gao, Xiaofei Lu

ABSTRACT Several studies have sought to characterize the syntactic features of research articles (RAs) and their part-genres. However, no study has examined the interrelation between different syntactic components (e.g. sentences and clauses) in the RA genre as a function of interacting internal and external factors (e.g. word limit) from a synergetic linguistic perspective. This study contributes to this line of research by investigating the effects of word limit (i.e. the restriction on the number of words used) on the length of sentences and clauses in RA abstracts. Our results show that RA abstracts contain significantly more longer sentences and clauses than the main body of RAs, but longer sentences in RA abstracts tend to have shorter constituting clauses, indicating that the Menzerath-Altmann Law is at play. Such an interrelation between sentence and clause length helps ensure a cognitively balanced system. Our findings have implications for the need to explore the interrelation between syntactic components emergent from the synergetic interactions of internal and external factors.

KEYWORDS: Academic journal article abstract; Menzerath-Altmann Law; sentence-clause interrelation; synergetic linguistics; word limit

Acknowledgments: We appreciate the editors and anonymous reviewers for their constructive comments and suggestions.

Disclosure statement: No potential conflict of interest was reported by the author(s).

Notes

1. We balanced AJAA and AJAB in terms of word tokens in this study. One reviewer recommended calculating the ratio of mean sentence (and clause) length for each abstract-body pair for the 26 RAs represented in the AJAB corpus and subsequently computing a mean ratio along with its 95% confidence interval. The results of this analysis are summarized in Appendix C. These results reveal similar patterns of differences as those reported in Table 2, with RA abstracts containing slightly longer sentences and slightly shorter clauses than RA bodies along with less variation, although the results appear inconclusive, possibly partially due to the relatively small number of pairs analysed and the smaller number of sentences in each abstract than in each body.

2. We balanced AJAA and AJAB in terms of word tokens in this study. One reviewer recommended running the MAL fitting analysis on the 26 abstracts and bodies of the RAs represented in AJAB for comparison purposes. Appendix D presents the mean clause length (measured in words) for sentences with different lengths in the 26 abstracts and bodies of the RAs represented in AJAB, and Appendix E presents the MAL fitting results on these abstracts and bodies. Similar to the results presented in Table 5, the coefficients of determination were larger than 0.9 for both corpora, with the RA abstracts showing a larger coefficient (0.9637 vs. 0.9380). Different from the results in Table 5, the F value for the RA abstracts did not reach statistical significance, and the b value for the RA abstracts was larger than that for the RA bodies, which may be due to the small number of data points (i.e. 3) for the RA abstracts (see Appendix D).

Additional information: This study was supported by the Beijing Social Science Fund (No. 18YYB002) and the Fundamental Research Funds for the Central Universities.
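The Menzerath-Altmann Law (MAL) fitting mentioned in the abstract and notes above can be illustrated with a minimal sketch. It assumes the simple power-law form y = a·x^b, where x is sentence length in clauses and y is mean clause length in words; the entry does not state which form or fitting procedure the authors used. The function name `fit_menzerath_altmann` and the data points are hypothetical, and the coefficient of determination is computed in log space, so it need not match values from a nonlinear fit.

```python
import math


def fit_menzerath_altmann(lengths, mean_constituent_lengths):
    """Fit y = a * x**b by ordinary least squares on log-transformed data.

    Returns (a, b, r_squared), with r_squared measured in log space.
    """
    log_x = [math.log(x) for x in lengths]
    log_y = [math.log(y) for y in mean_constituent_lengths]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n

    # Slope and intercept of the regression line log y = log a + b * log x.
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)

    # Coefficient of determination for the log-log regression.
    ss_res = sum((y - (math.log(a) + b * x)) ** 2 for x, y in zip(log_x, log_y))
    ss_tot = sum((y - mean_y) ** 2 for y in log_y)
    r_squared = 1.0 - ss_res / ss_tot
    return a, b, r_squared


# Hypothetical data (not from the paper):
# sentence length in clauses -> mean clause length in words.
sentence_lengths = [1, 2, 3, 4, 5]
mean_clause_lengths = [14.2, 11.8, 10.5, 9.9, 9.1]

a, b, r2 = fit_menzerath_altmann(sentence_lengths, mean_clause_lengths)
print(f"a = {a:.3f}, b = {b:.3f}, R^2 (log space) = {r2:.4f}")
```

A negative b corresponds to the pattern the abstract describes: longer sentences tend to have shorter constituent clauses.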

Citations: 0

The Structural Complexity of Chinese Words and Its Relationship with Word Frequency

IF 1.4 Zone 2 (Literature) 0 LANGUAGE & LINGUISTICS

Journal of Quantitative Linguistics

Pub Date : 2023-07-06 DOI: 10.1080/09296174.2023.2231743

Xinpei Hong, Wei Huang, Haitao Liu

Citations: 0

Zipf’s Law for Speech Acts in Spoken English

IF 1.4 Zone 2 (Literature) 0 LANGUAGE & LINGUISTICS

Journal of Quantitative Linguistics

Pub Date : 2023-04-18 DOI: 10.1080/09296174.2023.2202470

Dang Qi, Hua Wang

Citations: 3

Unifying Models for Word Length Distributions Based on Types and Tokens

IF 1.4 Zone 2 (Literature) 0 LANGUAGE & LINGUISTICS

Journal of Quantitative Linguistics

Pub Date : 2023-04-03 DOI: 10.1080/09296174.2023.2202061

Peter Zörnig, T. Berg

ABSTRACT Word length studies have been one of the central issues in Quantitative Linguistics for a long time. Most models were constructed for very specific purposes, i.e. the individual models apply only to a specific language, only to token counts or only to type counts. The present paper takes up the challenge of developing unifying models which account for both type and token frequencies of a moderately large sample of languages (eight Indo-European and two non-Indo-European languages). We introduce three models which can be well fitted to all our data: the exponentiated Hyper-Poisson distribution, the generalized gamma and the Sichel distribution. We also discuss the possibility of interpreting the model parameters linguistically.

Citations: 1

Synergetic Properties of Lexical Structures in Chinese and English

IF 1.4 Zone 2 (Literature) 0 LANGUAGE & LINGUISTICS

Journal of Quantitative Linguistics

Pub Date : 2023-04-03 DOI: 10.1080/09296174.2023.2213107

Jieqiang Zhu, Jingyang Jiang

ABSTRACT The synergetic lexical model provides a unique framework for exploration of the interrelationships between the lexical properties of languages. Previous studies concerning several properties of this lexical model have yielded many successful fitting results, but very few studies have investigated synonymy, a major property of this model. The present study uses 825 Chinese and 848 English tokens retrieved from Chinese and English corpora, dictionaries, and thesauri to conduct a contrastive study on the interrelations between four major properties of this lexical model: word length, word frequency, polysemy, and synonymy. The successful fittings of both languages demonstrate the cross-linguistic validity of the synergetic lexical model, though English belongs to the Germanic language family, while Chinese, a highly analytical language, is of the Sino-Tibetan language family. Moreover, our analysis of the parameters of the fitting results shows that, compared to English, Chinese possesses a greater resistance to shortening word length and a quicker response to semantic change.

Citations: 0

A Corpus-Based Study of the Distributions of Adnominals Across Registers and Disciplines

IF 1.4 Zone 2 (Literature) 0 LANGUAGE & LINGUISTICS

Journal of Quantitative Linguistics

Pub Date : 2023-04-03 DOI: 10.1080/09296174.2023.2209487

Yiyang Hu, Qingshun He

ABSTRACT Adnominals are an important resource of noun modification in written registers, especially in academic writing. This study compares the frequencies of adjectival adnominals and nominal adnominals across two registers (Fiction and Academic writing) by calculating T-values and conducting Welch’s t-tests on the adnominal subtypes. It is found that the preference for nominal adnominals exists in both the two registers and the mean frequencies of adjectival adnominals, premodifying nouns and postmodifying nouns increase as the register moves from Fiction to Academic writing. We further investigate the frequencies of adnominals in the research article abstracts across three disciplinary groups by conducting Welch’s ANOVA test. No significant difference is revealed in T-values in the research article abstracts across disciplines. The difference of adjectival adnominals, nouns as postmodifiers and appositive nouns lacks practical applications, while the effects of disciplines on the frequency of premodifying nouns cannot be rejected. It is the mean frequencies of premodifying nouns that show the significant difference in the research article abstracts across disciplines. Premodifying nouns are more prevalent in hard science texts than in soft science texts.
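The Welch's t-test named in the abstract above compares two group means without assuming equal variances. The sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom with the Python standard library only; it stops short of a p-value, which needs a t-distribution CDF that the standard library does not provide. The function name `welch_t` and the frequency values are hypothetical and are not taken from the study.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean, using the sample variance (n - 1).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical per-text frequencies of premodifying nouns (per 1,000 words).
fiction = [18.2, 21.5, 19.9, 17.4, 20.3]
academic = [41.0, 55.3, 47.8, 62.1, 50.6]

t_stat, dof = welch_t(fiction, academic)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

Unlike Student's t-test, the degrees of freedom here are generally not an integer and shrink when the two sample variances differ strongly.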

Citations: 0

Too Noisy at the Bottom: Why Gries’ (2008, 2020) Dispersion Measures Cannot Identify Unbiased Distributions of Words

IF 1.4 Zone 2 (Literature) 0 LANGUAGE & LINGUISTICS

Journal of Quantitative Linguistics

Pub Date : 2023-02-02 DOI: 10.1080/09296174.2023.2172711

Robert N. Nelson

ABSTRACT Gries (2008, 2021) defined two dispersion measures able to alert corpus analysts to words that have a problematically limited distribution. Gries (2010, 2022) posited that these measures may additionally be relevant to language development research, as the learnability of a pattern may be predicted by the evenness of its distribution in corpora. However, both measures work by comparing vectors of observed and expected frequencies in partitioned corpora and this method cannot determine that a word is evenly distributed because it cannot distinguish the random noise inherent to an unbiased process from substantial non-random bias. An additional concern with the 2008 measure is raised: the 2008 measure is Manhattan distance scaled to the unit interval and, as such, it is extremely sensitive to the number of corpus parts because this choice sets the dimensionality of the measure space. In sum, this short analysis presents evidence that these measures should not be used to declare a pattern evenly distributed as neither can tell the difference between statistical noise and systematic bias.
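The abstract's point about the 2008 measure can be sketched in a few lines, assuming that measure has the form commonly given for Gries' deviation of proportions (DP): half the Manhattan distance between the observed and expected proportion vectors. The function names and the simulation setup (a word with 50 tokens placed uniformly at random over equal-sized corpus parts) are illustrative assumptions, not the paper's own experiment.

```python
import random


def deviation_of_proportions(word_counts, part_sizes):
    """Half the Manhattan distance between observed and expected proportions.

    word_counts[i] is the word's frequency in corpus part i;
    part_sizes[i] is the size of part i in tokens.
    """
    total_count = sum(word_counts)
    total_size = sum(part_sizes)
    return 0.5 * sum(
        abs(count / total_count - size / total_size)
        for count, size in zip(word_counts, part_sizes)
    )


def simulate_unbiased_dp(frequency, n_parts, rng):
    """DP of a word whose tokens fall uniformly at random into equal-sized parts."""
    counts = [0] * n_parts
    for _ in range(frequency):
        counts[rng.randrange(n_parts)] += 1
    return deviation_of_proportions(counts, [1] * n_parts)


rng = random.Random(0)  # fixed seed so the run is repeatable
for n_parts in (10, 100, 1000):
    dp = simulate_unbiased_dp(50, n_parts, rng)
    print(f"{n_parts:>5} parts: DP = {dp:.3f}")
```

Even though the placement process is unbiased, the simulated values are not zero, and with 50 tokens spread over 1,000 parts the value cannot fall below 0.95, which illustrates the sensitivity to the number of corpus parts that the abstract raises.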

Citations: 0

Word Use Equivalence and Hierarchical Word Tiers

IF 1.4 Zone 2 (Literature) 0 LANGUAGE & LINGUISTICS

Journal of Quantitative Linguistics

Pub Date : 2022-10-13 DOI: 10.1080/09296174.2022.2129377

Brent D. Burch, Jesse Egbert

ABSTRACT A ranked word list provides information about the position of each word in the list. However, retaining and employing the measure used to generate the ranked list can yield additional information about the words. If denotes the prevalence of a word in a corpus, then not only can the values of be ordered, their values can be compared to one another, and words having similar values can be grouped together into equivalence classes. Measures of word prevalence include mean text frequency, the dispersion of words across texts in a corpus, or a measure that combines frequency and dispersion. In this paper, we examine the concepts of word equivalence classes and hierarchical word tiers and apply these concepts to the words in the British National Corpus (BNC). Hierarchical word tiers can be constructed without the knowledge of all pairwise comparisons of the words under study. By grouping words that have similar values of prevalence, the ranked ordered list reduces to an informative set of hierarchical word tiers where each tier contains words that are similar to one another in terms of their use in the corpus.

Citations: 0

Unified Parametrization of Phonetic Features and Numerical Calculation of Phonetic Distances between Speech Sounds

IF 1.4 Zone 2 (Literature) 0 LANGUAGE & LINGUISTICS

Journal of Quantitative Linguistics

Pub Date : 2022-07-25 DOI: 10.1080/09296174.2022.2095751

M. Vakulenko

ABSTRACT A metric method to numerically measure phonetic and phonemic distances or contrasts, between speech sounds, is put forward. The feature values of the compared phones taken from the standard IPA charts are treated as independent parameters that give rise to corresponding Euclidean distances. As an illustration, the general phone set is mapped to Ukrainian phonemes. The proposed model agrees well with the historical linguistic facts and experimental phonetic data. The described approach may find its due applications in various fields of linguistics and speech technologies, including historical and typological linguistics, language acquisition, phonetic studies, computational phonology, machine translation, information retrieval, and text-to-speech conversion.
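The core idea in the abstract above, treating feature values as independent parameters that give rise to Euclidean distances, can be sketched as follows. The feature table is a hypothetical toy encoding (place, manner, voicing) invented for illustration; it is not the parametrization proposed in the paper or a reading of the IPA charts.

```python
import math

# Hypothetical numeric feature vectors: (place, manner, voicing).
HYPOTHETICAL_FEATURES = {
    "p": (0.0, 0.0, 0.0),
    "b": (0.0, 0.0, 1.0),
    "m": (0.0, 2.0, 1.0),
    "t": (1.0, 0.0, 0.0),
}


def phonetic_distance(phone_a, phone_b, features=HYPOTHETICAL_FEATURES):
    """Euclidean distance between the feature vectors of two phones."""
    return math.dist(features[phone_a], features[phone_b])


for pair in (("p", "b"), ("p", "t"), ("p", "m")):
    print(pair, round(phonetic_distance(*pair), 3))
```

Because each feature is an independent axis, phones that differ in one feature by one unit are exactly distance 1 apart, and differences on several features combine by the Pythagorean rule.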

Citations: 1

Stylistic Fingerprints, POS-tags, and Inflected Languages: A Case Study in Polish

IF 1.4 Zone 2 (Literature) 0 LANGUAGE & LINGUISTICS

Journal of Quantitative Linguistics

Pub Date : 2022-06-05 DOI: 10.1080/09296174.2022.2122751

Maciej Eder, Rafal L. Górski

ABSTRACT In stylometric investigations, frequencies of the most frequent words (MFWs) and character n-grams outperform other style-markers, even if their performance varies significantly across languages. In inflected languages, word endings play a prominent role, and hence different word forms cannot be recognized using generic text tokenization. Countless inflected word forms make frequencies sparse, making most statistical procedures complicated. Presumably, applying one of the NLP techniques, such as lemmatization and/or parsing, might increase the performance of classification. The aim of this paper is to examine the usefulness of grammatical features (as assessed via POS-tag n-grams) and lemmatized forms in recognizing authorial profiles, in order to address the underlying issue of the degree of freedom of choice within lexis and grammar. Using a corpus of Polish novels, we performed a series of supervised authorship attribution benchmarks, in order to compare the classification accuracy for different types of lexical and syntactic style-markers. Even if the performance of POS-tags as well as lemmatized forms was notoriously worse than that of lexical markers, the difference was not substantial and never exceeded ca. 15%.

Citations: 2
