Pub Date: 2020-06-10  DOI: 10.1080/09296174.2020.1766346
Yaqian Shi, L. Lei
ABSTRACT Text length is a major concern in the measurement of lexical richness, and how lexical richness is affected by text length remains an open question. The present study aims to explore the relation between text length and lexical richness from an entropy-based perspective. Results show a non-linear growth pattern of lexical richness with increasing text length. Specifically, lexical richness increases rapidly in shorter texts; it soon reaches a boundary point beyond which it stabilizes despite the continued expansion of text length. The boundary point of lexical richness under the Shannon estimation is around 1,000 tokens, while under the Zhang estimation it is lower and more varied, falling at 500, 800, or 1,000 tokens. Such stability may be explained by the stabilization of word probabilities in the text.
Title: Lexical Richness and Text Length: An Entropy-based Perspective. Journal of Quantitative Linguistics, 29(1), 62–79.
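An entropy-based measure of lexical richness can be illustrated with a plug-in (maximum-likelihood) Shannon estimate over the word-frequency distribution. This is a minimal sketch, not the paper's exact Shannon or Zhang estimators, and the sample text is invented:

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    """Plug-in (maximum-likelihood) Shannon entropy of the word
    distribution, in bits; one simple entropy-based richness measure."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

text = "the cat sat on the mat and the dog sat on the rug".split()
richness = shannon_entropy(text)
```

A text that repeats one word has entropy 0; a text of n equiprobable distinct words has entropy log2(n), so the measure grows with vocabulary diversity until word probabilities stabilize.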
Pub Date: 2020-06-07  DOI: 10.1080/09296174.2020.1767481
Rui Feng, Congcong Yang, Yunhua Qu
ABSTRACT Recent advances in natural language processing have catalysed active research in designing algorithms to generate contextual vector representations of words, or word embedding, in the machine learning and computational linguistics community. Existing works pay little attention to patterns of words, which encode rich semantic information and impose semantic constraints on a word’s context. This paper explores the feasibility of incorporating word embedding with pattern grammar, a grammar model to describe the syntactic environment of lexical items. Specifically, this research develops a method to extract patterns with semantic information of word embedding and investigates the statistical regularities and distributional semantics of the extracted patterns. The major results of this paper are as follows. Experiments on the LCMC Chinese corpus reveal that the frequency of patterns follows Zipf’s hypothesis, and the frequency and pattern length are inversely related. Therefore, the proposed method enables the study of distributional properties of patterns in large-scale corpora. Furthermore, experiments illustrate that our extracted patterns impose semantic constraints on context, proving that patterns encode rich semantic and contextual information. This sheds light on the potential applications of pattern-based word embedding in a wide range of natural language processing tasks.
Title: A Word Embedding Model for Analyzing Patterns and Their Distributional Semantics. Journal of Quantitative Linguistics, 29(1), 80–105.
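The reported inverse relation between pattern frequency and pattern length can be illustrated with a crude sketch that treats contiguous n-grams as stand-in "patterns" (the paper's pattern-grammar extraction is more sophisticated); the sample sentence is invented:

```python
from collections import Counter

def extract_patterns(tokens, max_len=3):
    """Count every contiguous n-gram of length 1..max_len, a crude
    stand-in for grammar patterns."""
    patterns = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            patterns[tuple(tokens[i:i + n])] += 1
    return patterns

tokens = "the cat sat on the mat the cat ran".split()
pats = extract_patterns(tokens)

# Mean frequency per pattern type drops as pattern length grows.
mean_freq = {n: sum(c for p, c in pats.items() if len(p) == n)
                / sum(1 for p in pats if len(p) == n)
             for n in (1, 2, 3)}
```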
Pub Date: 2020-05-24  DOI: 10.1080/09296174.2020.1766335
Yue Jiang, Ruimin Ma
ABSTRACT The Menzerath–Altmann Law (MAL) is regarded as one of the fundamental laws of language owing to its extensive validity across different languages at various linguistic levels and its applicability to register differentiation. However, whether MAL holds true for translational language remains to be answered. Translational language, distinct from both the source language and the original (non-translated) target language, is viewed as ‘the third code’. This study examines the validity of MAL for translated English literary texts and comparable original texts by exploring the relationship between sentence length (in number of clauses) and clause length (in number of words). Results of the study corroborate that MAL holds true for both original and translated texts. In addition, both a and b, the fitting parameters of the MAL formula, can differentiate translational language from the original, thus justifying the uniqueness of translational language as ‘the third code’ in its own right. This finding suggests that the fitting parameters might be viable indicators for typological differentiation in translation studies. Further, exploring the dynamic relations between a language construct and its constituents may shed some light on the translating process.
Title: Does Menzerath–Altmann Law Hold True for Translational Language: Evidence from Translated English Literary Texts. Journal of Quantitative Linguistics, 29(1), 37–61.
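The MAL relationship between sentence length x (in clauses) and mean clause length y (in words) is often fitted in its two-parameter form y = a·x^b; the full formula adds an exponential factor exp(c·x), omitted here. A minimal least-squares sketch on log-transformed, synthetic data:

```python
import math

def fit_mal(xs, ys):
    """Least-squares fit of y = a * x**b on log-log scale:
    log y = log a + b * log x."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
         / sum((x - mx) ** 2 for x in lx))
    a = math.exp(my - b * mx)
    return a, b

# Synthetic MAL-shaped data: the longer the sentence (in clauses),
# the shorter its clauses (in words).
sentence_len = [1, 2, 3, 4, 5]
clause_len = [10 * x ** -0.3 for x in sentence_len]
a, b = fit_mal(sentence_len, clause_len)
```

On real corpus data the fitted a and b are what the study compares between translated and original texts.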
Pub Date: 2020-05-18  DOI: 10.1080/09296174.2020.1737483
Xinlei Jiang, Yue Jiang, C. Hoi
ABSTRACT Queen's English (QE), a linguistic symbol of the royal or upper class, is a particular, aristocratic variety of English. However, QE has been 'dethroned' by the surprising finding that it shifted phonologically towards common people's English (CE) between the 1950s and 1980s, arousing a debate about its continued existence. Based upon the Queen's Christmas Messages (1952-2018) and the BNC, this study quantitatively investigated whether QE has experienced diachronic changes and drifted towards CE. Our PCA analysis shows QE's fluctuating lexical richness, increasing lexical complexity and synthetism, and steady syntactic features over the six decades. Piecewise regression and statistical results indicate that 1) QE drifted towards CE in lexical richness and complexity between the 1950s and 1980s; 2) QE's syntactic features exhibit an interaction between a "drifting force" towards CE and a "deviating force" away from it between the 1950s and 1980s; 3) QE maintains a synthetic form distinct from the analytic form of CE over the 66 years. These phenomena are likely related to the changing social structure between the 1950s and 1980s, identity building in the Queen's early reign, and the age factor. This study is the first to quantify the drift of QE towards CE lexically and syntactically, which may shed some light on the quantitative investigation of diachronic language change.
Title: Is Queen’s English Drifting Towards Common People’s English? Quantifying Diachronic Changes of Queen’s Christmas Messages (1952–2018) with Reference to BNC. Journal of Quantitative Linguistics, 29(1), 1–36.
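Piecewise regression of the kind used here can be sketched as a brute-force one-breakpoint fit: try each candidate break, fit a straight line on each side by ordinary least squares, and keep the split with the lowest total squared error. The data below are invented, not the Christmas Messages measurements:

```python
def two_segment_breakpoint(xs, ys):
    """Brute-force one-breakpoint piecewise linear fit; returns the
    x-value where the trend changes."""
    def ols_sse(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxx = sum((xi - mx) ** 2 for xi in x)
        b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
             if sxx else 0.0)
        a = my - b * mx
        return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

    best_x, best_err = None, None
    for k in range(2, len(xs) - 1):          # keep >= 2 points per side
        err = ols_sse(xs[:k], ys[:k]) + ols_sse(xs[k:], ys[k:])
        if best_err is None or err < best_err:
            best_x, best_err = xs[k], err
    return best_x

# Trend rising up to x = 5, then falling: the detected break sits there.
xs = list(range(10))
ys = [x if x <= 5 else 10 - x for x in xs]
bp = two_segment_breakpoint(xs, ys)
```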
Pub Date: 2020-04-26  DOI: 10.1080/09296174.2020.1754611
Wenping Li, Jianwei Yan
ABSTRACT Ouyang and Jiang (2018) measured the second language proficiency of English as a foreign language (EFL) learners based on the probability distribution of dependency distance. However, the typological features of the native language (Chinese) and the target language (English) they adopted are generally considered similar in word order and dependency direction. In addition, their method of classifying the learners’ proficiency levels is based on the learners’ grades, which might weaken the validity of the results. These results are strengthened and verified further in the current research by analysing a treebank of Japanese EFL learners’ interlanguage since their native language and the target language are typologically distinctive. Moreover, the TOEIC score was used as a benchmark to classify the second language proficiency levels of the learners. We found that (1) the mean dependency distance can measure the syntactic complexity of Japanese EFL learners’ interlanguage; (2) constrained by human working memory, the probability distribution of dependency distance based on Japanese EFL learners’ interlanguage follows certain distribution patterns as unveiled in other natural human languages; (3) the parameters of the right truncated modified Zipf-Alekseev distribution can well reflect the changes of the Japanese EFL learners’ second language proficiency, indicating the development of interlanguage.
Title: Probability Distribution of Dependency Distance Based on a Treebank of Japanese EFL Learners’ Interlanguage. Journal of Quantitative Linguistics, 28(1), 172–186.
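Mean dependency distance is simply the average absolute distance between each word's position and that of its syntactic head. A minimal sketch with an invented four-word sentence:

```python
def mean_dependency_distance(heads):
    """Mean absolute distance between each word and its head.
    `heads[i]` is the 1-based position of word i+1's head, with 0
    marking the root, which is excluded by convention."""
    dists = [abs(i - h) for i, h in enumerate(heads, start=1) if h != 0]
    return sum(dists) / len(dists)

# Invented sentence "The cat chased mice":
# the -> cat, cat -> chased, chased = root, mice -> chased
mdd = mean_dependency_distance([2, 3, 0, 3])
```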
Pub Date: 2020-04-02  DOI: 10.1080/09296174.2018.1559460
O. Gorina, Natalya S. Tsarakova, Sergey K. Tsarakov
ABSTRACT This paper explores word-frequency patterns in relation to text length, authorship, and random distortion of texts. Through a series of experiments, we determined an optimal text size, a phenomenon predicted by George Zipf, at which the discrepancy between calculated and observed frequencies is minimal. A graphic representation suggested a plausible explanation for the existence of this phenomenon. Working on the assumption that distorted texts might disobey Zipf’s Law, we explored correlations between word frequencies in complete texts and in their distorted counterparts. Results reveal the crucial role of text length in maintaining a Zipfian distribution: randomly chosen sets of words and fragmentary texts of optimal size still obey Zipf’s Law. Findings show that authorship manifests itself through the author constant, defined as the relative frequency of the most frequent words, which remains constant throughout the works of any given author, including randomly chosen text chunks and fragments of sentences of various sizes.
Title: Study of Optimal Text Size Phenomenon in Zipf–Mandelbrot’s Distribution on the Bases of Full and Distorted Texts. Author’s Frequency Characteristics and Derivation of Hapax Legomena. Journal of Quantitative Linguistics, 27(1), 134–158.
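The discrepancy between observed and Zipf-predicted frequencies, the quantity that reaches its minimum at the "optimal" text size, can be sketched with the classical prediction f(r) = f(1)/r (the paper works with the fuller Zipf–Mandelbrot form); the token counts are invented:

```python
from collections import Counter

def zipf_discrepancy(tokens):
    """Mean relative gap between observed rank-frequencies and the
    classical Zipf prediction f(r) = f(1) / r."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    f1 = freqs[0]
    return (sum(abs(f - f1 / r) / f for r, f in enumerate(freqs, start=1))
            / len(freqs))

# Invented counts 4, 2, 1: ranks 1 and 2 fit Zipf exactly, rank 3 misses
# by a third, so the mean relative discrepancy is 1/9.
tokens = ["the"] * 4 + ["of"] * 2 + ["and"]
disc = zipf_discrepancy(tokens)
```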
Pub Date: 2020-04-02  DOI: 10.1080/09296174.2018.1560122
Andrew Wilson, Rosie Harvey
ABSTRACT Previous work has used Greenberg’s synthetism index to compare three of the Celtic languages – Irish, Welsh, and Breton – but not the other three languages, namely Scottish Gaelic, Manx, and Cornish. This paper extends this earlier work by comparing all six Celtic languages, including two periods of Irish (Early Modern and Present Day). The analysis is based on a random sample of 210 parallel psalm texts (30 for each language). However, Greenberg’s synthetism index is problematic because there are no operational standards for counting morphemes within words. We therefore apply a newer typological indicator (B7), which is based solely on lexical rank-frequency statistics. We also explore whether type-token counts alone can provide similar information. The B7 indicator shows that both varieties of Irish, together with Welsh and Cornish, tend more towards synthetism, whereas Manx tends more towards analytism. Breton and Scottish Gaelic do not show a clear tendency in either direction. Rankings using type-token statistics vary considerably and do not tell the same story.
Title: Using Rank-Frequency and Type-Token Statistics to Compare Morphological Typology in the Celtic Languages. Journal of Quantitative Linguistics, 27(1), 159–186.
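The length sensitivity that makes raw type-token counts unreliable for this kind of comparison is easy to demonstrate. This sketch uses the plain type-token ratio on an invented Welsh-like fragment; the B7 indicator itself is not reproduced here:

```python
def type_token_ratio(tokens):
    """Plain type-token ratio (types / tokens); it shrinks as a text
    grows, one reason rank-frequency indicators are preferred for
    cross-language comparison."""
    return len(set(tokens)) / len(tokens)

short = "yn y dechreuad creodd duw".split()
longer = short + "y nefoedd a r ddaear".split()
```

Even though the longer sample merely continues the same text, its ratio is lower, so a raw TTR comparison across translations of unequal length would mislead.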
Pub Date: 2020-03-11  DOI: 10.1080/09296174.2020.1737488
Huimyung Kang, Jiajin Xu
ABSTRACT Previous works have identified multiple factors, and their interplay, that condition the positioning of concessive adverbial clauses. This study continues this line of research by 1) focusing exclusively on the positioning of although-led concessive adverbial clauses (although-clauses hereafter) among different concessive clause relations; 2) supplementing the factor set with more linguistic features, such as sentence-initial adverbials and hedging terms; and 3) extending and generalizing the scope of competition among semantic, discoursal and processing motivators to a higher-level competition between ‘clarity’ and ‘processability’. Data were retrieved from 1,738 concessive sentences of student argumentative essays from the BAWE and NESSIE corpora. Models were generated based on binary logistic regression and random forests. The results show that the motivator of the relationship between the although-clauses and their main clauses was the most significant variable in all models, denoting its priority in conditioning concessive clause positioning under the Competition Model framework. Subordinate clause complexity and deranking (i.e. clauses that do not have a full verb) were the least significant among all motivating factors. Overall, clarity-related motivators outweigh processability-related ones, prioritizing the clear conveyance of meaning over ease of processing.
Title: A Multifactorial Analysis of Concessive Clause Positioning. Journal of Quantitative Linguistics, 28(1), 356–380.
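A binary logistic regression of the kind used to model clause position can be sketched with per-sample gradient descent on the log-loss. The single feature and data below are invented for illustration and are not the paper's factor set:

```python
import math

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Binary logistic regression via per-sample gradient descent;
    a stand-in for the clause-position model."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi                     # gradient of the log-loss
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

# Toy feature: subordinate-clause length; long clauses placed finally (1).
X = [[1], [2], [3], [7], [8], [9]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(X, y)

def predict(x):
    return 1.0 / (1.0 + math.exp(-(w[0] * x + b))) > 0.5
```

With many binary and categorical predictors, the fitted coefficients play the role of the "motivators" whose relative significance the study compares.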
Pub Date: 2020-03-03  DOI: 10.1080/09296174.2020.1732765
Gotzon Aurrekoetxea, Aitor Iglesias, E. Clua, I. Usobiaga, M. Salicrú
ABSTRACT Comparing dialectal classifications into disjoint zones with the representation of populations in a geolectal continuum has emphasized the importance of transition regions. Identifying these regions has long been a subject of study in the scientific literature, although it has not been carried out in a fully reliable manner. Based on the Basque ‘Bourciez’ Corpus, we highlight the limitations of deterministic methods of dialectal classification along with the possibilities offered by fuzzy logic. By adding objectivity to the analysis, the C-means classification has allowed us to retain information from the deterministic classification, identify transition regions, emphasize the geolectal continuum and minimize the artificial isolation of certain populations in the classification. Classifying the French-Basque territory into two groups separated the populations into two nearly disjoint dialectal zones. Classifications into three and four groups underscored the broad overlap between adjacent linguistic zones. This paper thus provides a new explanatory dimension and consequently improves the linguistic interpretation.
Title: Analysis of Transitional Areas in Dialectology: Approach with Fuzzy Logic. Journal of Quantitative Linguistics, 28(1), 337–355.
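Fuzzy C-means differs from hard clustering in assigning every site a membership degree in each cluster, so transition areas surface as mixed memberships rather than forced labels. A minimal one-dimensional sketch with invented data (the paper clusters multivariate dialect features):

```python
def fuzzy_cmeans(points, c=2, m=2.0, iters=100):
    """Minimal 1-D fuzzy C-means with deterministic init for c=2:
    every point receives a membership degree in each cluster."""
    centers = [min(points), max(points)]
    U = []
    for _ in range(iters):
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        U = []
        for p in points:
            d = [abs(p - ct) or 1e-12 for ct in centers]
            U.append([1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0))
                                for k in range(c)) for j in range(c)])
        # Centre update: membership-weighted mean with weights u^m.
        centers = [sum(U[i][j] ** m * points[i] for i in range(len(points)))
                   / sum(U[i][j] ** m for i in range(len(points)))
                   for j in range(c)]
    return centers, U

pts = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1, 0.55]   # 0.55 sits between clusters
centers, U = fuzzy_cmeans(pts)
```

The point at 0.55 ends up with near-equal memberships in both clusters, which is exactly how a transitional population would be flagged.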
Pub Date: 2020-03-01  DOI: 10.1080/09296174.2020.1732177
José Ramom Pichel, Pablo Gamallo, I. Alegria, Marco Neves
ABSTRACT The aim of this paper is to apply a corpus-based methodology, built on the measure of perplexity, to automatically calculate the cross-lingual language distance between historical periods of three languages. The three historical corpora have been constructed and collected with the closest spelling to the original, on a balanced basis of fiction and non-fiction. This methodology has been applied to measure the historical distance of Galician with respect to Portuguese and Spanish, from the Middle Ages to the end of the 20th century, both in original spelling and in automatically transcribed spelling. The quantitative results are contrasted with hypotheses drawn from experts in historical linguistics. Results show that Galician and Portuguese were varieties of the same language in the Middle Ages and that Galician has converged and diverged with respect to Portuguese and Spanish since the late 19th century. In this process, orthography plays a relevant role. It should be pointed out that the method is unsupervised and can be applied to other languages.
Title: A Methodology to Measure the Diachronic Language Distance between Three Languages Based on Perplexity. Journal of Quantitative Linguistics, 28(1), 306–336.
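Perplexity-based distance trains a language model on one corpus and measures how "surprised" the model is by another: lower perplexity means distributionally closer varieties. A deliberately tiny sketch with a Laplace-smoothed character unigram model and invented sentences (the paper uses character n-gram models over full historical corpora):

```python
import math
from collections import Counter

def unigram_perplexity(train, test):
    """Perplexity of a Laplace-smoothed character unigram model trained
    on `train` and evaluated on `test`; lower = closer."""
    counts = Counter(train)
    vocab = set(train) | set(test)
    n = len(train)
    log_prob = sum(math.log((counts[ch] + 1) / (n + len(vocab)))
                   for ch in test)
    return math.exp(-log_prob / len(test))

# Invented sample sentences, accents stripped for simplicity.
galician   = "a lingua galega e proxima ao portugues"
portuguese = "a lingua portuguesa e proxima ao galego"
spanish    = "el idioma espanol es distinto en su forma"
d_close = unigram_perplexity(galician, portuguese)
d_far   = unigram_perplexity(galician, spanish)
```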