
Journal of Quantitative Linguistics: Latest Articles

Authorship Attribution via Coupon-Collector-Type Indices
IF 1.4 | Zone 2, Literature | LANGUAGE & LINGUISTICS | Pub Date: 2020-10-01 | DOI: 10.1080/09296174.2019.1577939
Lukun Zheng, Huiqiang Zheng
ABSTRACT Authorship attribution is the process of determining the author of a text in question by capturing an author’s writing style based on selected stylistic features. In this paper, we propose a new methodology for authorship attribution based on a profile of indices related to the generalized coupon collector problem, called coupon-collector-type indices. The coupon collector problem and its generalizations are of traditional and recurrent interest. Coupons are drawn one at a time from a population containing n distinct types of coupons. The process continues until a complete set of n distinct coupons is obtained, and the total number of draws is recorded. We base our methodology on function words. We establish a testing procedure by constructing a confidence band of the coupon-collector-type indices using an empirical bootstrap technique. We validate our proposed methodology using several writing samples whose authorship is known. We then apply this methodology to explore the question of who wrote the fifteenth Oz book, whose authorship is disputed between Lyman Frank Baum (1856–1919) and his successor on the Oz series, Ruth Plumly Thompson (1891–1976).
Journal of Quantitative Linguistics, vol. 27, pp. 321–333.
Citations: 8
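The drawing process described in the abstract above can be illustrated with the classical coupon collector problem, in which all n coupon types are equally likely. The sketch below is not the authors' coupon-collector-type indices, which the abstract does not define; it only shows the underlying process. The function names `expected_draws` and `simulate_draws` are hypothetical, and the uniform-probability assumption is ours.

```python
import random
from fractions import Fraction


def expected_draws(n: int) -> Fraction:
    """Exact expected number of draws to see all n equally likely types: n * H_n."""
    return n * sum(Fraction(1, k) for k in range(1, n + 1))


def simulate_draws(n: int, rng: random.Random) -> int:
    """Draw one coupon at a time, uniformly at random, until all n types are seen."""
    seen = set()
    draws = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))  # assumption: every type is equally likely
        draws += 1
    return draws
```

Averaging `simulate_draws(n, rng)` over many runs should approach `float(expected_draws(n))`. With function words, the types would have unequal frequencies, which is where the generalized problem mentioned in the abstract comes in.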
Probability Distribution of Represented Sources in Conversations of Adults and Children
IF 1.4 | Zone 2, Literature | LANGUAGE & LINGUISTICS | Pub Date: 2020-10-01 | DOI: 10.1080/09296174.2019.1580812
Wei Guan
ABSTRACT Based upon a taxonomy of represented sources, this paper investigates the quantitative features of two groups – one comprised of adults and the other of seven-year-old children – that employ a variety of sources in everyday representation. Results indicate that: 1) the overall probability distributions of represented sources used by both groups fit the modified right-truncated Zipf-Alekseev distribution well; 2) the R² value of the adult group is lower than that of the child group, largely due to the prevalence of non-present non-specified human references by the adults; 3) the values of fitting parameters a and n differ significantly between the two groups; 4) while representing from sources is an extralinguistic phenomenon, it nevertheless reflects the similar quality of language (i.e. a human-driven complex adaptive system), and it also offers a better understanding of the quantitative features of this phenomenon. In summary, this study presents the results of several preliminary attempts to study a specific type of extralinguistic phenomenon from a quantitative perspective.
Journal of Quantitative Linguistics, vol. 27, pp. 334–360.
Citations: 0
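For readers unfamiliar with the distribution named in the abstract above, the following is a minimal sketch of the modified right-truncated Zipf-Alekseev probability mass function in the form it is commonly given in the quantitative-linguistics literature: the first rank receives a separate probability α, and ranks 2 to n follow x^(-(a + b ln x)), normalized. The abstract does not state the parametrization used in the paper, so this form, the parameter b, and the function name `zipf_alekseev_pmf` are assumptions.

```python
import math


def zipf_alekseev_pmf(a: float, b: float, n: int, alpha: float) -> list:
    """Return [P_1, ..., P_n] for the modified right-truncated Zipf-Alekseev form.

    Assumed form: P_1 = alpha; P_x = (1 - alpha) * x**-(a + b*ln x) / T for
    x = 2..n, where T is the sum of the unnormalized weights. Requires n >= 2.
    """
    weights = [x ** -(a + b * math.log(x)) for x in range(2, n + 1)]
    total = sum(weights)
    return [alpha] + [(1 - alpha) * w / total for w in weights]
```

Fitting would then mean choosing a, b, and α (with n fixed by the number of observed ranks) so that the resulting probabilities match the observed relative frequencies, for example by minimizing squared error.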
Linguistic Accommodation in Teenagers’ Social Media Writing: Convergence Patterns in Mixed-gender Conversations
IF 1.4 | Zone 2, Literature | LANGUAGE & LINGUISTICS | Pub Date: 2020-09-06 | DOI: 10.1080/09296174.2020.1807853
Lisa Hilte, R. Vandekerckhove, Walter Daelemans
ABSTRACT The present study analyzes the phenomenon of linguistic accommodation, i.e. the adaptation of one’s language use to that of one’s conversation partner. In a large corpus of private social media messages, we compare Flemish teenagers’ writing in two conversational settings: same-gender (including only boys or only girls) and mixed-gender conversations (including at least one girl and one boy). We examine whether boys adopt a more ‘female’ and girls a more ‘male’ writing style in mixed-gender talks, i.e. whether teenagers converge towards their conversation partner with respect to gendered writing. The analyses focus on two sets of prototypical markers of informal online writing, for which a clear gender divide has been attested in previous research: expressive typographic markers (e.g., emoticons), which can be considered more ‘female’ features, and ‘oral’, speech-like markers (e.g., regional language features), which are generally more popular among boys. Using generalized linear-mixed models, we examine the frequency of these features in boys’ and girls’ writing in same- versus mixed-gender conversations. Patterns of convergence emerge from the data: they reveal that girls and boys adopt a more similar style in mixed-gender talks. Strikingly, the convergence is asymmetrical and only significant for a particular group of online language features.
Journal of Quantitative Linguistics, vol. 29, pp. 241–268.
Citations: 14
Frequency, Dispersion and Abstractness in the Lexical Sophistication Analysis of A Learner-Based Word Bank: Dimensionality Reduction and Identification
IF 1.4 | Zone 2, Literature | LANGUAGE & LINGUISTICS | Pub Date: 2020-07-09 | DOI: 10.1080/09296174.2020.1782716
H. Zhang, Yuting Han, Xingzi Zhang, Liuran Cui
ABSTRACT The current study incorporated a number of lexical sophistication indices including frequency, dispersion and abstractness of words. A learner-based word bank (inclusive of a Chinese middle-school vocabulary list, a Chinese high-school vocabulary list and a Chinese college-English-test vocabulary list) was manually coded based on two existing corpora: Corpus of Contemporary American English (COCA) and British National Corpus (BNC). Indices of frequency, dispersion and abstractness of the word bank were analysed to shed light on the predetermined categorization of lexical sophistication among second language learners. Based on the principal component analysis, the results demonstrated that dispersion was a unique factor loaded on all entered eight variables while word frequency and abstractness were extracted by the same factor in the learner-based word bank. Moreover, a follow-up MANOVA analysis with post hoc comparisons showed that lexical sophistication indices in general produced pronounced differences among the three levels of word lists. More critically, dispersion was found to be the only significant indicator to differentiate the three levels of word lists. Discussion centred on the uniqueness of dispersion in lexical sophistication and the shared algorithm in frequency and abstractness.
Journal of Quantitative Linguistics, vol. 29, pp. 195–211.
Citations: 1
Predictive Modelling of Type Valency in Word Formation Grammar
IF 1.4 | Zone 2, Literature | LANGUAGE & LINGUISTICS | Pub Date: 2020-07-03 | DOI: 10.1080/09296174.2020.1782720
Kateryna Krykoniuk
ABSTRACT This paper explores different regression models for predicting the type valency of Persian suffixes within a usage-based approach. Usage-based models treat the type frequency of a suffix as a key predictor for its type valency revealing that an increase in the type frequency leads to a greater combining power between a construction’s paradigmatic elements. However, this effect is limited to a certain degree by the potential productivity of a suffix, as inferred from the statistically distinguishable negative correlation between the type valency and the potential productivity, as well as from the statistical significance of the variable of the number of hapaxes and the potential productivity in the regression models of conditional inference trees. Moreover, polyvalency as a distinct feature of Persian derivation implies a number of other characteristics, namely greater morphological diversity of patterns, parsability, semantic transparency and larger conversion power of morphemes. This is contrasted with English whose morphemes are predominantly type-monovalent.
Journal of Quantitative Linguistics, vol. 29, pp. 212–240.
Citations: 0
On Stylometric Features of H. Beam Piper’s Omnilingual
IF 1.4 | Zone 2, Literature | LANGUAGE & LINGUISTICS | Pub Date: 2020-07-02 | DOI: 10.1080/09296174.2018.1560698
Tomi S. Melka, Michal Místecký
ABSTRACT The article will focus on H. Beam Piper’s classical story Omnilingual (1957). This Piper-esque writing has entered the records of the science fiction prose for the ‘Martian’ periodic table of elements, being synonymous with a scientific ‘Rosetta-like stone’ in the decipherment area. The work, while having a search potential in text analysis and stylistics, may add in a parallel fashion some lustre to the validity of science as a communicative channel in non-conventional circumstances. In order to capture stylistic features of the novelette, a number of quantitative indicators are drawn in. The study will concentrate on vocabulary-richness indexes (TTR, entropy, RR, RRMc, G, ATL, HL, MATTR, and Lambda), a complex assessment of activity (Busemann’s coefficient, the chi-square testing classification), and a sketch of the Belza chain analysis. The goal of the article is to find distinctive features of the piece in question, and point out ways for further research.
Journal of Quantitative Linguistics, vol. 27, pp. 204–243.
Citations: 9
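Two of the vocabulary-richness indexes listed in the abstract above, TTR (type-token ratio) and MATTR (moving-average type-token ratio), are simple enough to sketch directly. The code below uses their standard definitions; tokenization, the window length, and the function names `ttr` and `mattr` are assumptions, and the other indexes in the list (RR, RRMc, G, ATL, HL, Lambda) are not covered.

```python
def ttr(tokens: list) -> float:
    """Type-token ratio: number of distinct tokens divided by number of tokens."""
    return len(set(tokens)) / len(tokens)


def mattr(tokens: list, window: int) -> float:
    """Moving-average TTR: mean TTR over every window of `window` consecutive tokens."""
    if window >= len(tokens):
        return ttr(tokens)  # text shorter than the window: fall back to plain TTR
    scores = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(scores) / len(scores)
```

Plain TTR falls as a text gets longer, because frequent words keep recurring; MATTR is less sensitive to text length since every window has the same size.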
Comparing Lexical Bundles across Corpora of Different Sizes: The Zipfian Problem
IF 1.4 | Zone 2, Literature | LANGUAGE & LINGUISTICS | Pub Date: 2020-07-02 | DOI: 10.1080/09296174.2019.1566975
Yves Bestgen
ABSTRACT Formulaic sequences in language use are often studied by means of the automatic identification of frequently recurring series of words, often referred to as ‘lexical bundles’, in corpora that contrast different registers, academic disciplines, etc. As corpora often differ in size, a critically important assumption in this field states that the use of a normalized frequency threshold, such as 20 occurrences per million words, allows for an accurate comparison of corpora of different sizes. Yet, several researchers have argued that normalization may be unreliable when applied to frequency thresholds. The study investigates this issue by comparing the number of lexical bundles identified in corpora that differ only in size. Using two complementary random sampling procedures, subcorpora of 100,000 to two million words were extracted from five corpora, with lexical bundles identified in them using two normalized frequency thresholds and two dispersion thresholds. The results show that many more lexical bundles are identified in smaller subcorpora than in larger ones. This size effect can be related to the Zipfian nature of the distribution of words and word sequences in corpora. The conclusion discusses several solutions to avoid the unfairness of comparing lexical bundles identified in corpora of different sizes.
Journal of Quantitative Linguistics, vol. 27, pp. 272–290.
Citations: 14
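The normalized threshold discussed in the abstract above translates into very different raw counts depending on corpus size: 20 occurrences per million words corresponds to 2 occurrences in a 100,000-word subcorpus and 40 in a two-million-word one. The sketch below makes that conversion explicit and applies it to n-gram counts. Rounding up with a floor of 1 is our assumption, not a convention taken from the paper, and the names `raw_threshold` and `lexical_bundles` are hypothetical; dispersion thresholds are not modelled.

```python
import math
from collections import Counter


def raw_threshold(per_million: float, corpus_size: int) -> int:
    """Convert a per-million-words threshold into a raw count for this corpus size."""
    # Assumption: round up, and never go below a single occurrence.
    return max(1, math.ceil(per_million * corpus_size / 1_000_000))


def lexical_bundles(tokens: list, n: int, per_million: float) -> dict:
    """Return the n-grams whose raw frequency meets the normalized threshold."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cutoff = raw_threshold(per_million, len(tokens))
    return {gram: c for gram, c in counts.items() if c >= cutoff}
```

A cutoff of 2 in a small subcorpus is easy for a chance repetition to reach, whereas 40 in a large one is not, which is one way to see the size effect the abstract reports.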
Dependency Distances and Their Frequencies in Indo-European Language
IF 1.4 | Zone 2, Literature | LANGUAGE & LINGUISTICS | Pub Date: 2020-06-18 | DOI: 10.1080/09296174.2020.1771135
Xinying Chen, Kim Gerdes
ABSTRACT The present study investigates the relationship between two features of dependencies, namely, dependency distances and dependency frequencies. The study is based on the analysis of a parallel dependency treebank that includes 10 Indo-European languages. Two corresponding random dependency treebanks are generated as baselines for comparison. After computing the values of dependency distances and their frequencies in these treebanks, for each language, we fit four functions, namely quadratic, exponential, logarithmic, and power-law functions, to its original and random datasets. The preliminary result shows that there is a relation between the two dependency features for all 10 Indo-European languages. The relation can be further formalized as a power-law function which can distinguish the observed data from randomly generated datasets.
Journal of Quantitative Linguistics, vol. 29, pp. 106–125.
Citations: 1
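The two quantities in the abstract above can be sketched as follows: dependency distance is taken here as the absolute difference between the linear positions of a word and its head, and the power-law fit is done by ordinary least squares on log-transformed values. The abstract does not say how the authors measured distance or fitted their functions, so both choices are assumptions, as are the names `dependency_distances` and `fit_power_law`.

```python
import math
from collections import Counter


def dependency_distances(heads: list) -> list:
    """heads[i] is the 1-based position of the head of word i+1; 0 marks the root."""
    return [abs(h - (i + 1)) for i, h in enumerate(heads) if h != 0]


def fit_power_law(xs: list, ys: list) -> tuple:
    """Least-squares fit of log y = log c + k * log x; returns (c, k) for y = c * x**k."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    k = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    return math.exp(my - k * mx), k


# "The dog chased the cat": The->dog, dog->chased, chased=root, the->cat, cat->chased
distance_counts = Counter(dependency_distances([2, 3, 0, 5, 3]))
```

Passing the sorted distances and their frequencies from `distance_counts` (pooled over a whole treebank rather than one sentence) to `fit_power_law` gives the exponent k, which would be negative when short distances dominate.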
The Effect of Translation on Text Coherence: A Quantitative Study
IF 1.4 | Zone 2, Literature | LANGUAGE & LINGUISTICS | Pub Date: 2020-06-16 | DOI: 10.1080/09296174.2020.1774297
Elham Najafi, Alireza Valizadeh, A. Darooneh
ABSTRACT Investigating the coherence of translated texts is an important issue in multilingual studies. In this paper, we aim to study text coherence in human-translated texts and its relation to the text properties by a quantitative approach. For this purpose, we assigned a word importance value to each word-type of a text and constructed the text ‘importance time series’ from the original and translated texts. Then, we calculated text global coherence by applying Detrended Fluctuation Analysis (DFA) to these time series. By means of this procedure, we were able to compare the coherence of the original and translated texts. Our results show that a translation does not always decrease text coherence, as many people may suppose; there are many cases where text coherence is increased by translation. We also studied the relation between text coherence and text properties such as text size or vocabulary size; we observed no such relation. Our findings suggest that the coherence of a text depends on the translator’s abilities rather than the state of being original or translated.
Journal of Quantitative Linguistics, vol. 29, pp. 151–164.
Citations: 2
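The abstract of "The Effect of Translation on Text Coherence" above applies Detrended Fluctuation Analysis (DFA) to an importance time series. As a rough illustration of what DFA computes, the sketch below estimates the DFA scaling exponent of a numeric series with NumPy. It makes several assumptions that do not come from the abstract: the function name `dfa_exponent`, the choice of window sizes, and first-order (linear) detrending are all illustrative, and the input here is synthetic noise rather than real word importance values. The abstract does not say how the importance values are defined or how the exponent is turned into a coherence score, so neither step is shown.

```python
import numpy as np


def dfa_exponent(series, scales=None, order=1):
    """Estimate the DFA scaling exponent of a 1-D series (illustrative sketch)."""
    x = np.asarray(series, dtype=float)
    n = x.size

    # Step 1: integrate the mean-centred series to obtain the "profile".
    profile = np.cumsum(x - x.mean())

    if scales is None:
        # Assumed choice: about a dozen log-spaced window sizes from 8 to n // 4.
        scales = np.unique(
            np.logspace(np.log10(8), np.log10(n // 4), 12).astype(int)
        )

    fluctuations = []
    for s in scales:
        n_win = n // s
        # Step 2: cut the profile into non-overlapping windows of length s.
        windows = profile[: n_win * s].reshape(n_win, s).T  # shape (s, n_win)
        t = np.arange(s)
        # Step 3: fit a polynomial trend in every window and subtract it.
        coeffs = np.polyfit(t, windows, order)        # shape (order + 1, n_win)
        trend = np.vander(t, order + 1) @ coeffs      # shape (s, n_win)
        residual = windows - trend
        # Root-mean-square fluctuation at this window size.
        fluctuations.append(np.sqrt(np.mean(residual ** 2)))

    # Step 4: the exponent is the slope of log F(s) against log s.
    slope, _intercept = np.polyfit(np.log(scales), np.log(fluctuations), 1)
    return slope


if __name__ == "__main__":
    # Synthetic stand-in for an importance time series (not real text data).
    rng = np.random.default_rng(0)
    noise = rng.standard_normal(8192)
    print("uncorrelated noise:", dfa_exponent(noise))
    print("random walk:", dfa_exponent(np.cumsum(noise)))
```

For uncorrelated noise the exponent should come out near 0.5, and for a random walk near 1.5; values between these are usually read as a sign of long-range correlation in the series. To use the sketch on a text, one would replace the synthetic series with one importance value per token, in reading order.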
Quantifying Perceived Political Bias of Newspapers through a Document Classification Technique
IF 1.4 Zone 2 Literature 0 LANGUAGE & LINGUISTICS Pub Date: 2020-06-16 DOI: 10.1080/09296174.2020.1771136
Hyungsuc Kang, Janghoon Yang
ABSTRACT Even though a certain degree of political bias is unavoidable in the media, strong media bias is likely to have an impact on society, especially on the formation of public opinion. This research proposes a data-driven method for quantifying the political bias of media content. Using a document classification technique called doc2vec and social data from Facebook posts, a model for analysing the bias is developed. By applying the model to the content of major South Korean newspapers, this paper demonstrates quantitatively that significant political bias exists in the newspapers, in line with the perceived political bias.
Hyungsuc Kang, Janghoon Yang. "Quantifying Perceived Political Bias of Newspapers through a Document Classification Technique." Journal of Quantitative Linguistics 29(1), pp. 127-150. Published 2020-06-16. https://doi.org/10.1080/09296174.2020.1771136
Citations: 8