The meaning distributions of certain linguistic forms generally follow a Zipfian distribution. However, since meanings can be observed and classified at different levels of granularity, it is worth asking whether their distributions at different levels can be fitted by the same model and whether the parameters are the same. In this study, we investigate three quasi-prepositions in Shanghainese, a dialect of Wu Chinese, and test whether the meaning distributions at two levels of granularity can be fitted by the same model and whether the parameters are close. The results first show that the three models proposed by modern quantitative linguists all achieve a good fit in all cases, while the exponential (EXP) model and the right-truncated negative binomial (RTBN) model perform better than the modified right-truncated Zipf-Alekseev distribution (MRTZA) in terms of consistency of the goodness of fit, parameter change, rationality, and simplicity. Second, the parameters of the distributions at the two levels, and the corresponding curves, are neither identical nor even close to each other. This supports a weak view of the concept of ‘scaling’ in the complexity sciences. Finally, differences are found between the distributions at the two levels: the fine-grained meaning distributions are more right-skewed and more non-linear. This is attributed to the openness of the systems’ categories: finer semantic differentiation behaves like a system with an open set of categories, while the coarse-grained meaning distribution resembles systems with a closed set of few categories.
{"title":"The meaning distributions on different levels of granularity","authors":"T. Yih, Haitao Liu","doi":"10.53482/2023_54_405","DOIUrl":"https://doi.org/10.53482/2023_54_405","url":null,"abstract":"The meaning distributions of certain linguistic forms generally follow a Zipfian distribution. However, since the meanings can be observed and classified on different levels of granularity, it is thus interesting to ask whether their distributions on different levels can be fitted by the same model and whether the parameters are the same. In this study, we investigate three quasi-prepositions in Shanghainese, a dialect of Wu Chinese, and test whether the meaning distributions on two levels of granularity can be fitted by the same model and whether the parameters are close. The results first show that the three models proposed by modern quantitative linguists can both achieve a good fit for all cases, while both the exponential (EXP) model and the right-truncated negative binomial (RTBN) models behave better than the modified right-truncated Zipf-Alekseev distribution (MRTZA), in terms of the consistency of the goodness of fit, parameter change, rationality, and simplicity. Second, the parameters of the distributions on the two levels and the curves are not exactly the same or even close to each other. This has supported a weak view of the concept of ‘scaling’ in complex sciences. Finally, differences are found to lie between the distributions on the two levels. The fine-grained meaning distributions are more right-skewed and more non-linear. This is attributed to the openness of the categories of systems. 
The finer semantic differentiation behaves like systems with open set of categories, while the coarse-grained meaning distribution resembles those having a close set of few categories.","PeriodicalId":51918,"journal":{"name":"Glottometrics","volume":"93 1","pages":"13-38"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74544443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
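As a rough illustration of the kind of model fitting the abstract describes, the exponential (EXP) rank-frequency model f(r) = a·exp(−b·r) can be fitted by ordinary least squares on log frequencies, which is linear in the rank. This is only a minimal sketch on synthetic data; the paper's actual estimation procedure and the RTBN and MRTZA models are not reproduced here.

```python
import math

def fit_exponential(freqs):
    """Fit f(r) = a * exp(-b * r) to rank-frequency data by
    least squares on log f(r), which is linear in the rank r."""
    n = len(freqs)
    xs = range(1, n + 1)                 # ranks 1..n
    ys = [math.log(f) for f in freqs]    # log frequencies
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx                    # equals -b
    intercept = my - slope * mx          # equals ln a
    return math.exp(intercept), -slope   # (a, b)

# Synthetic, perfectly exponential data: f(r) = 100 * exp(-0.5 * r)
data = [100 * math.exp(-0.5 * r) for r in range(1, 11)]
a, b = fit_exponential(data)
```

On exactly exponential data the log-linear fit recovers the parameters; on real rank-frequency data the residuals would drive the goodness-of-fit comparison between models.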
Thematic concentration, a quantitative linguistic measure, can reflect the speech style of a particular person. It may, to some degree, reflect the strength of a speaker’s intention to communicate certain themes. There has been limited empirical research on the similarity between Trump and Putin with respect to their linguistic features. The present study therefore compares Putin’s and Trump’s stylometric features and political themes on the basis of thematic concentration, using a corpus of Putin’s, Medvedev’s, Trump’s, and Obama’s speeches. Results show that 1) the thematic concentration values of both Putin’s and Trump’s speeches differ significantly, or marginally significantly, from those of their predecessors; 2) both leaders pay great attention to the concept of nationalism; 3) the thematic words of their speeches differ slightly across periods, reflecting the influence of external factors on the choice of language. The results may shed light on stylometric studies of Putin and Trump.
{"title":"Fellow or foe? A quantitative thematic exploration into Putin's and Trump's stylometric features","authors":"Yaqin Wang, Ting Zeng","doi":"10.53482/2023_54_406","DOIUrl":"https://doi.org/10.53482/2023_54_406","url":null,"abstract":"Thematic concentration, a quantitative linguistic method, can reflect the speech style of a particular person. It may, to some degree, reflect the degree of a speaker’s intention to communicate certain themes. There has been limited empirical research on the similarity between Trump and Putin with respect to their linguistic features. Thus, the present study aims to compare Putin’s and Trump’s stylometric features and political themes based on thematic concentration with a corpus of Putin’s, Medvedev’s, Trump’s, and Obama’s speeches. Results show that 1) Both Putin’s and Trump’s speeches’ thematic concentration values are significantly or marginally significantly different from their precedents’. 2) Two leaders pay great attention to the concept of nationalism. 3) Thematic words of their speeches were slightly different across periods, reflecting the influence of external factors on the choice of language. The results of the present study may shed light on the stylometric studies of Putin and Trump.","PeriodicalId":51918,"journal":{"name":"Glottometrics","volume":"1 1","pages":"39-57"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86828865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
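Thematic concentration builds on the h-point of a rank-frequency list, the point where rank and frequency coincide; thematic words are sought among the content words above it. A minimal sketch of the h-point on a made-up toy text might look like this (the full thematic concentration formula, which weights autosemantic words above the h-point, is not reproduced):

```python
from collections import Counter

def h_point(ranked_freqs):
    """h-point of a descending rank-frequency list: here taken as the
    largest rank r with freq(r) >= r, i.e. where rank meets frequency."""
    h = 0
    for r, f in enumerate(ranked_freqs, start=1):
        if f >= r:
            h = r
        else:
            break
    return h

# Invented toy text, not from the study's corpus
text = "the war the war the nation of war and of the nation rises"
freqs = [f for _, f in Counter(text.split()).most_common()]
h = h_point(freqs)
```

Words ranked above the h-point (here the two most frequent) are the candidates from which thematic words are selected once function words are excluded.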
The article compares the performance of two term specificity measures, Cohen’s d and the Z-score, in analyzing political and media discourses on Russia’s war in Ukraine in four languages and five countries. In addition to being linguistically and stylistically heterogeneous, the 3,347 texts included in the corpus vary in length. The two measures display convergent validity, as confirmed by various performance metrics. It is argued that, beyond their usefulness for text mining and content analysis, the measures can be adapted to a broader range of tasks in information retrieval and the digital humanities.
{"title":"A comparison of two text specificity measures analyzing a heterogenous text corpus","authors":"A. Oleinik","doi":"10.53482/2023_54_404","DOIUrl":"https://doi.org/10.53482/2023_54_404","url":null,"abstract":"The article compares the performance of two term specificity measures, Cohen’s d and Z-score, when analyzing political and media discourses on Russia’s war in Ukraine in four languages and five countries. In addition to linguistic and stylistic heterogeneity, 3,347 texts included in the corpus have variable length. The two measures display convergent validity, as confirmed by various performance metrics. It is argued that the measures can be adapted to a broader range of tasks in information retrieval and digital humanities, in addition to their usefulness for text mining and content analysis.","PeriodicalId":51918,"journal":{"name":"Glottometrics","volume":"275 1","pages":"1-12"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78868378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
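A minimal sketch of how Cohen’s d can serve as a term specificity measure, assuming per-text relative frequencies of a term in a target versus a reference subcorpus; the figures below are invented and the article’s exact operationalization is not reproduced:

```python
import math

def cohens_d(a, b):
    """Cohen's d for two samples of per-text term rates:
    difference of means over the pooled standard deviation."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled

# Hypothetical per-1000-word rates of one term in two discourse samples
target = [5.0, 6.0, 7.0, 6.0]
reference = [1.0, 2.0, 1.0, 2.0]
d = cohens_d(target, reference)
```

Because d is standardized by within-group variance, it is less sensitive to the variable text lengths the article highlights than raw frequency differences would be.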
This article deals with the history of quantitative linguistics. Its focus is the journal SMIL – Statistical Methods in Linguistics, which was published by Hans Karlgren in Stockholm from 1962 to 1976 (with a short interruption between 1966 and 1969). SMIL is a representative example of the process of differentiation in quantitative linguistics during the 1970s and can be seen as an early major “Scandinavian” contribution to statistical and quantitative linguistics.
{"title":"The journal SMIL - Statistical Methods in Linguistics (1962-1976) - some notes about the history of quantitative linguistics in Scandinavia and beyond","authors":"E. Kelih","doi":"10.53482/2023_54_408","DOIUrl":"https://doi.org/10.53482/2023_54_408","url":null,"abstract":"This article deals with the history of quantitative linguistics. The focus of this paper is the journal SMIL – Statistical Methods in Linguistics, which was published by Hans Karlgren in Stockholm from 1962 to 1976 (with a short interruption between 1966 and 1969). SMIL is a representative example of the process of differentiation in quantitative linguistics during the seventies and can be seen as one early major “Scandinavian” contribution to statistical and quantitative linguistics.","PeriodicalId":51918,"journal":{"name":"Glottometrics","volume":"22 1","pages":"88-98"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78264957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-13. DOI: 10.48550/arXiv.2211.07005
Peng Liu, Tinghao Feng, Rui Liu
We introduce a graph polynomial that distinguishes tree structures to represent dependency grammar, and a measure based on the polynomial representation to quantify syntax similarity. The polynomial encodes accurate and comprehensive information about the dependency structure and dependency relations of the words in a sentence, which enables in-depth analysis of dependency trees with data-analysis tools. We apply the polynomial-based methods to analyze sentences in the Parallel Universal Dependencies treebanks. Specifically, we compare the syntax of sentences and their translations in different languages, and we perform a syntactic typology study of the languages available in the Parallel Universal Dependencies treebanks. We also demonstrate and discuss the potential of the methods for measuring the syntactic diversity of corpora.
{"title":"Quantifying syntax similarity with a polynomial representation of dependency trees","authors":"Peng Liu, Tinghao Feng, Rui Liu","doi":"10.48550/arXiv.2211.07005","DOIUrl":"https://doi.org/10.48550/arXiv.2211.07005","url":null,"abstract":"We introduce a graph polynomial that distinguishes tree structures to represent dependency grammar and a measure based on the polynomial representation to quantify syntax similarity. The polynomial encodes accurate and comprehensive information about the dependency structure and dependency relations of words in a sentence, which enables in-depth analysis of dependency trees with data analysis tools. We apply the polynomial-based methods to analyze sentences in the ParallelUniversal Dependencies treebanks. Specifically, we compare the syntax of sentences and their translations in different languages, and we perform a syntactic typology study of available languages in the Parallel Universal Dependencies treebanks. We also demonstrate and discuss the potential of the methods in measuring syntax diversity of corpora.","PeriodicalId":51918,"journal":{"name":"Glottometrics","volume":"34 1","pages":"59-79"},"PeriodicalIF":0.0,"publicationDate":"2022-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84534693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
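The paper’s polynomial itself is not reproduced here. As a simpler stand-in that likewise distinguishes unlabeled rooted tree structures, the classical AHU canonical form can be sketched as follows; unlike the paper’s polynomial, it ignores the dependency-relation labels:

```python
def canonical(tree):
    """AHU canonical form of a rooted tree given as nested lists:
    two trees get the same string iff they are isomorphic as
    unlabeled rooted trees. (A simplified stand-in for the paper's
    distinguishing polynomial; it drops dependency-relation labels.)"""
    return "(" + "".join(sorted(canonical(c) for c in tree)) + ")"

# Same shape with children listed in different orders
t1 = [[[], []], []]
t2 = [[], [[], []]]
# A genuinely different shape
t3 = [[[[]]], []]
```

Comparing canonical forms (or, in the paper, polynomial coefficients) turns structural comparison of dependency trees into comparison of plain values.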
The aim of this study is to find parameters that can be used for the classification of texts that are not very long, for example by author or genre. We go through various known parameters and analyze to what extent they are useful for this purpose. We also suggest some improvements that need further checking. We calculate the values of the parameters at various points of a text comprising N tokens (running words), counted from the beginning of the text. As parameters with prospects for author and/or language attribution we identify, in particular, the h-point scaling coefficient, Yule’s K, the relative repeat rate, and the fraction of dis legomena. These parameters demonstrate quite stable behavior in N. Another set comprises the scaling exponents of parameters with respect to N. Certain modifications are suggested for Lambda and entropy, introducing logarithmic corrections that are powers of ln N. The results are applicable to texts of thousands to tens of thousands of words.
{"title":"Attempting at parametrization of moderate-length poetic texts: Moses, a poem by Ivan Franko","authors":"S. Buk, Andrij Rovenchak","doi":"10.53482/2022_53_399","DOIUrl":"https://doi.org/10.53482/2022_53_399","url":null,"abstract":"The aim of this study is to find parameters that can be used for classification of not very long texts, for example, by author, genre, etc. We go through various known parameters and analyze to what extent they are useful for the intended purposes. We also suggest some improvements that need to be checked further. We calculate the values of parameters at various points of text comprising N tokens (running words) counted from the beginning of text. As parameters with prospects of author and/or language attribution we identify, in particular, the h-point scaling coefficient, Yule’s K, relative repeat rate, and the fraction of dis legomena. These parameters demonstrate quite stable behavior in N. Another set includes scaling exponents of parameters with respect to N. Certain modifications are suggested for Lambda and entropy introducing logarithmic corrections being powers of ln N. The results are applicable for texts of thousands to tens of thousand words.","PeriodicalId":51918,"journal":{"name":"Glottometrics","volume":"6 1","pages":"1-23"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78954329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
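Of the parameters listed, Yule’s K has a compact standard definition, K = 10⁴ · (Σₘ m²V(m) − N) / N², where V(m) is the number of word types occurring m times among N tokens; its near-constancy in N is what makes it attractive for attribution. A minimal sketch on a toy token list (the toy data are invented, not from the poem studied):

```python
from collections import Counter

def yules_k(tokens):
    """Yule's K = 10^4 * (sum_m m^2 * V(m) - N) / N^2, with V(m) the
    number of word types occurring m times among N running tokens."""
    n = len(tokens)
    spectrum = Counter(Counter(tokens).values())  # m -> V(m)
    s2 = sum(m * m * vm for m, vm in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)

tokens = "a b a c a b d".split()
k = yules_k(tokens)
```

In the study's setup, K would be evaluated at successive cutoffs N from the start of the text to check how stable it stays.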
Drawing on word-embedding techniques and tracking the frequency and semantic change of hot words on Sina Weibo during the COVID-19 pandemic, this study investigates how language and discourse change during a crisis. More specifically, correlation tests were conducted between word frequency ranks, pandemic data, and word meaning change ratios. Results indicate that the frequency of some hot words changed with both the pandemic data and the frequency of other hot words, and was significantly correlated with the American pandemic data rather than with that of China. Moreover, February 2020 saw the most distinctive semantic changes, marked by a large share of the nearest neighbors for WAR metaphors. The correlations between changes in the frequency and in the nearest neighbors of COVID-19-related hot words exhibited some acceptable peculiarities. This study demonstrates the feasibility of studying discourse through language change by observing minor semantic change at the connotation level in social media data, which adds a new perspective on the impact of the COVID-19 pandemic.
{"title":"Dynamics of language in social emergency: investigating COVID-19 hot words on Weibo","authors":"Yi Zhou, Rui Li, Guangfeng Chen, Haitao Liu","doi":"10.53482/2022_52_395","DOIUrl":"https://doi.org/10.53482/2022_52_395","url":null,"abstract":"Drawing on word embeddings techniques and tracking the frequency and semantic change of hot words on Sina Weibo during the COVID-19 pandemic, this study investigates how language and discourse change during crisis. More specifically, correlation tests were conducted between word frequency ranks, pandemic data, and word meaning change ratio. Results indicated that the frequency of some hot words changed with both pandemic data and the frequency of other hot words, which were significantly correlated with the American pandemic data rather than that of China. Moreover, February of 2020 saw the most distinctive semantic changes marked by a large part of the nearest neighbors for WAR metaphors. The correlations between changes in the frequency and nearest neighbors of COVID-19 related hot words exhibited some acceptable peculiarities. This study proves the availability of studying discourse through language change by observing minor semantic change on connotation level from social media, which adds a new perspective to the impact of the COVID-19 pandemic.","PeriodicalId":51918,"journal":{"name":"Glottometrics","volume":"875 1","pages":"1-20"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76981092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
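Since the abstract correlates frequency ranks with pandemic data, one plausible form of such a test is a Spearman rank correlation, sketched here in plain Python; the weekly figures are invented and the study’s exact test choice is not specified here:

```python
def rank(xs):
    """Average ranks (1-based), ties sharing the mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        r = (i + j) / 2 + 1          # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = r
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical weekly frequency of one hot word vs. weekly case counts
word_freq = [12, 30, 45, 80, 60]
cases = [100, 250, 400, 900, 500]
rho = spearman(word_freq, cases)
```

Because both toy series rise and fall together, their rankings coincide and rho is 1; real hot-word series would give the intermediate values the study tests for significance.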
Berhane Abebe, M. Chebunin, A. Kovalevskii, N. Zakrevskaya
The growth processes of the number of distinct words in a text, when reading in the forward and backward directions, are studied in this article. Based on a statistic obtained from the difference between these two processes, we construct a statistical test, which is used for checking text homogeneity. The elementary model states that the words of a text are selected from some dictionary, independently of each other, according to the Zipf–Mandelbrot law. P-values of the test are calculated from this elementary probabilistic model using the asymptotic normality of the corresponding statistics. Finally, the test is applied to the analysis of the homogeneity of sequences of sonnets.
{"title":"Statistical tests for text homogeneity: using forward and backward processes of numbers of different words","authors":"Berhane Abebe, M. Chebunin, A. Kovalevskii, N. Zakrevskaya","doi":"10.53482/2022_53_401","DOIUrl":"https://doi.org/10.53482/2022_53_401","url":null,"abstract":"The processes of growth in the number of diverse words in a text, when reading in the forward and backward directions, are studied in this article. Based upon the statistics achieved from the difference between these two processes, we construct a statistical test. This statistical test is used for text homogeneity checks. The elementary model states that words in a text are selected from some dictionary independent of each other according to the Zipf–Mandelbrot law. P-values of the statistical test are calculated based on the elementary probabilistic model using the asymptotic normality of corresponding statistics. At last but not least, this statistical test is applied for the analysis of homogeneity of sequences of sonnets.","PeriodicalId":51918,"journal":{"name":"Glottometrics","volume":"37 1","pages":"42-58"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74519626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
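The two processes the test is built from, the running counts of distinct words when reading forward and backward, can be sketched directly; the statistic’s normalization and the p-value computation under the Zipf–Mandelbrot model are not reproduced here, and the toy tokens are invented:

```python
def distinct_growth(tokens):
    """Number of distinct words seen after each position."""
    seen, out = set(), []
    for w in tokens:
        seen.add(w)
        out.append(len(seen))
    return out

tokens = "a b a c b a d".split()
forward = distinct_growth(tokens)
# backward[i] = number of distinct words in tokens[i:], aligned to positions
backward = distinct_growth(tokens[::-1])[::-1]
diff = [f - b for f, b in zip(forward, backward)]
```

Under the homogeneity hypothesis the difference process has no systematic drift; a concentration of new vocabulary in one half of the text pulls `diff` away from zero, which is what the test detects.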
The novel The Grapes of Wrath is distinctive in its arrangement of intercalary chapters and narrative chapters. Existing studies of the narratological distinctiveness of this novel are primarily qualitative. This article conducts a corpus-driven study of the stylistic variation in the novel from the perspectives of word clusters, type-token ratio, descriptivity and activity, keyness, and sentiment. The cluster analysis shows that the choice of words in the narrative chapters is more consistent than in the intercalary chapters. The type-token ratio analysis testifies to the heterogeneity of the intercalary chapters in terms of lexical richness. The descriptivity-activity analysis and the keyness analysis reveal that the narrative chapters are more active than the intercalary chapters. The sentiment analysis finds that the novel is pervaded by negative sentiments, and that these are more prevalent in the narrative chapters than in the intercalary chapters. The study concludes that a corpus-driven approach can provide insights into the narrative structure and the stylistic variation of the novel.
{"title":"A Corpus-Driven Study of the Style Variation in The Grapes of Wrath","authors":"Yiyang Hu, Qingshun He","doi":"10.53482/2022_52_396","DOIUrl":"https://doi.org/10.53482/2022_52_396","url":null,"abstract":"The novel The Grapes of Wrath is distinctive in the arrangement of intercalary chapters and narrative chapters. Existing studies of the narratological distinction of this novel are primarily qualitative. This article conducted a corpus-driven study of the variation of styles in this novel from the perspectives of word cluster, type-token ratio, descriptivity and activity, keyness, and sentiment. The cluster analysis shows that the choice of words in the narrative chapters is more consistent than that in the intercalary chapters. The type-token ratio analysis testifies to the heterogeneity of the intercalary chapters in terms of lexical richness. The descriptivity and activity analysis and the keyness analysis reveal that the narrative chapters are more active than the intercalary chapters. The sentiment analysis finds that the novel is pervaded by negative sentiments and that negative sentiments are more prevalent in the narrative chapters than in the intercalary chapters. The research concludes that the corpus-driven study can provide insights into the narrative structure and the stylistic variation of the novel.","PeriodicalId":51918,"journal":{"name":"Glottometrics","volume":"10 1","pages":"21-38"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78928455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
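A minimal sketch of the type-token ratio underlying the lexical-richness comparison, on invented equal-length word samples (not quotations from the novel); since TTR is length-sensitive, real chapter comparisons should use samples of equal size:

```python
def type_token_ratio(tokens):
    """Type-token ratio: distinct word forms over running words."""
    return len(set(tokens)) / len(tokens)

# Hypothetical 11-token samples standing in for the two chapter types
narrative = "the truck rolled on and the men watched the road ahead".split()
intercalary = "dust wind banks tractors land owners silence anger dusk fear roads".split()
r1 = type_token_ratio(narrative)
r2 = type_token_ratio(intercalary)
```

A higher TTR in one sample indicates greater lexical richness there, which is the pattern the study reports for the intercalary chapters.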
{"title":"Book review - On Invisible Language in Modern English: A Corpus-based Approach to Ellipsis. By Evelyn Gandón-Chapela. London: Bloomsbury Academic. 2020","authors":"Zheyuan Dai","doi":"10.53482/2022_52_398","DOIUrl":"https://doi.org/10.53482/2022_52_398","url":null,"abstract":"","PeriodicalId":51918,"journal":{"name":"Glottometrics","volume":"97 1","pages":"65-69"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80534655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}