Pub Date: 2020-10-01 | DOI: 10.1080/09296174.2019.1577939
Lukun Zheng, Huiqiang Zheng
ABSTRACT Authorship attribution is the process of determining the author of a text in question by capturing the author’s writing style through selected stylistic features. In this paper, we propose a new methodology for authorship attribution based on a profile of indices related to the generalized coupon collector problem, called coupon-collector-type indices. The coupon collector problem and its generalizations are of traditional and recurrent interest. Coupons are drawn one at a time from a population containing n distinct types of coupons. The process continues until a complete set of n distinct coupons is obtained, and the total number of draws is recorded. We base our methodology on function words. We establish a testing procedure by constructing a confidence band for the coupon-collector-type indices using an empirical bootstrap technique. We validate the proposed methodology on several writing samples of known authorship. We then apply it to the question of who wrote the fifteenth Oz book, whose authorship is disputed between Lyman Frank Baum (1856–1919) and his successor on the Oz series, Ruth Plumly Thompson (1891–1976).
"Authorship Attribution via Coupon-Collector-Type Indices". Journal of Quantitative Linguistics, 27(1), 321–333.
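The collection process described in the abstract above is easy to simulate. Below is a minimal sketch (not the authors' actual index construction, which profiles several coupon-collector-type indices over function words): it draws coupons uniformly until all n types have been seen, and compares the simulated mean with the classical expectation n·H_n.

```python
import random

def coupon_collector_draws(n, rng=None):
    """Simulate drawing coupons uniformly at random until all n distinct
    types are seen; return the total number of draws."""
    rng = rng or random.Random()
    seen = set()
    draws = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws

def expected_draws(n):
    """Exact expectation for the classic problem: n * H_n (harmonic number)."""
    return n * sum(1.0 / k for k in range(1, n + 1))

# For n = 10 the expectation is 10 * H_10, roughly 29.29.
sims = [coupon_collector_draws(10, random.Random(i)) for i in range(2000)]
mean_draws = sum(sims) / len(sims)
```

With 2,000 replications the simulated mean settles close to the exact value, which is the kind of behaviour the bootstrap confidence band in the paper relies on.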
Pub Date: 2020-10-01 | DOI: 10.1080/09296174.2019.1580812
Wei Guan
ABSTRACT Based upon a taxonomy of represented sources, this paper investigates the quantitative features of two groups – one comprised of adults and the other of seven-year-old children – that employ a variety of sources in everyday representation. Results indicate that: 1) the overall probability distributions of represented sources used by both groups fit the modified right truncated Zipf-Alekseev distribution well; 2) the R2 value of the adult group is lower than that of the child group, largely due to the prevalence of non-present, non-specified human references by the adults; 3) the values of the fitting parameters a and n differ significantly between the two groups; 4) although representation from sources is an extralinguistic phenomenon, it nevertheless reflects the same quality of language as a human-driven complex adaptive system, and studying it offers a better understanding of the quantitative features of this phenomenon. In summary, this study presents the results of several preliminary attempts to study a specific type of extralinguistic phenomenon from a quantitative perspective.
"Probability Distribution of Represented Sources in Conversations of Adults and Children". Journal of Quantitative Linguistics, 27(1), 334–360.
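The right truncated Zipf-Alekseev distribution named above is commonly written as f(x) ∝ x^(−(a + b·ln x)) for ranks x = 1…R. The sketch below computes these probabilities in that basic form only; the paper's "modified" variant additionally treats rank 1 specially, and the parameter values here are invented for illustration.

```python
import math

def zipf_alekseev_pmf(a, b, R):
    """Right-truncated Zipf-Alekseev probabilities for ranks x = 1..R,
    using the common form f(x) proportional to x^-(a + b*ln(x))."""
    weights = [x ** -(a + b * math.log(x)) for x in range(1, R + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative parameters, not fitted values from the paper.
probs = zipf_alekseev_pmf(a=0.8, b=0.2, R=20)
```

For positive a and b the probabilities decrease monotonically in rank, the qualitative shape the fits in the paper capture.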
Pub Date: 2020-09-06 | DOI: 10.1080/09296174.2020.1807853
Lisa Hilte, R. Vandekerckhove, Walter Daelemans
ABSTRACT The present study analyzes the phenomenon of linguistic accommodation, i.e. the adaptation of one’s language use to that of one’s conversation partner. In a large corpus of private social media messages, we compare Flemish teenagers’ writing in two conversational settings: same-gender (including only boys or only girls) and mixed-gender conversations (including at least one girl and one boy). We examine whether boys adopt a more ‘female’ and girls a more ‘male’ writing style in mixed-gender talks, i.e. whether teenagers converge towards their conversation partner with respect to gendered writing. The analyses focus on two sets of prototypical markers of informal online writing, for which a clear gender divide has been attested in previous research: expressive typographic markers (e.g., emoticons), which can be considered more ‘female’ features, and ‘oral’, speech-like markers (e.g., regional language features), which are generally more popular among boys. Using generalized linear mixed models, we examine the frequency of these features in boys’ and girls’ writing in same- versus mixed-gender conversations. Patterns of convergence emerge from the data: they reveal that girls and boys adopt a more similar style in mixed-gender talks. Strikingly, the convergence is asymmetrical and only significant for a particular group of online language features.
"Linguistic Accommodation in Teenagers’ Social Media Writing: Convergence Patterns in Mixed-gender Conversations". Journal of Quantitative Linguistics, 29(1), 241–268.
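The paper's analysis rests on generalized linear mixed models; as a much simpler, hypothetical stand-in, one can compare a marker's rate per 1,000 tokens across the two conversational settings with a two-proportion z-test. All counts below are invented for illustration and do not come from the study's corpus.

```python
import math

def feature_rate(count, tokens):
    """Frequency of a marker per 1,000 tokens."""
    return 1000.0 * count / tokens

def two_proportion_z(c1, n1, c2, n2):
    """z statistic comparing a marker's rate in two conversation settings
    (a simplified stand-in for the paper's generalized linear mixed models,
    which additionally handle per-author random effects)."""
    p1, p2 = c1 / n1, c2 / n2
    p = (c1 + c2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical counts: emoticon tokens in same- vs mixed-gender chats.
z = two_proportion_z(450, 100_000, 380, 100_000)
```

A |z| above roughly 1.96 would flag a difference at the 5% level; the mixed models in the paper refine this by accounting for repeated messages from the same teenagers.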
Pub Date: 2020-07-09 | DOI: 10.1080/09296174.2020.1782716
H. Zhang, Yuting Han, Xingzi Zhang, Liuran Cui
ABSTRACT The current study incorporated a number of lexical sophistication indices including frequency, dispersion and abstractness of words. A learner-based word bank (inclusive of a Chinese middle-school vocabulary list, a Chinese high-school vocabulary list and a Chinese college-English-test vocabulary list) was manually coded based on two existing corpora: the Corpus of Contemporary American English (COCA) and the British National Corpus (BNC). Indices of frequency, dispersion and abstractness of the word bank were analysed to shed light on the predetermined categorization of lexical sophistication among second language learners. Based on a principal component analysis, the results demonstrated that dispersion formed a unique factor loading on all eight entered variables, while word frequency and abstractness were extracted by the same factor in the learner-based word bank. Moreover, a follow-up MANOVA with post hoc comparisons showed that the lexical sophistication indices in general produced pronounced differences among the three levels of word lists. More critically, dispersion was found to be the only significant indicator to differentiate the three levels of word lists. Discussion centred on the uniqueness of dispersion in lexical sophistication and the shared algorithm in frequency and abstractness.
"Frequency, Dispersion and Abstractness in the Lexical Sophistication Analysis of A Learner-Based Word Bank: Dimensionality Reduction and Identification". Journal of Quantitative Linguistics, 29(1), 195–211.
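The dimensionality-reduction step described above can be sketched with a plain SVD-based principal component analysis on standardized index columns. The data below are synthetic (two correlated "frequency" columns plus two independent ones), not the paper's word bank; they simply show how correlated indices collapse onto one component.

```python
import numpy as np

def pca(X, k):
    """Principal component analysis via SVD on standardized columns.
    Returns the top-k component loadings and the share of total
    variance each component explains."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    var = s**2 / np.sum(s**2)
    return Vt[:k], var[:k]

# Synthetic matrix: rows = words, columns = sophistication indices
# (e.g. two correlated frequency measures, dispersion, abstractness).
rng = np.random.default_rng(0)
freq = rng.normal(size=200)
X = np.column_stack([freq, freq + 0.1 * rng.normal(size=200),
                     rng.normal(size=200), rng.normal(size=200)])
loadings, explained = pca(X, k=2)
```

The two near-duplicate frequency columns load on the first component, which accordingly explains close to half the variance; the independent columns each get their own component.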
Pub Date: 2020-07-03 | DOI: 10.1080/09296174.2020.1782720
Kateryna Krykoniuk
ABSTRACT This paper explores different regression models for predicting the type valency of Persian suffixes within a usage-based approach. Usage-based models treat the type frequency of a suffix as a key predictor of its type valency, revealing that an increase in type frequency leads to a greater combining power between a construction’s paradigmatic elements. However, this effect is limited to a certain degree by the potential productivity of a suffix, as inferred from the statistically distinguishable negative correlation between type valency and potential productivity, as well as from the statistical significance of the number-of-hapaxes and potential-productivity variables in the conditional inference tree regression models. Moreover, polyvalency as a distinct feature of Persian derivation implies a number of other characteristics, namely greater morphological diversity of patterns, parsability, semantic transparency and larger conversion power of morphemes. This contrasts with English, whose morphemes are predominantly type-monovalent.
"Predictive Modelling of Type Valency in Word Formation Grammar". Journal of Quantitative Linguistics, 29(1), 212–240.
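The hapax counts and potential productivity figuring in the models above are commonly operationalized Baayen-style as P = V1/N: the number of hapax legomena formed with a suffix divided by its total token count. A minimal sketch, with invented Persian-like token lists (this is the standard measure, not necessarily the paper's exact operationalization):

```python
from collections import Counter

def productivity_stats(tokens_by_suffix):
    """Per suffix: type frequency, hapax count V1, and Baayen-style
    potential productivity P = V1 / N (hapaxes over token count)."""
    stats = {}
    for suffix, tokens in tokens_by_suffix.items():
        counts = Counter(tokens)
        n = sum(counts.values())
        v1 = sum(1 for c in counts.values() if c == 1)
        stats[suffix] = {"type_freq": len(counts), "hapaxes": v1, "P": v1 / n}
    return stats

# Hypothetical token lists for two Persian suffixes.
stats = productivity_stats({
    "-gar": ["kargar", "kargar", "amuzgar", "parhizgar"],
    "-i":   ["khubi", "badi", "zibai", "khubi", "doosti", "tanhai"],
})
```

A suffix with many hapaxes relative to its tokens scores high on P, the brake on type valency that the regression models detect.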
Pub Date: 2020-07-02 | DOI: 10.1080/09296174.2018.1560698
Tomi S. Melka, Michal Místecký
ABSTRACT The article focuses on H. Beam Piper’s classic story Omnilingual (1957). This Piper-esque work has entered the annals of science fiction prose for its ‘Martian’ periodic table of elements, which serves as a scientific ‘Rosetta-like stone’ in the decipherment area. The work, while offering research potential in text analysis and stylistics, may in parallel lend some lustre to the validity of science as a communicative channel in non-conventional circumstances. In order to capture the stylistic features of the novelette, a number of quantitative indicators are drawn in. The study concentrates on vocabulary-richness indexes (TTR, entropy, RR, RRMc, G, ATL, HL, MATTR, and Lambda), a complex assessment of activity (Busemann’s coefficient, the chi-square testing classification), and a sketch of the Belza chain analysis. The goal of the article is to identify distinctive features of the piece in question and to point out ways for further research.
"On Stylometric Features of H. Beam Piper’s Omnilingual". Journal of Quantitative Linguistics, 27(1), 204–243.
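Several of the vocabulary-richness indexes listed above are straightforward to compute. The sketch below implements TTR, entropy, repeat rate (RR) and moving-average TTR (MATTR) in their textbook forms; the exact variants used in the article (e.g. RRMc, Lambda) may differ.

```python
import math
from collections import Counter

def richness_indices(tokens, window=5):
    """Textbook forms of four vocabulary-richness indices:
    TTR      - types over tokens,
    entropy  - Shannon entropy of the word distribution (bits),
    RR       - repeat rate, the sum of squared word probabilities,
    MATTR    - mean TTR over a sliding window of fixed size."""
    n = len(tokens)
    counts = Counter(tokens)
    probs = [c / n for c in counts.values()]
    ttr = len(counts) / n
    entropy = -sum(p * math.log2(p) for p in probs)
    rr = sum(p * p for p in probs)
    windows = [tokens[i:i + window] for i in range(n - window + 1)]
    mattr = sum(len(set(w)) / window for w in windows) / len(windows)
    return {"TTR": ttr, "entropy": entropy, "RR": rr, "MATTR": mattr}

idx = richness_indices("to be or not to be that is the question".split())
```

On this ten-token sample, "to" and "be" each occur twice, so TTR is 0.8 and RR is 0.14; MATTR smooths out the repetition that falls outside most windows.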
Pub Date: 2020-07-02 | DOI: 10.1080/09296174.2019.1566975
Yves Bestgen
ABSTRACT Formulaic sequences in language use are often studied by means of the automatic identification of frequently recurring series of words, often referred to as ‘lexical bundles’, in corpora that contrast different registers, academic disciplines, etc. As corpora often differ in size, a critically important assumption in this field is that the use of a normalized frequency threshold, such as 20 occurrences per million words, allows for an accurate comparison of corpora of different sizes. Yet several researchers have argued that normalization may be unreliable when applied to frequency thresholds. This study investigates the issue by comparing the number of lexical bundles identified in corpora that differ only in size. Using two complementary random sampling procedures, subcorpora of 100,000 to two million words were extracted from five corpora, and lexical bundles were identified in them using two normalized frequency thresholds and two dispersion thresholds. The results show that many more lexical bundles are identified in smaller subcorpora than in larger ones. This size effect can be related to the Zipfian nature of the distribution of words and word sequences in corpora. The conclusion discusses several solutions to avoid the unfairness of comparing lexical bundles identified in corpora of different sizes.
"Comparing Lexical Bundles across Corpora of Different Sizes: The Zipfian Problem". Journal of Quantitative Linguistics, 27(1), 272–290.
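The thresholding mechanism at issue can be made concrete: a normalized threshold such as 20 per million words is rescaled to a raw count for the corpus at hand, and only n-grams at or above that raw count survive. A toy sketch (dispersion thresholds omitted; the token lists are invented):

```python
from collections import Counter

def lexical_bundles(tokens, n=4, per_million=20):
    """Identify n-grams whose raw frequency reaches a normalized
    threshold, scaled from occurrences-per-million to this corpus size."""
    positions = len(tokens) - n + 1
    grams = Counter(tuple(tokens[i:i + n]) for i in range(positions))
    raw_threshold = per_million * len(tokens) / 1_000_000
    return {g: c for g, c in grams.items() if c >= raw_threshold}

# In a toy 21-token corpus the scaled threshold is tiny, so every
# recurring 4-gram qualifies; the size sensitivity the paper documents
# comes precisely from this rescaling.
toy = ("on the other hand " * 3 + "in the case of " * 2).split() + ["end"]
bundles = lexical_bundles(toy, n=4)
```

In a million-word corpus the same setting demands 20 raw occurrences, while in a 100,000-word subcorpus it demands only 2, which is why small samples yield proportionally more bundles.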
Pub Date: 2020-06-18 | DOI: 10.1080/09296174.2020.1771135
Xinying Chen, Kim Gerdes
ABSTRACT The present study investigates the relationship between two features of dependencies, namely dependency distances and dependency frequencies. The study is based on the analysis of a parallel dependency treebank that includes 10 Indo-European languages. Two corresponding random dependency treebanks are generated as baselines for comparison. After computing the values of dependency distances and their frequencies in these treebanks, for each language we fit four functions – quadratic, exponential, logarithmic, and power-law – to its original and random datasets. The preliminary result shows that there is a relation between the two dependency features for all 10 Indo-European languages. The relation can be further formalized as a power-law function which distinguishes the observed data from the randomly generated datasets.
"Dependency Distances and Their Frequencies in Indo-European Language". Journal of Quantitative Linguistics, 29(1), 106–125.
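The power-law relation reported above is typically fitted by linear regression in log-log space: f(d) = c·d^α becomes log f = log c + α·log d. A sketch on synthetic distance-frequency data (the exponent −2 is arbitrary, not an estimate from the paper):

```python
import numpy as np

def fit_power_law(distances, frequencies):
    """Fit f(d) = c * d**alpha by least squares in log-log space,
    mirroring the power-law relation between dependency distance
    and dependency frequency."""
    logd, logf = np.log(distances), np.log(frequencies)
    alpha, logc = np.polyfit(logd, logf, 1)
    return alpha, np.exp(logc)

# Synthetic counts: frequency decays exactly as d^-2 here.
d = np.arange(1, 11)
f = 9000.0 * d ** -2.0
alpha, c = fit_power_law(d, f)
```

On real treebank counts the fit is approximate rather than exact, and comparing the fitted α against the random-baseline treebanks is what separates observed from random data.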
Pub Date: 2020-06-16 | DOI: 10.1080/09296174.2020.1774297
Elham Najafi, Alireza Valizadeh, A. Darooneh, A. Darooneh
ABSTRACT Investigating the coherence of translated texts is an important issue in multilingual studies. In this paper, we study text coherence in human-translated texts and its relation to text properties using a quantitative approach. For this purpose, we assigned a word-importance value to each word type of a text and constructed ‘importance time series’ from the original and translated texts. We then calculated global text coherence by applying Detrended Fluctuation Analysis (DFA) to these time series. By means of this procedure, we were able to compare the coherence of the original and translated texts. Our results show that translation does not always decrease text coherence, as many people may suppose; there are many cases where text coherence is increased by translation. We also studied the relation of text coherence to text properties such as text size and vocabulary size, and observed no relationship. Our findings suggest that the coherence of a text depends on the translator’s abilities rather than on whether the text is original or translated.
"The Effect of Translation on Text Coherence: A Quantitative Study". Journal of Quantitative Linguistics, 29(1), 151–164.
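Detrended Fluctuation Analysis as used above can be sketched as follows: integrate the mean-centred series, detrend it linearly within windows at several scales, and take the slope of log fluctuation against log scale as the scaling exponent. This is the standard DFA-1 recipe, not necessarily the authors' exact implementation of the importance time series.

```python
import numpy as np

def dfa_exponent(series, scales=(4, 8, 16, 32, 64)):
    """DFA-1: cumulative sum of the mean-centred series, linear
    detrending per window, then the log-log slope of fluctuation
    versus window size is the scaling exponent."""
    profile = np.cumsum(series - np.mean(series))
    flucts = []
    for s in scales:
        n_win = len(profile) // s
        f2 = []
        for w in range(n_win):
            seg = profile[w * s:(w + 1) * s]
            x = np.arange(s)
            trend = np.polyval(np.polyfit(x, seg, 1), x)
            f2.append(np.mean((seg - trend) ** 2))
        flucts.append(np.sqrt(np.mean(f2)))
    slope, _ = np.polyfit(np.log(scales), np.log(flucts), 1)
    return slope

# Uncorrelated noise should give an exponent near 0.5; long-range
# correlated (more "coherent") series give larger exponents.
rng = np.random.default_rng(42)
alpha = dfa_exponent(rng.normal(size=4000))
```

Exponents above 0.5 signal long-range correlation in the importance series, which is the sense in which the paper reads DFA output as global coherence.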
Pub Date: 2020-06-16 | DOI: 10.1080/09296174.2020.1771136
Hyungsuc Kang, Janghoon Yang
ABSTRACT Even though a certain degree of political bias is unavoidable in the media, strong media bias is likely to have an impact on society, especially on the formation of public opinion. This research proposes a data-driven method for quantifying the political bias of media content. Using a document classification technique called doc2vec and social data from Facebook posts, a model for analysing the bias is developed. By applying the model to the content of major South Korean newspapers, this paper demonstrates quantitatively that significant political bias exists in the newspapers, in line with their perceived political bias.
"Quantifying Perceived Political Bias of Newspapers through a Document Classification Technique". Journal of Quantitative Linguistics, 29(1), 127–150.
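The paper embeds documents with doc2vec; as a deliberately simple, hypothetical stand-in, the sketch below scores an article by its cosine similarity to term-frequency vectors built from invented partisan reference texts. It illustrates only the similarity-based scoring idea, not the actual doc2vec pipeline or the Facebook data used in the study.

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector (a simple stand-in for the
    dense document embeddings doc2vec would produce)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm

# Invented reference texts and article; all content is illustrative.
left_ref = tf_vector("welfare reform labor unions public spending")
right_ref = tf_vector("tax cuts defense deregulation private enterprise")
article = tf_vector("the budget plan favors tax cuts and deregulation")
bias_score = cosine(article, right_ref) - cosine(article, left_ref)
```

A positive score leans the article toward the right-hand reference; averaging such scores over a newspaper's output gives a crude per-outlet bias estimate in the spirit of the paper's quantitative comparison.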