Pub Date: 2020-06-10  DOI: 10.1080/09296174.2020.1766346
Yaqian Shi, L. Lei
ABSTRACT Text length is a major concern in the measurement of lexical richness, and how lexical richness is affected by text length remains an open question. The present study aims to explore the relation between text length and lexical richness from an entropy-based perspective. Results show a non-linear growth pattern of lexical richness with increasing text length. Specifically, lexical richness increases rapidly in shorter texts; it soon reaches a boundary point beyond which it stabilizes despite the continued expansion of text length. The boundary point of lexical richness under the Shannon estimation is around 1,000 tokens, while under the Zhang estimation it is lower and more varied, falling at 500, 800, or 1,000 tokens. Such stability may be explained by the stabilization of word probabilities in the text.
Title: Lexical Richness and Text Length: An Entropy-based Perspective. Journal of Quantitative Linguistics, 29(1), 62–79.
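An entropy-based measure of lexical richness can be illustrated with a plug-in (maximum-likelihood) Shannon estimate over the word-frequency distribution. This is a minimal sketch, not the paper's exact Shannon or Zhang estimators, and the sample text is invented:

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    """Plug-in (maximum-likelihood) Shannon entropy of the word
    distribution, in bits; one simple entropy-based richness measure."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

text = "the cat sat on the mat and the dog sat on the rug".split()
richness = shannon_entropy(text)
```

A text that repeats one word has entropy 0; a text of n equiprobable distinct words has entropy log2(n), so the measure grows with vocabulary diversity until word probabilities stabilize.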
Pub Date: 2020-06-07  DOI: 10.1080/09296174.2020.1767481
Rui Feng, Congcong Yang, Yunhua Qu
ABSTRACT Recent advances in natural language processing have catalysed active research in designing algorithms to generate contextual vector representations of words, or word embedding, in the machine learning and computational linguistics community. Existing works pay little attention to patterns of words, which encode rich semantic information and impose semantic constraints on a word’s context. This paper explores the feasibility of incorporating word embedding with pattern grammar, a grammar model to describe the syntactic environment of lexical items. Specifically, this research develops a method to extract patterns with semantic information of word embedding and investigates the statistical regularities and distributional semantics of the extracted patterns. The major results of this paper are as follows. Experiments on the LCMC Chinese corpus reveal that the frequency of patterns follows Zipf’s hypothesis, and the frequency and pattern length are inversely related. Therefore, the proposed method enables the study of distributional properties of patterns in large-scale corpora. Furthermore, experiments illustrate that our extracted patterns impose semantic constraints on context, proving that patterns encode rich semantic and contextual information. This sheds light on the potential applications of pattern-based word embedding in a wide range of natural language processing tasks.
Title: A Word Embedding Model for Analyzing Patterns and Their Distributional Semantics. Journal of Quantitative Linguistics, 29(1), 80–105.
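The reported inverse relation between pattern frequency and pattern length can be illustrated with a crude sketch that treats contiguous n-grams as stand-in "patterns" (the paper's pattern-grammar extraction is more sophisticated); the sample sentence is invented:

```python
from collections import Counter

def extract_patterns(tokens, max_len=3):
    """Count every contiguous n-gram of length 1..max_len, a crude
    stand-in for grammar patterns."""
    patterns = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            patterns[tuple(tokens[i:i + n])] += 1
    return patterns

tokens = "the cat sat on the mat the cat ran".split()
pats = extract_patterns(tokens)

# Mean frequency per pattern type drops as pattern length grows.
mean_freq = {n: sum(c for p, c in pats.items() if len(p) == n)
                / sum(1 for p in pats if len(p) == n)
             for n in (1, 2, 3)}
```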
Pub Date: 2020-05-24  DOI: 10.1080/09296174.2020.1766335
Yue Jiang, Ruimin Ma
ABSTRACT The Menzerath–Altmann Law (MAL) is regarded as one of the fundamental laws of language owing to its extensive validity across different languages at various linguistic levels and its applicability to register differentiation. However, whether MAL holds true for translational language remains to be answered. Translational language, distinct from both the source language and the original (non-translated) target language, is viewed as ‘the third code’. This study examines the validity of MAL for translated English literary texts and comparable original texts by exploring the relationship between sentence length (in number of clauses) and clause length (in number of words). Results of the study corroborate that MAL holds true for both original and translated texts. In addition, both a and b, the fitting parameters of the MAL formula, can differentiate translational language from the original, thus justifying the uniqueness of translational language as ‘the third code’ in its own right. This finding suggests that the fitting parameters might be viable indicators for typological differentiation in translation studies. Further, exploring the dynamic relations between a language construct and its constituents may shed some light on the translating process.
Title: Does Menzerath–Altmann Law Hold True for Translational Language: Evidence from Translated English Literary Texts. Journal of Quantitative Linguistics, 29(1), 37–61.
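The MAL relationship between sentence length x (in clauses) and mean clause length y (in words) is often fitted in its two-parameter form y = a·x^b; the full formula adds an exponential factor exp(c·x), omitted here. A minimal least-squares sketch on log-transformed, synthetic data:

```python
import math

def fit_mal(xs, ys):
    """Least-squares fit of y = a * x**b on log-log scale:
    log y = log a + b * log x."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
         / sum((x - mx) ** 2 for x in lx))
    a = math.exp(my - b * mx)
    return a, b

# Synthetic MAL-shaped data: the longer the sentence (in clauses),
# the shorter its clauses (in words).
sentence_len = [1, 2, 3, 4, 5]
clause_len = [10 * x ** -0.3 for x in sentence_len]
a, b = fit_mal(sentence_len, clause_len)
```

On real corpus data the fitted a and b are what the study compares between translated and original texts.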
Pub Date: 2020-05-18  DOI: 10.1080/09296174.2020.1737483
Xinlei Jiang, Yue Jiang, C. Hoi
ABSTRACT Queen's English (QE), a linguistic symbol of the royal or upper class, is a particular, aristocratic variety of English. However, QE has been 'dethroned' by the surprising finding that it shifted phonologically towards common people's English (CE) between the 1950s and 1980s, arousing a debate about its continued existence. Based upon the Queen's Christmas Messages (1952-2018) and the BNC, this study quantitatively investigated whether QE has experienced diachronic changes and drifted towards CE. Our PCA analysis shows QE's fluctuating lexical richness, increasing lexical complexity and synthetism, and steady syntactic features over the six decades. Piecewise regression and statistical results indicate that 1) QE drifted towards CE in lexical richness and complexity between the 1950s and 1980s; 2) QE's syntactic features exhibit an interaction between a "drifting force" towards CE and a "deviating force" away from it between the 1950s and 1980s; 3) QE maintains a synthetic form distinct from the analytic form of CE over the 66 years. These phenomena are likely related to the changing social structure between the 1950s and 1980s, identity building in the Queen's early reign, and the age factor. This study is the first to quantify the drift of QE towards CE lexically and syntactically, which may shed some light on the quantitative investigation of diachronic language change.
Title: Is Queen’s English Drifting Towards Common People’s English? Quantifying Diachronic Changes of Queen’s Christmas Messages (1952–2018) with Reference to BNC. Journal of Quantitative Linguistics, 29(1), 1–36.
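Piecewise regression of the kind used here can be sketched as a brute-force one-breakpoint fit: try each candidate break, fit a straight line on each side by ordinary least squares, and keep the split with the lowest total squared error. The data below are invented, not the Christmas Messages measurements:

```python
def two_segment_breakpoint(xs, ys):
    """Brute-force one-breakpoint piecewise linear fit; returns the
    x-value where the trend changes."""
    def ols_sse(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxx = sum((xi - mx) ** 2 for xi in x)
        b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
             if sxx else 0.0)
        a = my - b * mx
        return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

    best_x, best_err = None, None
    for k in range(2, len(xs) - 1):          # keep >= 2 points per side
        err = ols_sse(xs[:k], ys[:k]) + ols_sse(xs[k:], ys[k:])
        if best_err is None or err < best_err:
            best_x, best_err = xs[k], err
    return best_x

# Trend rising up to x = 5, then falling: the detected break sits there.
xs = list(range(10))
ys = [x if x <= 5 else 10 - x for x in xs]
bp = two_segment_breakpoint(xs, ys)
```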
Pub Date: 2020-04-26  DOI: 10.1080/09296174.2020.1754611
Wenping Li, Jianwei Yan
ABSTRACT Ouyang and Jiang (2018) measured the second language proficiency of English as a foreign language (EFL) learners based on the probability distribution of dependency distance. However, the typological features of the native language (Chinese) and the target language (English) they adopted are generally considered similar in word order and dependency direction. In addition, their method of classifying the learners’ proficiency levels is based on the learners’ grades, which might weaken the validity of the results. These results are strengthened and verified further in the current research by analysing a treebank of Japanese EFL learners’ interlanguage since their native language and the target language are typologically distinctive. Moreover, the TOEIC score was used as a benchmark to classify the second language proficiency levels of the learners. We found that (1) the mean dependency distance can measure the syntactic complexity of Japanese EFL learners’ interlanguage; (2) constrained by human working memory, the probability distribution of dependency distance based on Japanese EFL learners’ interlanguage follows certain distribution patterns as unveiled in other natural human languages; (3) the parameters of the right truncated modified Zipf-Alekseev distribution can well reflect the changes of the Japanese EFL learners’ second language proficiency, indicating the development of interlanguage.
Title: Probability Distribution of Dependency Distance Based on a Treebank of Japanese EFL Learners’ Interlanguage. Journal of Quantitative Linguistics, 28(1), 172–186.
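Mean dependency distance is simply the average absolute distance between each word's position and that of its syntactic head. A minimal sketch with an invented four-word sentence:

```python
def mean_dependency_distance(heads):
    """Mean absolute distance between each word and its head.
    `heads[i]` is the 1-based position of word i+1's head, with 0
    marking the root, which is excluded by convention."""
    dists = [abs(i - h) for i, h in enumerate(heads, start=1) if h != 0]
    return sum(dists) / len(dists)

# Invented sentence "The cat chased mice":
# the -> cat, cat -> chased, chased = root, mice -> chased
mdd = mean_dependency_distance([2, 3, 0, 3])
```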
Pub Date: 2020-04-02  DOI: 10.1080/09296174.2018.1559460
O. Gorina, Natalya S. Tsarakova, Sergey K. Tsarakov
ABSTRACT This paper explores word-frequency patterns in relation to text length, authorship, and random distortion of texts. Through a series of experiments, we determined an optimal text size, a phenomenon predicted by George Zipf, at which the discrepancy between calculated and observed frequencies is minimal. A graphic representation suggested a plausible explanation for the existence of this phenomenon. Working on the assumption that distorted texts might disobey Zipf’s Law, we explored correlations between word frequencies in complete texts and in their distorted counterparts. Results reveal the crucial role of text length in maintaining a Zipfian distribution: randomly chosen sets of words and fragmentary texts of optimal size still obey Zipf’s Law. Findings show that authorship manifests itself through the author constant, defined as the relative frequency of the most frequent words, which remains constant throughout the works of any given author, including randomly chosen text chunks and fragments of sentences of various sizes.
Title: Study of Optimal Text Size Phenomenon in Zipf–Mandelbrot’s Distribution on the Bases of Full and Distorted Texts. Author’s Frequency Characteristics and Derivation of Hapax Legomena. Journal of Quantitative Linguistics, 27(1), 134–158.
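The discrepancy between observed and Zipf-predicted frequencies, the quantity that reaches its minimum at the "optimal" text size, can be sketched with the classical prediction f(r) = f(1)/r (the paper works with the fuller Zipf–Mandelbrot form); the token counts are invented:

```python
from collections import Counter

def zipf_discrepancy(tokens):
    """Mean relative gap between observed rank-frequencies and the
    classical Zipf prediction f(r) = f(1) / r."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    f1 = freqs[0]
    return (sum(abs(f - f1 / r) / f for r, f in enumerate(freqs, start=1))
            / len(freqs))

# Invented counts 4, 2, 1: ranks 1 and 2 fit Zipf exactly, rank 3 misses
# by a third, so the mean relative discrepancy is 1/9.
tokens = ["the"] * 4 + ["of"] * 2 + ["and"]
disc = zipf_discrepancy(tokens)
```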
Pub Date: 2020-04-02  DOI: 10.1080/09296174.2018.1560122
Andrew Wilson, Rosie Harvey
ABSTRACT Previous work has used Greenberg’s synthetism index to compare three of the Celtic languages – Irish, Welsh, and Breton – but not the other three languages, namely Scottish Gaelic, Manx, and Cornish. This paper extends this earlier work by comparing all six Celtic languages, including two periods of Irish (Early Modern and Present Day). The analysis is based on a random sample of 210 parallel psalm texts (30 for each language). However, Greenberg’s synthetism index is problematic because there are no operational standards for counting morphemes within words. We therefore apply a newer typological indicator (B7), which is based solely on lexical rank-frequency statistics. We also explore whether type-token counts alone can provide similar information. The B7 indicator shows that both varieties of Irish, together with Welsh and Cornish, tend more towards synthetism, whereas Manx tends more towards analytism. Breton and Scottish Gaelic do not show a clear tendency in either direction. Rankings using type-token statistics vary considerably and do not tell the same story.
Title: Using Rank-Frequency and Type-Token Statistics to Compare Morphological Typology in the Celtic Languages. Journal of Quantitative Linguistics, 27(1), 159–186.
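The length sensitivity that makes raw type-token counts unreliable for this kind of comparison is easy to demonstrate. This sketch uses the plain type-token ratio on an invented Welsh-like fragment; the B7 indicator itself is not reproduced here:

```python
def type_token_ratio(tokens):
    """Plain type-token ratio (types / tokens); it shrinks as a text
    grows, one reason rank-frequency indicators are preferred for
    cross-language comparison."""
    return len(set(tokens)) / len(tokens)

short = "yn y dechreuad creodd duw".split()
longer = short + "y nefoedd a r ddaear".split()
```

Even though the longer sample merely continues the same text, its ratio is lower, so a raw TTR comparison across translations of unequal length would mislead.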
Pub Date: 2020-03-11  DOI: 10.1080/09296174.2020.1737488
Huimyung Kang, Jiajin Xu
ABSTRACT Previous works have identified multiple factors, and their interplay, that condition the positioning of concessive adverbial clauses. This study continues this line of research by 1) focusing exclusively on the positioning of although-led concessive adverbial clauses (although-clauses hereafter) among different concessive clause relations; 2) supplementing the factor set with more linguistic features, such as sentence-initial adverbials and hedging terms; and 3) extending and generalizing the scope of competition among semantic, discoursal and processing motivators to a higher-level competition between ‘clarity’ and ‘processability’. Data were retrieved from 1,738 concessive sentences of student argumentative essays from the BAWE and NESSIE corpora. Models were generated based on binary logistic regression and random forests. The results show that the motivator of the relationship between the although-clauses and their main clauses was the most significant variable in all models, denoting its priority in conditioning concessive clause positioning under the Competition Model framework. Subordinate clause complexity and deranking (i.e. clauses that do not have a full verb) were the least significant among all motivating factors. Overall, clarity-related motivators outweigh processability-related ones, prioritizing the clear conveyance of meaning over ease of processing.
Title: A Multifactorial Analysis of Concessive Clause Positioning. Journal of Quantitative Linguistics, 28(1), 356–380.
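A binary logistic regression of the kind used to model clause position can be sketched with per-sample gradient descent on the log-loss. The single feature and data below are invented for illustration and are not the paper's factor set:

```python
import math

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Binary logistic regression via per-sample gradient descent;
    a stand-in for the clause-position model."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi                     # gradient of the log-loss
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

# Toy feature: subordinate-clause length; long clauses placed finally (1).
X = [[1], [2], [3], [7], [8], [9]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(X, y)

def predict(x):
    return 1.0 / (1.0 + math.exp(-(w[0] * x + b))) > 0.5
```

With many binary and categorical predictors, the fitted coefficients play the role of the "motivators" whose relative significance the study compares.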
Pub Date: 2020-03-03  DOI: 10.1080/09296174.2020.1732765
Gotzon Aurrekoetxea, Aitor Iglesias, E. Clua, I. Usobiaga, M. Salicrú
ABSTRACT Comparing dialectal classifications into disjoint zones with the representation of populations in a geolectal continuum has emphasized the importance of transition regions. Identifying these regions has long been a subject of study in the scientific literature, although it has not been carried out in a fully reliable manner. Based on the Basque ‘Bourciez’ Corpus, we highlight the limitations of deterministic methods of dialectal classification along with the possibilities offered by fuzzy logic. By adding objectivity to the analysis, the C-means classification has allowed us to retain information from the deterministic classification, identify transition regions, emphasize the geolectal continuum and minimize the artificial isolation of certain populations in the classification. Classifying the French-Basque territory into two groups separated the populations into two nearly disjoint dialectal zones. Classifications into three and four groups underscored the broad overlap between adjacent linguistic zones. This paper thus provides a new explanatory dimension and consequently improves the linguistic interpretation.
Title: Analysis of Transitional Areas in Dialectology: Approach with Fuzzy Logic. Journal of Quantitative Linguistics, 28(1), 337–355.
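Fuzzy C-means differs from hard clustering in assigning every site a membership degree in each cluster, so transition areas surface as mixed memberships rather than forced labels. A minimal one-dimensional sketch with invented data (the paper clusters multivariate dialect features):

```python
def fuzzy_cmeans(points, c=2, m=2.0, iters=100):
    """Minimal 1-D fuzzy C-means with deterministic init for c=2:
    every point receives a membership degree in each cluster."""
    centers = [min(points), max(points)]
    U = []
    for _ in range(iters):
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        U = []
        for p in points:
            d = [abs(p - ct) or 1e-12 for ct in centers]
            U.append([1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0))
                                for k in range(c)) for j in range(c)])
        # Centre update: membership-weighted mean with weights u^m.
        centers = [sum(U[i][j] ** m * points[i] for i in range(len(points)))
                   / sum(U[i][j] ** m for i in range(len(points)))
                   for j in range(c)]
    return centers, U

pts = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1, 0.55]   # 0.55 sits between clusters
centers, U = fuzzy_cmeans(pts)
```

The point at 0.55 ends up with near-equal memberships in both clusters, which is exactly how a transitional population would be flagged.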
Pub Date: 2020-03-01  DOI: 10.1080/09296174.2020.1732177
José Ramom Pichel, Pablo Gamallo, I. Alegria, Marco Neves
ABSTRACT The aim of this paper is to apply a corpus-based methodology, built on the measure of perplexity, to automatically calculate the cross-lingual language distance between historical periods of three languages. The three historical corpora have been constructed and collected with the closest spelling to the original, on a balanced basis of fiction and non-fiction. This methodology has been applied to measure the historical distance of Galician with respect to Portuguese and Spanish, from the Middle Ages to the end of the 20th century, both in original spelling and in automatically transcribed spelling. The quantitative results are contrasted with hypotheses drawn from experts in historical linguistics. Results show that Galician and Portuguese were varieties of the same language in the Middle Ages and that Galician has converged and diverged with respect to Portuguese and Spanish since the late 19th century. In this process, orthography plays a relevant role. It should be pointed out that the method is unsupervised and can be applied to other languages.
Title: A Methodology to Measure the Diachronic Language Distance between Three Languages Based on Perplexity. Journal of Quantitative Linguistics, 28(1), 306–336.
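Perplexity-based distance trains a language model on one corpus and measures how "surprised" the model is by another: lower perplexity means distributionally closer varieties. A deliberately tiny sketch with a Laplace-smoothed character unigram model and invented sentences (the paper uses character n-gram models over full historical corpora):

```python
import math
from collections import Counter

def unigram_perplexity(train, test):
    """Perplexity of a Laplace-smoothed character unigram model trained
    on `train` and evaluated on `test`; lower = closer."""
    counts = Counter(train)
    vocab = set(train) | set(test)
    n = len(train)
    log_prob = sum(math.log((counts[ch] + 1) / (n + len(vocab)))
                   for ch in test)
    return math.exp(-log_prob / len(test))

# Invented sample sentences, accents stripped for simplicity.
galician   = "a lingua galega e proxima ao portugues"
portuguese = "a lingua portuguesa e proxima ao galego"
spanish    = "el idioma espanol es distinto en su forma"
d_close = unigram_perplexity(galician, portuguese)
d_far   = unigram_perplexity(galician, spanish)
```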