{"title":"Book review - Language and Text. Data, Models, Information and Applications. By Pawłowski, A., Mačutek, J., Embleton, S., Mikros, G. (Eds.). Amsterdam/Philadelphia: John Benjamins Publishing Company. 2021","authors":"E. Kelih","doi":"10.53482/2022_53_403","DOIUrl":"https://doi.org/10.53482/2022_53_403","url":null,"abstract":"","PeriodicalId":51918,"journal":{"name":"Glottometrics","volume":"8 1","pages":"80-83"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84298915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This study investigates the differences in the mean dependency distances (MDDs) of the English essays in a learner corpus, focusing on the different proficiency levels of learners, and the different dependency types. This study is based on the following three assumptions. Firstly, the MDDs of learners' production increase as proficiency levels increase. Secondly, there is an upper limit over which MDDs do not exceed, as predicted by the Dependency Distance Minimization principle. Finally, different types of dependencies show different tendencies across learners of different proficiency levels. This study attempts to verify these assumptions with substantial learner corpus data, categorized into subcorpora according to learner proficiency. Corpus analyses yield results that support these assumptions. These results are expected to constitute a prerequisite for employing the MDD of an individual learner's production to evaluate his or her proficiency level.
{"title":"Differences of Mean Dependency Distances of English Essays Written by Learners of Different Proficiency Levels","authors":"M. Oya","doi":"10.53482/2022_53_400","DOIUrl":"https://doi.org/10.53482/2022_53_400","url":null,"abstract":"This study investigates the differences in the mean dependency distances (MDDs) of the English essays in a learner corpus, focusing on the different proficiency levels of learners, and the different dependency types. This study is based on the following three assumptions. Firstly, the MDDs of learners' production increase as proficiency levels increase. Secondly, there is an upper limit over which MDDs do not exceed, as predicted by the Dependency Distance Minimization principle. Finally, different types of dependencies show different tendencies across learners of different proficiency levels. This study attempts to verify these assumptions with substantial learner corpus data, categorized into subcorpora according to learner proficiency. Corpus analyses yield results that support these assumptions. These results are expected to constitute a prerequisite for employing the MDD of an individual learner's production to evaluate his or her proficiency level.","PeriodicalId":51918,"journal":{"name":"Glottometrics","volume":"1 1","pages":"24-41"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89822148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Happy Birthday Glottometrics – On the Occasion of the 50th Issue and 20th Anniversary","authors":"E. Kelih, Radek Čech, Ján Mačutek","doi":"10.53482/383","DOIUrl":"https://doi.org/10.53482/383","url":null,"abstract":"Editorial Note","PeriodicalId":51918,"journal":{"name":"Glottometrics","volume":"350 1","pages":"1-3"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75724063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, the degree to which Persian orthography deviates from transparency is quantified and evaluated. We investigate the relations between graphemes and phonemes in Persian, in which the writing system is not fully representative of the spoken language, mostly due to the omission of the short-vowel graphemes. We measure the degree of the Persian orthographic system transparency using a heuristic mathematical model. We apply the same measures to orthographic systems of other languages and compare the results to those obtained for Persian. The results show a relatively high degree of transparency in Persian when it comes to writing, but a low degree of transparency when it comes to reading. We also consider models that avoid the problems related to the short vowels in Persian and these models demonstrate a considerable decrease of the uncertainty in the Persian orthographic system.
{"title":"The Ambiguity of the Relations between Graphemes and Phonemes in the Persian Orthographic System","authors":"Tayebeh Mosavi Miangah, R. Vulanović","doi":"10.53482/2021_50_385","DOIUrl":"https://doi.org/10.53482/2021_50_385","url":null,"abstract":"In this paper, the degree to which Persian orthography deviates from transparency is quantified and evaluated. We investigate the relations between graphemes and phonemes in Persian, in which the writing system is not fully representative of the spoken language, mostly due to the omission of the short-vowel graphemes. We measure the degree of the Persian orthographic system transparency using a heuristic mathematical model. We apply the same measures to orthographic systems of other languages and compare the results to those obtained for Persian. The results show a relatively high degree of transparency in Persian when it comes to writing, but a low degree of transparency when it comes to reading. We also consider models that avoid the problems related to the short vowels in Persian and these models demonstrate a considerable decrease of the uncertainty in the Persian orthographic system.","PeriodicalId":51918,"journal":{"name":"Glottometrics","volume":"161 1","pages":"9-26"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76628527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Many authors have examined the influence of loanwords in languages using statistical methods. However, English loanwords in Mongolian are rarely studied in quantitative linguistics. The results of the present study show that English loanwords in Mongolian share the universal feature of other tested languages, as their frequency distribution abides by Zipf’s Law. In addition, we define and test nine English loanword models depending on borrowing method and parts of speech, and find that the results can be described using a power function.
{"title":"English Loanwords in Mongolian Usage","authors":"Minna Bao, Saheya Brintag, Dabhurbayar Huang","doi":"10.53482/2021_50_386","DOIUrl":"https://doi.org/10.53482/2021_50_386","url":null,"abstract":"Many authors have examined the influence of loanwords in languages using statistical methods. However, English loanwords in Mongolian are rarely studied in quantitative linguistics. The results of the present study show that English loanwords in Mongolian share the universal feature of other tested languages, as their frequency distribution abides by Zipf’s Law. In addition, we define and test nine English loanword models depending on borrowing method and parts of speech, and find that the results can be described using a power function.","PeriodicalId":51918,"journal":{"name":"Glottometrics","volume":"154 1","pages":"27-41"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77509804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a novel quantitative approach for classification of authors' stylistics and gender differences based on extraction of word collocation. The proposed algorithm attenuates previously described issues of text processing using the vector models. We demonstrate the approach by analyzing a corpus of Russian prose. We discuss different approaches for classification and identification of the author's style implemented by currently-available software solutions and libraries of morphological analysis, methods of parameterization, indexing of texts, artificial intelligence algorithms and knowledge extraction. Our results demonstrate the efficiency and relative advantage of regression decision tree methods in identifying informative frequency indexes in a way that lends itself to their logical interpretation. We develop a toolkit for conducting comparative experiments to assess the effectiveness of classification of natural language text data, using vector, set-theoretic and the author's set-theoretic with collocation extraction models of text representation. Comparing the ability of different methods to identify the style and gender differences of authors of fiction works, we find that the proposed approach incorporating collocation information alleviates some of the previously identified deficiencies and yields overall improvements in the classification accuracy.
{"title":"Automatic Identification of Authors' Stylistics and Gender on the Basis of the Corpus of Russian Fiction Using Extended Set-theoretic Model with Collocation Extraction","authors":"Alexandr Osochkin, X. Piotrowska, Vladimir Fomin","doi":"10.53482/2021_50_389","DOIUrl":"https://doi.org/10.53482/2021_50_389","url":null,"abstract":"We present a novel quantitative approach for classification of authors' stylistics and gender differences based on extraction of word collocation. The proposed algorithm attenuates previously described issues of text processing using the vector models. We demonstrate the approach by analyzing a corpus of Russian prose. We discuss different approaches for classification and identification of the author's style implemented by currently-available software solutions and libraries of morphological analysis, methods of parameterization, indexing of texts, artificial intelligence algorithms and knowledge extraction. Our results demonstrate the efficiency and relative advantage of regression decision tree methods in identifying informative frequency indexes in a way that lends itself to their logical interpretation. We develop a toolkit for conducting comparative experiments to assess the effectiveness of classification of natural language text data, using vector, set-theoretic and the author's set-theoretic with collocation extraction models of text representation. Comparing the ability of different methods to identify the style and gender differences of authors of fiction works, we find that the proposed approach incorporating collocation information alleviates some of the previously identified deficiencies and yields overall improvements in the classification accuracy.","PeriodicalId":51918,"journal":{"name":"Glottometrics","volume":"1 1","pages":"76-89"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74469088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Happy Birthday Glottometrics – On the Occasion of the 50th Issue and 20th Anniversary","authors":"E. Kelih, Radek Čech, Ján Mačutek","doi":"10.53482/2021_50_383","DOIUrl":"https://doi.org/10.53482/2021_50_383","url":null,"abstract":"Editorial Note","PeriodicalId":51918,"journal":{"name":"Glottometrics","volume":"100 1","pages":"1-3"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79196468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The paper proposes a methodology for analyzing the syllabic structure of Tatar words using fiction text data. Syllable construction rules are unique for each language as they are determined by the laws that govern its specific internal structure. However, the issue of the syllable finds a rather superficial description in Tatar grammars. Thus, possible correlations of the syllable structure with morphological features of the language will be examined in this paper. We analyze the distribution of syllable types in Tatar texts and represent their ranked frequencies and theoretical values fitted by means of the Zipf Mandelbrot distribution. The main part of the study is devoted to inquiry into the structure of initial and final syllables. We proceed from the hypothesis that distributions of syllable structures in word-initial and word-final positions should be marked by statistically important differences due to discriminative structural features of stems and affixal chains. The study is based on a selection of obstruent and sonorant consonants. To evaluate statistical significance of these differences, the well-known chi square test is applied.
{"title":"Initial and Final Syllables in Tatar: from Phonotactics to Morphology","authors":"A. Galieva, Zhanna Vavilova","doi":"10.53482/2021_50_388","DOIUrl":"https://doi.org/10.53482/2021_50_388","url":null,"abstract":"The paper proposes a methodology for analyzing the syllabic structure of Tatar words using fiction text data. Syllable construction rules are unique for each language as they are determined by the laws that govern its specific internal structure. However, the issue of the syllable finds a rather superficial description in Tatar grammars. Thus, possible correlations of the syllable structure with morphological features of the language will be examined in this paper. We analyze the distribution of syllable types in Tatar texts and represent their ranked frequencies and theoretical values fitted by means of the Zipf Mandelbrot distribution. The main part of the study is devoted to inquiry into the structure of initial and final syllables. We proceed from the hypothesis that distributions of syllable structures in word-initial and word-final positions should be marked by statistically important differences due to discriminative structural features of stems and affixal chains. The study is based on a selection of obstruent and sonorant consonants. To evaluate statistical significance of these differences, the well-known chi square test is applied.","PeriodicalId":51918,"journal":{"name":"Glottometrics","volume":"6 1","pages":"57-75"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85444903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jie Song, Yunhua Qu, Xiaonan Zhu, Xiaoying Wang, Yifan Zhang
Multi-dimensional Analysis (MD) is a quantitative corpus-based approach which describes and interprets patterns of register variations through factor analysis of a set of linguistic features across text varieties, and reveals their systematic relationships with communicative purposes. The model has been employed to explore language variation in many languages (e.g., English, Somali, Nukulaelae Tuvaluan, Korean, and Spanish), yet insufficient research has been carried out on register variation in Mandarin Chinese on a full scale. In this research, 88 linguistic features are tagged in a balanced corpus composed of 20 Mandarin Chinese spoken and written registers. Through factor analysis, five dimensions which consist of 65 linguistic features are identified and interpreted from linguistic and functional perspectives. The first two dimensions, interactive vs. informational discourse and narrative vs. non-narrative concern, are similar to dimensions that have been claimed to constitute universal parameters of register variation in previous MD studies. The existence of two potential universal dimensions suggests that the basic communicative purposes and functions underlying the different languages are markedly similar, given the existing social, cultural, and linguistic dissimilarities. Dimension 4, casual real-time speech with stance, is identified as a distinctive dimension in Mandarin Chinese. Dimension 3, explicitness in cohesion and reasoning, and Dimension 5, abstract information, are found to be associated with foreign influence, and their register variation patterns illustrate how foreign contact affects Chinese register variation in a quantitative manner.
{"title":"A Multi-dimensional Approach to Register Variations in Mandarin Chinese","authors":"Jie Song, Yunhua Qu, Xiaonan Zhu, Xiaoying Wang, Yifan Zhang","doi":"10.53482/2021_51_393","DOIUrl":"https://doi.org/10.53482/2021_51_393","url":null,"abstract":"Multi-dimensional Analysis (MD) is a quantitative corpus-based approach which describes and interprets patterns of register variations through factor analysis of a set of linguistic features across text varieties, and reveals their systematic relationships with communicative purposes. The model has been employed to explore language variation in many languages (e.g., English, Somali, Nukulaelae Tuvaluan, Korean, and Spanish), yet insufficient research has been carried out on register variation in Mandarin Chinese on a full scale. In this research, 88 linguistic features are tagged in a balanced corpus composed of 20 Mandarin Chinese spoken and written registers. Through factor analysis, five dimensions which consist of 65 linguistic features are identified and interpreted from linguistic and functional perspectives. The first two dimensions, interactive vs. informational discourse and narrative vs. non-narrative concern, are similar to dimensions that have been claimed to constitute universal parameters of register variation in previous MD studies. The existence of two potential universal dimensions suggests that the basic communicative purposes and functions underlying the different languages are markedly similar, given the existing social, cultural, and linguistic dissimilarities. Dimension 4, casual real-time speech with stance, is identified as a distinctive dimension in Mandarin Chinese. Dimension 3, explicitness in cohesion and reasoning, and Dimension 5, abstract information, are found to be associated with foreign influence, and their register variation patterns illustrate how foreign contact affects Chinese register variation in a quantitative manner.","PeriodicalId":51918,"journal":{"name":"Glottometrics","volume":"135 1","pages":"39-69"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76154281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}