We use an iterative process of multi-gram alignment between associated words in different languages in an attempt to identify cognates. To maximise the amount of data, we use practical orthographies instead of consistently coded phonetic transcriptions. Initial results indicate that practical orthographies can be useful, all the more so when dealing with large amounts of data.
{"title":"Cognate Identification and Alignment Using Practical Orthographies","authors":"Michael Cysouw, H. Jung","doi":"10.3115/1626516.1626530","DOIUrl":"https://doi.org/10.3115/1626516.1626530","url":null,"abstract":"We use an iterative process of multi-gram alignment between associated words in different languages in an attempt to identify cognates. To maximise the amount of data, we use practical orthographies instead of consistently coded phonetic transcriptions. First results indicate that using practical orthographies can be useful, the more so when dealing with large amounts of data.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121405235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Animesh Mukherjee, M. Choudhury, A. Basu, Niloy Ganguly
In this work, we attempt to capture patterns of co-occurrence across vowel systems and, at the same time, to identify the nature of the force that leads to the emergence of such patterns. For this purpose we define a weighted network in which the vowels are the nodes and an edge between two nodes (read vowels) signifies their co-occurrence likelihood over the vowel inventories. Through this network we identify communities of vowels, which essentially reflect their patterns of co-occurrence across languages. We observe that in the assortative vowel communities the constituent nodes (read vowels) are largely uncorrelated in terms of their features, indicating that they are formed on the principle of maximal perceptual contrast. In the remaining communities, however, the constituent vowels are strongly correlated with respect to their features, indicating that it is the principle of feature economy that binds them together.
{"title":"Emergence of Community Structures in Vowel Inventories: An Analysis Based on Complex Networks","authors":"Animesh Mukherjee, M. Choudhury, A. Basu, Niloy Ganguly","doi":"10.3115/1626516.1626529","DOIUrl":"https://doi.org/10.3115/1626516.1626529","url":null,"abstract":"In this work, we attempt to capture patterns of co-occurrence across vowel systems and at the same time figure out the nature of the force leading to the emergence of such patterns. For this purpose we define a weighted network where the vowels are the nodes and an edge between two nodes (read vowels) signify their co-occurrence likelihood over the vowel inventories. Through this network we identify communities of vowels, which essentially reflect their patterns of co-occurrence across languages. We observe that in the assortative vowel communities the constituent nodes (read vowels) are largely uncorrelated in terms of their features indicating that they are formed based on the principle of maximal perceptual contrast. However, in the rest of the communities, strong correlations are reflected among the constituent vowels with respect to their features indicating that it is the principle of feature economy that binds them together.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127054999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
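The weighted vowel network described in the abstract above can be illustrated with a minimal sketch. It assumes that an edge weight is simply the number of inventories in which two vowels co-occur; the paper's actual "co-occurrence likelihood" may be defined differently, and the function name `cooccurrence_network` and the toy inventories are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(inventories):
    """Build a weighted co-occurrence network from vowel inventories.

    Nodes are vowels; the weight of edge (a, b) is the number of
    inventories that contain both a and b. Raw counts are an assumption
    made for illustration, not the paper's weighting scheme.
    """
    weights = Counter()
    for inventory in inventories:
        # Sort so that each unordered pair is always stored as (a, b) with a < b.
        for a, b in combinations(sorted(set(inventory)), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical toy data: three vowel inventories.
inventories = [
    {"i", "a", "u"},
    {"i", "e", "a", "o", "u"},
    {"i", "a", "u", "e"},
]
network = cooccurrence_network(inventories)
for (a, b), w in sorted(network.items()):
    print(a, b, w)
```

Community detection would then be run on this weighted graph; that step is omitted here because the abstract does not say which algorithm was used.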
Christian Monson, J. Carbonell, A. Lavie, Lori S. Levin
Paradigms provide an inherent organizational structure to natural language morphology. ParaMor, our minimally supervised morphology induction algorithm, retrusses the word forms of raw text corpora back onto their paradigmatic skeletons, performing on par with state-of-the-art minimally supervised morphology induction algorithms at morphological analysis of English and German. ParaMor consists of two phases. Our algorithm first constructs sets of affixes closely mimicking the paradigms of a language. With these structures in hand, ParaMor then annotates word forms with morpheme boundaries. To set ParaMor's few free parameters we analyze a training corpus of Spanish. Without adjusting parameters, we induce the morphological structure of English and German. Adopting the evaluation methodology of Morpho Challenge 2007 (Kurimo et al., 2007), we compare ParaMor's morphological analyses with Morfessor (Creutz, 2006), a modern minimally supervised morphology induction system. ParaMor consistently achieves competitive F1 measures.
{"title":"ParaMor: Minimally Supervised Induction of Paradigm Structure and Morphological Analysis","authors":"Christian Monson, J. Carbonell, A. Lavie, Lori S. Levin","doi":"10.3115/1626516.1626531","DOIUrl":"https://doi.org/10.3115/1626516.1626531","url":null,"abstract":"Paradigms provide an inherent organizational structure to natural language morphology. ParaMor, our minimally supervised morphology induction algorithm, retrusses the word forms of raw text corpora back onto their paradigmatic skeletons; performing on par with state-of-the-art minimally supervised morphology induction algorithms at morphological analysis of English and German. ParaMor consists of two phases. Our algorithm first constructs sets of affixes closely mimicking the paradigms of a language. And with these structures in hand, ParaMor then annotates word forms with morpheme boundaries. To set ParaMor's few free parameters we analyze a training corpus of Spanish. Without adjusting parameters, we induce the morphological structure of English and German. Adopting the evaluation methodology of Morpho Challenge 2007 (Kurimo et al., 2007), we compare ParaMor's morphological analyses with Morfessor (Creutz, 2006), a modern minimally supervised morphology induction system. ParaMor consistently achieves competitive F1 measures.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130549644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data nonlinearity has historically not been and currently is not an issue in work on exploratory multivariate analysis of language corpora. However, the presence of nonlinearity in data has a fundamental bearing on the conduct of exploratory analysis. The first part of the discussion explains why this is so in principle, and the second exemplifies the explanation via exploratory analysis of the Newcastle Electronic Corpus of Tyneside English (NECTE), an historical speech corpus. The conclusion is that data should be screened for nonlinearity prior to analysis and, if a substantial degree of it is found, a nonlinear analytical method should be used.
{"title":"Data Nonlinearity in Exploratory Multivariate Analysis of Language Corpora","authors":"H. Moisl","doi":"10.3115/1626516.1626528","DOIUrl":"https://doi.org/10.3115/1626516.1626528","url":null,"abstract":"Data nonlinearity has historically not been and currently is not an issue in work on exploratory multivariate analysis of language corpora. However, the presence of nonlinearity in data has a fundamental bearing on the conduct of exploratory analysis. The first part of the discussion explains why this is so in principle, and the second exemplifies the explanation via exploratory analysis of the Newcastle Electronic Corpus of Tyneside English (NECTE), an historical speech corpus. The conclusion is that data should be screened for nonlinearity prior to analysis and, if a substantial degree of it is found, a nonlinear analytical method should be used.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122226994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we use the Reeks Nederlandse Dialectatlassen as a source for the reconstruction of a 'proto-language' of Dutch dialects. We used 360 dialects from locations in the Netherlands, the northern part of Belgium and French-Flanders. The density of dialect locations is about the same everywhere. For each dialect we reconstructed 85 words. For the reconstruction of vowels we used knowledge of Dutch history, and for the reconstruction of consonants we used well-known tendencies found in most textbooks about historical linguistics. We validated results by comparing the reconstructed forms with pronunciations according to a proto-Germanic dictionary (Kobler, 2003). For 46% of the words we reconstructed the same vowel or the closest possible vowel when the vowel to be reconstructed was not found in the dialect material. For 52% of the words all consonants we reconstructed were the same. For 42% of the words, only one consonant was differently reconstructed. We measured the divergence of Dutch dialects from their 'proto-language'. We measured pronunciation distances to the proto-language we reconstructed ourselves and correlated them with pronunciation distances we measured to proto-Germanic based on the dictionary. Pronunciation distances were measured using Levenshtein distance, a string edit distance measure. We found a relatively strong correlation (r=0.87).
{"title":"The Relative Divergence of Dutch Dialect Pronunciations from their Common Source: An Exploratory Study","authors":"W. Heeringa, B. Joseph","doi":"10.3115/1626516.1626521","DOIUrl":"https://doi.org/10.3115/1626516.1626521","url":null,"abstract":"In this paper we use the Reeks Nederlandse Dialectatlassen as a source for the reconstruction of a 'proto-language' of Dutch dialects. We used 360 dialects from locations in the Netherlands, the northern part of Belgium and French-Flanders. The density of dialect locations is about the same everywhere. For each dialect we reconstructed 85 words. For the reconstruction of vowels we used knowledge of Dutch history, and for the reconstruction of consonants we used well-known tendencies found in most textbooks about historical linguistics. We validated results by comparing the reconstructed forms with pronunciations according to a proto-Germanic dictionary (Kobler, 2003). For 46% of the words we reconstructed the same vowel or the closest possible vowel when the vowel to be reconstructed was not found in the dialect material. For 52% of the words all consonants we reconstructed were the same. For 42% of the words, only one consonant was differently reconstructed. We measured the divergence of Dutch dialects from their 'proto-language'. We measured pronunciation distances to the proto-language we reconstructed ourselves and correlated them with pronunciation distances we measured to proto-Germanic based on the dictionary. Pronunciation distances were measured using Levenshtein distance, a string edit distance measure. We found a relatively strong correlation (r=0.87).","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"179 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116173055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
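The Dutch dialect abstract above rests on two measurements: Levenshtein distance between pronunciations, and a correlation between two sets of such distances. The sketch below shows both in their plainest form. It assumes unit costs for insertion, deletion and substitution and a Pearson correlation coefficient; the paper may use a phonetically weighted or length-normalised variant, and the function names are illustrative only.

```python
import math


def levenshtein(a, b):
    """Edit distance between sequences a and b with unit costs (an assumption)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i symbols of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


print(levenshtein("kitten", "sitting"))  # 3
```

In a setting like the one described, each dialect would contribute one distance to the reconstructed proto-language and one to the dictionary-based proto-Germanic forms, and `pearson` would be applied to those two lists.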
We apply algorithms for the identification of cognates and recurrent sound correspondences proposed by Kondrak (2002) to the Totonac-Tepehua family of indigenous languages in Mexico. We show that by combining expert linguistic knowledge with computational analysis, it is possible to quickly identify a large number of cognate sets within the family. Our objective is to provide tools for rapid construction of comparative dictionaries for relatively unfamiliar language families.
{"title":"Creating a Comparative Dictionary of Totonac-Tepehua","authors":"Grzegorz Kondrak, D. Beck, Philip Dilts","doi":"10.3115/1626516.1626533","DOIUrl":"https://doi.org/10.3115/1626516.1626533","url":null,"abstract":"We apply algorithms for the identification of cognates and recurrent sound correspondences proposed by Kondrak (2002) to the Totonac-Tepehua family of indigenous languages in Mexico. We show that by combining expert linguistic knowledge with computational analysis, it is possible to quickly identify a large number of cognate sets within the family. Our objective is to provide tools for rapid construction of comparative dictionaries for relatively unfamiliar language families.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121765089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We introduce the proceedings from the workshop 'Computing and Historical Phonology: 9th Meeting of the ACL Special Interest Group for Computational Morphology and Phonology'.
{"title":"Computing and Historical Phonology","authors":"J. Nerbonne, T. M. Ellison, Grzegorz Kondrak","doi":"10.3115/1626516.1626517","DOIUrl":"https://doi.org/10.3115/1626516.1626517","url":null,"abstract":"We introduce the proceedings from the workshop 'Computing and Historical Phonology: 9th Meeting of the ACL Special Interest Group for Computational Morphology and Phonology'.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130703023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper discusses the reconstruction of the Elamite language's phonology from its orthography using the Gradual Learning Algorithm, which was re-purposed to "learn" underlying phonological forms from surface orthography. Practical issues are raised regarding the difficulty of mapping between orthography and phonology, and Optimality Theory's neglected Lexicon Optimization module is highlighted.
{"title":"Phonological Reconstruction of a Dead Language Using the Gradual Learning Algorithm","authors":"Eric Smith","doi":"10.3115/1626516.1626524","DOIUrl":"https://doi.org/10.3115/1626516.1626524","url":null,"abstract":"This paper discusses the reconstruction of the Elamite language's phonology from its orthography using the Gradual Learning Algorithm, which was re-purposed to \"learn\" underlying phonological forms from surface orthography. Practical issues are raised regarding the difficulty of mapping between orthography and phonology, and Optimality Theory's neglected Lexicon Optimization module is highlighted.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134544977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Drawing on eight closely interpreted dialectometrical maps, this paper analyses the change in the geolinguistic deep structures of Northern France (Domaine d'Oïl) between 1300 and 1900. With one exception, the results show that these deep structures remained highly stable.
{"title":"On the Geolinguistic Change in Northern France between 1300 and 1900: A Dialectometrical Inquiry","authors":"H. Goebl","doi":"10.3115/1626516.1626526","DOIUrl":"https://doi.org/10.3115/1626516.1626526","url":null,"abstract":"With the supply of 8 closely interpreted dialectometrical maps, this paper analyses the linguistic change of the geolinguistic deep structures in Northern France (Domaine d'Oil) between 1300 and 1900. As a matter of fact, the result will show -- with one exception -- the great stability of these deep structures.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128611374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The verb inflections of Bengali underwent a series of phonological changes between the 10th and 18th centuries, which gave rise to several modern dialects of the language. In this paper, we offer a functional explanation for this change by quantifying the functional pressures of ease of articulation, perceptual contrast and learnability through objective functions, constraints, or both. We solve the resulting multi-objective, multi-constraint optimization problem with a genetic algorithm and observe the emergence of Pareto-optimal dialects that closely resemble some of the real ones.
{"title":"Evolution, Optimization, and Language Change: The Case of Bengali Verb Inflections","authors":"M. Choudhury, Vaibhav Jalan, S. Sarkar, A. Basu","doi":"10.3115/1626516.1626525","DOIUrl":"https://doi.org/10.3115/1626516.1626525","url":null,"abstract":"The verb inflections of Bengali underwent a series of phonological change between 10th and 18th centuries, which gave rise to several modern dialects of the language. In this paper, we offer a functional explanation for this change by quantifying the functional pressures of ease of articulation, perceptual contrast and learnability through objective functions or constraints, or both. The multi-objective and multi-constraint optimization problem has been solved through genetic algorithm, whereby we have observed the emergence of Pareto-optimal dialects in the system that closely resemble some of the real ones.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130075537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
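The notion of a Pareto-optimal dialect in the Bengali abstract above can be made concrete with a small sketch. It assumes that every candidate is scored on several objectives and that lower scores are better on each. The objective values below are invented for illustration, and the genetic algorithm that would generate the candidates is not shown.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one.

    All objectives are assumed to be minimised.
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]


# Hypothetical scores: (articulatory effort, perceptual confusability).
candidates = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)]
front = pareto_front(candidates)
print(front)  # [(1, 5), (2, 2), (5, 1)]
```

Here `(3, 3)` and `(4, 4)` are dropped because `(2, 2)` beats them on both objectives, while the three remaining candidates each represent a different trade-off.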