This paper describes a variety of non-parametric Bayesian models of word segmentation based on Adaptor Grammars that model different aspects of the input and incorporate different kinds of prior knowledge, and applies them to the Bantu language Sesotho. While we find overall word segmentation accuracies lower than these models achieve on English, we also find some interesting differences in which factors contribute to better word segmentation. Specifically, we find little improvement in word segmentation accuracy when we model contextual dependencies, while modeling morphological structure does improve segmentation accuracy.
{"title":"Unsupervised Word Segmentation for Sesotho Using Adaptor Grammars","authors":"Mark Johnson","doi":"10.3115/1626324.1626328","DOIUrl":"https://doi.org/10.3115/1626324.1626328","url":null,"abstract":"This paper describes a variety of non-parametric Bayesian models of word segmentation based on Adaptor Grammars that model different aspects of the input and incorporate different kinds of prior knowledge, and applies them to the Bantu language Sesotho. While we find overall word segmentation accuracies lower than these models achieve on English, we also find some interesting differences in which factors contribute to better word segmentation. Specifically, we found little improvement to word segmentation accuracy when we modeled contextual dependencies, while modeling morphological structure did improve segmentation accuracy.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130084804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Christian Monson, A. Lavie, J. Carbonell, Lori S. Levin
This paper describes and evaluates a modification to the segmentation model used in the unsupervised morphology induction system ParaMor. Our improved segmentation model permits multiple morpheme boundaries in a single word. To prepare ParaMor to apply the new agglutinative segmentation model effectively, two heuristics improve ParaMor's precision. These precision-enhancing heuristics are adaptations of those used in other unsupervised morphology induction systems, including work by Hafer and Weiss (1974) and Goldsmith (2006). By reformulating the segmentation model used in ParaMor, we significantly improve ParaMor's performance in all language tracks, in both the linguistic evaluation and the task-based information retrieval (IR) evaluation of the peer-operated competition Morpho Challenge 2007. ParaMor's improved morpheme recall in the linguistic evaluations of German, Finnish, and Turkish is higher than that of any system that competed in the Challenge. In the three languages of the IR evaluation, our enhanced ParaMor significantly outperforms a morphologically naive baseline in average precision over newswire queries, scoring just behind the leading system from Morpho Challenge 2007 in English and ahead of the first-place system in German.
{"title":"Evaluating an Agglutinative Segmentation Model for ParaMor","authors":"Christian Monson, A. Lavie, J. Carbonell, Lori S. Levin","doi":"10.3115/1626324.1626332","DOIUrl":"https://doi.org/10.3115/1626324.1626332","url":null,"abstract":"This paper describes and evaluates a modification to the segmentation model used in the unsupervised morphology induction system, ParaMor. Our improved segmentation model permits multiple morpheme boundaries in a single word. To prepare ParaMor to effectively apply the new agglutinative segmentation model, two heuristics improve ParaMor's precision. These precision-enhancing heuristics are adaptations of those used in other unsupervised morphology induction systems, including work by Hafer and Weiss (1974) and Goldsmith (2006). By reformulating the segmentation model used in ParaMor, we significantly improve ParaMor's performance in all language tracks and in both the linguistic evaluation as well as in the task based information retrieval (IR) evaluation of the peer operated competition Morpho Challenge 2007. ParaMor's improved morpheme recall in the linguistic evaluations of German, Finnish, and Turkish is higher than that of any system which competed in the Challenge. In the three languages of the IR evaluation, our enhanced ParaMor significantly outperforms, at average precision over newswire queries, a morphologically naive baseline; scoring just behind the leading system from Morpho Challenge 2007 in English and ahead of the first place system in German.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127872526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a stochastic approach to learning phonology. The model presented captures 7–15% more phonologically plausible underlying forms than a simple majority solution, because it prefers "pure" alternations. It could be useful in cases where an approximate solution is needed, or as a seed for more complex models. A similar process could be involved in some stages of child language acquisition, in particular the early learning of phonotactics.
{"title":"A Bayesian Model of Natural Language Phonology: Generating Alternations from Underlying Forms","authors":"David Ellis","doi":"10.3115/1626324.1626327","DOIUrl":"https://doi.org/10.3115/1626324.1626327","url":null,"abstract":"A stochastic approach to learning phonology. The model presented captures 7--15% more phonologically plausible underlying forms than a simple majority solution, because it prefers \"pure\" alternations. It could be useful in cases where an approximate solution is needed, or as a seed for more complex models. A similar process could be involved in some stages of child language acquisition; in particular, early learning of phonotactics.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132461206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Two analyses of Maori passives and gerunds have been debated in the literature. Both assume that the thematic consonants in these forms are unpredictable. This paper reports on three computational experiments designed to test whether this assumption is sound. The results suggest that thematic consonants are predictable from the phonotactic probabilities of their active counterparts. This study has potential implications for allomorphy in other Polynesian languages. It also exemplifies the benefits of using computational methods in linguistic analyses.
{"title":"Phonotactic Probability and the Maori Passive: A Computational Approach","authors":"Oiwi Parker Jones","doi":"10.3115/1626324.1626331","DOIUrl":"https://doi.org/10.3115/1626324.1626331","url":null,"abstract":"Two analyses of Maori passives and gerunds have been debated in the literature. Both assume that the thematic consonants in these forms are unpredictable. This paper reports on three computational experiments designed to test whether this assumption is sound. The results suggest that thematic consonants are predictable from the phonotactic probabilities of their active counterparts. This study has potential implications for allomorphy in other Polynesian languages. It also exemplifies the benefits of using computational methods in linguistic analyses.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131557409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper is an analysis of the claim that a universal ban on certain ('anti-markedness') grammars is necessary in order to explain their non-occurrence in the languages of the world. To assess the validity of this hypothesis I examine the implications of one sound change (a > ə) for learning in a specific phonological domain (stress assignment), making explicit assumptions about the type of data that results, and the learning function that computes over that data. The preliminary conclusion is that restrictions on possible end-point languages are unneeded, and that the most likely outcome of change is a lexicon that is inconsistent with respect to a single generating rule.
{"title":"Bayesian Learning over Conflicting Data: Predictions for Language Change","authors":"Rebecca L Morley","doi":"10.3115/1626324.1626326","DOIUrl":"https://doi.org/10.3115/1626324.1626326","url":null,"abstract":"This paper is an analysis of the claim that a universal ban on certain ('anti-markedness') grammars is necessary in order to explain their non-occurrence in the languages of the world. To assess the validity of this hypothesis I examine the implications of one sound change (a > ə) for learning in a specific phonological domain (stress assignment), making explicit assumptions about the type of data that results, and the learning function that computes over that data. The preliminary conclusion is that restrictions on possible end-point languages are unneeded, and that the most likely outcome of change is a lexicon that is inconsistent with respect to a single generating rule.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114750570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Phylogenetic analyses of languages need to explicitly address whether the languages under consideration are related to each other at all. Recently developed permutation tests allow this question to be explored by testing whether words in one set of languages are significantly more similar to those in another set of languages when paired up by semantics than when paired up at random. Seven different phonetic similarity metrics are implemented and evaluated on their effectiveness within such multilateral comparison systems when deployed to detect genetic relations among the Indo-European and Uralic language families.
{"title":"Word Similarity Metrics and Multilateral Comparison","authors":"Brett Kessler","doi":"10.3115/1626516.1626518","DOIUrl":"https://doi.org/10.3115/1626516.1626518","url":null,"abstract":"Phylogenetic analyses of languages need to explicitly address whether the languages under consideration are related to each other at all. Recently developed permutation tests allow this question to be explored by testing whether words in one set of languages are significantly more similar to those in another set of languages when paired up by semantics than when paired up at random. Seven different phonetic similarity metrics are implemented and evaluated on their effectiveness within such multilateral comparison systems when deployed to detect genetic relations among the Indo-European and Uralic language families.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130103642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
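The permutation test described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the toy `initial_mismatch` metric (0 if two words share their first segment, 1 otherwise), and the parameters `n_perm` and `seed` are all assumptions made for the example. The idea is to compare the total distance between semantically paired words with the totals obtained when the pairing is shuffled at random.

```python
import random


def initial_mismatch(a, b):
    """Toy similarity metric (an assumption): 0 if first segments match, else 1."""
    return 0 if a[:1] == b[:1] else 1


def permutation_p_value(words_a, words_b, dist=initial_mismatch,
                        n_perm=1000, seed=0):
    """Estimate how often random pairings are at least as similar as the
    semantic pairing. words_a[i] and words_b[i] are assumed to share a meaning."""
    if len(words_a) != len(words_b):
        raise ValueError("word lists must be aligned by meaning")
    rng = random.Random(seed)
    # Total distance under the true, meaning-based pairing.
    observed = sum(dist(a, b) for a, b in zip(words_a, words_b))
    shuffled = list(words_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the semantic pairing
        total = sum(dist(a, b) for a, b in zip(words_a, shuffled))
        if total <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

A small p-value suggests the semantically paired words are more similar than chance pairing would produce; any of the phonetic similarity metrics under evaluation could be passed in as `dist` in place of the toy metric.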
Pair Hidden Markov Models (PairHMMs) are trained to align the pronunciation transcriptions of a large contemporary collection of Dutch dialect material, the Goeman-Taeldeman-Van Reenen-Project (GTRP, collected 1980–1995). We focus on the question of how to incorporate information about sound segment distances to improve sequence distance measures for use in dialect comparison. PairHMMs induce segment distances via expectation maximisation (EM). Our analysis uses a phonologically comparable subset of 562 items for all 424 localities in the Netherlands. We evaluate the work first via comparison to analyses obtained using the Levenshtein distance on the same dataset and second, by comparing the quality of the induced vowel distances to acoustic differences.
{"title":"Inducing Sound Segment Differences Using Pair Hidden Markov Models","authors":"Martijn Wieling, Therese Leinonen, J. Nerbonne","doi":"10.3115/1626516.1626523","DOIUrl":"https://doi.org/10.3115/1626516.1626523","url":null,"abstract":"Pair Hidden Markov Models (PairHMMs) are trained to align the pronunciation transcriptions of a large contemporary collection of Dutch dialect material, the Goeman-Taeldeman-Van Reenen-Project (GTRP, collected 1980--1995). We focus on the question of how to incorporate information about sound segment distances to improve sequence distance measures for use in dialect comparison. PairHMMs induce segment distances via expectation maximisation (EM). Our analysis uses a phonologically comparable subset of 562 items for all 424 localities in the Netherlands. We evaluate the work first via comparison to analyses obtained using the Levenshtein distance on the same dataset and second, by comparing the quality of the induced vowel distances to acoustic differences.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124476235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
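For the Levenshtein baseline named in the abstract above, a minimal sketch may help show where segment distances could enter a sequence distance. This is the standard dynamic-programming edit distance, not the PairHMM of the paper; the `sub_cost` parameter is a hypothetical hook, added here only to illustrate how graded segment distances (such as induced ones) might replace the uniform substitution cost.

```python
def levenshtein(a, b, sub_cost=lambda x, y: 0 if x == y else 1):
    """Edit distance between two segment sequences (strings or lists).

    Insertions and deletions cost 1; substitutions cost sub_cost(x, y).
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty sequence
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                   # delete x
                cur[j - 1] + 1,                # insert y
                prev[j - 1] + sub_cost(x, y),  # substitute x with y (or match)
            ))
        prev = cur
    return prev[-1]
```

With the default cost every substitution counts as 1; passing a function that returns smaller values for phonetically close segments makes the measure sensitive to segment similarity.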
Quantitative measurement of inter-language distance is a useful technique for studying diachronic and synchronic relations between languages. Such measures have been used successfully for purposes like deriving language taxonomies and language reconstruction, but they have mostly been applied to handcrafted word lists. Can we instead use corpus-based measures for the comparative study of languages? In this paper we try to answer this question. We use three corpus-based measures, present the results obtained from them, and show how these results relate to linguistic and historical knowledge. We argue that the answer is yes and that such studies can provide or validate linguistic and computational insights.
{"title":"Can Corpus Based Measures be Used for Comparative Study of Languages?","authors":"Anil Kumar Singh, H. Surana","doi":"10.3115/1626516.1626522","DOIUrl":"https://doi.org/10.3115/1626516.1626522","url":null,"abstract":"Quantitative measurement of inter-language distance is a useful technique for studying diachronic and synchronic relations between languages. Such measures have been used successfully for purposes like deriving language taxonomies and language reconstruction, but they have mostly been applied to handcrafted word lists. Can we instead use corpus based measures for comparative study of languages? In this paper we try to answer this question. We use three corpus based measures and present the results obtained from them and show how these results relate to linguistic and historical knowledge. We argue that the answer is yes and that such studies can provide or validate linguistic and computational insights.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128094386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper describes the development and use of an interface for visually evaluating distance measures. The combination of multidimensional scaling plots, histograms and tables allows for different stages of overview and detail. The interdisciplinary project "Rule-based search in text databases with nonstandard orthography" develops a fuzzy full-text search engine and uses distance measures for historical text document retrieval. This engine should provide easier text access for experts as well as interested amateurs.
{"title":"Visualizing the Evaluation of Distance Measures","authors":"T. Pilz, Axel Philipsenburg, W. Luther","doi":"10.3115/1626516.1626527","DOIUrl":"https://doi.org/10.3115/1626516.1626527","url":null,"abstract":"This paper describes the development and use of an interface for visually evaluating distance measures. The combination of multidimensional scaling plots, histograms and tables allows for different stages of overview and detail. The interdisciplinary project Rule-based search in text databases with nonstandard orthography develops a fuzzy full text search engine and uses distance measures for historical text document retrieval. This engine should provide easier text access for experts as well as interested amateurs.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114075386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a Bayesian approach to comparing languages: identifying cognates and the regular correspondences that compose them. A simple model of language is extended to include these notions in an account of parent languages. An expression is developed for the posterior probability of child language forms given a parent language. Bayes' Theorem offers a schema for evaluating choices of cognates and correspondences to explain semantically matched data. An implementation optimising this value with gradient descent is shown to distinguish cognates from non-cognates in data from Polish and Russian.
{"title":"Bayesian Identification of Cognates and Correspondences","authors":"T. M. Ellison","doi":"10.3115/1626516.1626519","DOIUrl":"https://doi.org/10.3115/1626516.1626519","url":null,"abstract":"This paper presents a Bayesian approach to comparing languages: identifying cognates and the regular correspondences that compose them. A simple model of language is extended to include these notions in an account of parent languages. An expression is developed for the posterior probability of child language forms given a parent language. Bayes' Theorem offers a schema for evaluating choices of cognates and correspondences to explain semantically matched data. An implementation optimising this value with gradient descent is shown to distinguish cognates from non-cognates in data from Polish and Russian.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123031061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
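The schema that the abstract above attributes to Bayes' Theorem can be written out in general form. The symbols are assumptions chosen for illustration and are not taken from the paper: $H$ stands for a choice of cognates and correspondences, and $D$ for the semantically matched data.

```latex
P(H \mid D) \;=\; \frac{P(D \mid H)\, P(H)}{P(D)}
\qquad\Longrightarrow\qquad
\hat{H} \;=\; \arg\max_{H}\; P(D \mid H)\, P(H)
```

Since $P(D)$ does not depend on $H$, comparing candidate analyses only requires the product of likelihood and prior, which is the kind of quantity an optimiser would maximise.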