结合聚类和精确语法从行间有光泽文本推断形态策略

Olga Zamaraeva
{"title":"结合聚类和精确语法从行间有光泽文本推断形态策略","authors":"Olga Zamaraeva","doi":"10.18653/v1/W16-2021","DOIUrl":null,"url":null,"abstract":"In this paper I present a k-means clustering approach to inferring morphological position classes (morphotactics) from Interlinear Glossed Text (IGT), data collections available for some endangered and low-resource languages. While the experiment is not restricted to low-resource languages, they are meant to be the targeted domain. Specifically my approach is meant to be for field linguists who do not necessarily know how many position classes there are in the language they work with and what the position classes are, but have the expertise to evaluate different hypotheses. It builds on an existing approach (Wax, 2014), but replaces the core heuristic with a clustering algorithm. The results mainly illustrate two points. First, they are largely negative, which shows that the baseline algorithm (summarized in the paper) uses a very predictive feature to determine whether affixes belong to the same position class, namely edge overlap in the affix graph. At the same time, unlike the baseline method that relies entirely on a single feature, kmeans clustering can account for different features and helps discover more morphological phenomena, e.g. circumfixation. I conclude that unsupervised learning algorithms such as k-means clustering can in principle be used for morphotactics inference, though the algorithm should probably weigh certain features more than others. Most importantly, I conclude that clustering is a promising approach for diverse morphotactics and as such it can facilitate linguistic analysis of field languages.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"65 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":"{\"title\":\"Inferring Morphotactics from Interlinear Glossed Text: Combining Clustering and Precision Grammars\",\"authors\":\"Olga Zamaraeva\",\"doi\":\"10.18653/v1/W16-2021\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper I present a k-means clustering approach to inferring morphological position classes (morphotactics) from Interlinear Glossed Text (IGT), data collections available for some endangered and low-resource languages. While the experiment is not restricted to low-resource languages, they are meant to be the targeted domain. Specifically my approach is meant to be for field linguists who do not necessarily know how many position classes there are in the language they work with and what the position classes are, but have the expertise to evaluate different hypotheses. It builds on an existing approach (Wax, 2014), but replaces the core heuristic with a clustering algorithm. The results mainly illustrate two points. First, they are largely negative, which shows that the baseline algorithm (summarized in the paper) uses a very predictive feature to determine whether affixes belong to the same position class, namely edge overlap in the affix graph. At the same time, unlike the baseline method that relies entirely on a single feature, kmeans clustering can account for different features and helps discover more morphological phenomena, e.g. circumfixation. I conclude that unsupervised learning algorithms such as k-means clustering can in principle be used for morphotactics inference, though the algorithm should probably weigh certain features more than others. Most importantly, I conclude that clustering is a promising approach for diverse morphotactics and as such it can facilitate linguistic analysis of field languages.\",\"PeriodicalId\":186158,\"journal\":{\"name\":\"Special Interest Group on Computational Morphology and Phonology Workshop\",\"volume\":\"65 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"10\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Special Interest Group on Computational Morphology and Phonology Workshop\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18653/v1/W16-2021\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Special Interest Group on Computational Morphology and Phonology Workshop","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/W16-2021","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10

摘要

在本文中,我提出了一种k-means聚类方法,从一些濒危和低资源语言的数据收集中推断出线间光滑文本(IGT)的形态位置类(morphotactics)。虽然实验并不局限于低资源语言,但它们是目标领域。具体来说,我的方法是为现场语言学家准备的,他们不一定知道他们使用的语言中有多少个位置类,也不知道位置类是什么,但他们有专业知识来评估不同的假设。它建立在现有方法(Wax, 2014)的基础上,但用聚类算法取代了核心启发式。研究结果主要说明了两点。首先,它们在很大程度上是负的,这表明基线算法(本文总结)使用了一个非常预测性的特征来确定词缀是否属于同一位置类,即词缀图中的边缘重叠。同时,与完全依赖单个特征的基线方法不同,kmeans聚类可以考虑不同的特征,并有助于发现更多的形态现象,例如圆周固定。我的结论是,像k-means聚类这样的无监督学习算法原则上可以用于形态策略推断,尽管算法可能会比其他算法更看重某些特征。最重要的是,我得出结论,聚类是一种很有前途的方法,可以用于多种形态策略,因此它可以促进对领域语言的语言分析。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Inferring Morphotactics from Interlinear Glossed Text: Combining Clustering and Precision Grammars
In this paper I present a k-means clustering approach to inferring morphological position classes (morphotactics) from Interlinear Glossed Text (IGT), data collections available for some endangered and low-resource languages. While the experiment is not restricted to low-resource languages, they are meant to be the targeted domain. Specifically my approach is meant to be for field linguists who do not necessarily know how many position classes there are in the language they work with and what the position classes are, but have the expertise to evaluate different hypotheses. It builds on an existing approach (Wax, 2014), but replaces the core heuristic with a clustering algorithm. The results mainly illustrate two points. First, they are largely negative, which shows that the baseline algorithm (summarized in the paper) uses a very predictive feature to determine whether affixes belong to the same position class, namely edge overlap in the affix graph. At the same time, unlike the baseline method that relies entirely on a single feature, kmeans clustering can account for different features and helps discover more morphological phenomena, e.g. circumfixation. I conclude that unsupervised learning algorithms such as k-means clustering can in principle be used for morphotactics inference, though the algorithm should probably weigh certain features more than others. Most importantly, I conclude that clustering is a promising approach for diverse morphotactics and as such it can facilitate linguistic analysis of field languages.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Colexifications for Bootstrapping Cross-lingual Datasets: The Case of Phonology, Concreteness, and Affectiveness KU-CST at the SIGMORPHON 2020 Task 2 on Unsupervised Morphological Paradigm Completion Linguist vs. Machine: Rapid Development of Finite-State Morphological Grammars Exploring Neural Architectures And Techniques For Typologically Diverse Morphological Inflection SIGMORPHON 2020 Task 0 System Description: ETH Zürich Team
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1