
Special Interest Group on Computational Morphology and Phonology Workshop: Latest Papers

Dynamic Correspondences: An Object-Oriented Approach to Tracking Sound Reconstructions
Pub Date : 2007-06-28 DOI: 10.3115/1626516.1626532
Tyler Peterson, Gessiane Picanco
This paper reports the results of a research project that experiments with cross-tabulation as an aid to phonemic reconstruction. Data from the Tupi stock were used, and three tests were conducted to determine the efficacy of this application: confirming and challenging a previously established reconstruction in the family; testing a new reconstruction generated by our model; and testing the upper limit of simultaneous, multiple correspondences across several languages. Our conclusion is that the use of cross-tabulations (implemented within a database as pivot tables) offers an innovative and effective tool in comparative study and sound reconstruction.
Citations: 6
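The cross-tabulation idea in the abstract above can be illustrated with a small sketch. The language names, segments, and the helper functions `cross_tabulate` and `format_table` are invented for illustration and are not taken from the paper, which implements its pivot tables inside a database; this version only counts how often a segment in one language lines up with a segment in another.

```python
from collections import Counter

# Hypothetical aligned segments: one dict per position in a cognate set,
# mapping a language name to the segment it shows. All data are invented.
cognate_rows = [
    {"LangA": "p", "LangB": "p", "LangC": "b"},
    {"LangA": "p", "LangB": "p", "LangC": "b"},
    {"LangA": "t", "LangB": "t", "LangC": "d"},
    {"LangA": "p", "LangB": "h", "LangC": "b"},
]


def cross_tabulate(rows, row_lang, col_lang):
    """Count how often each segment of row_lang aligns with each segment of col_lang."""
    counts = Counter()
    for row in rows:
        if row_lang in row and col_lang in row:
            counts[(row[row_lang], row[col_lang])] += 1
    return counts


def format_table(counts):
    """Render the counts as a tab-separated pivot table (rows x columns)."""
    row_keys = sorted({r for r, _ in counts})
    col_keys = sorted({c for _, c in counts})
    lines = ["\t" + "\t".join(col_keys)]
    for r in row_keys:
        cells = (str(counts[(r, c)]) for c in col_keys)
        lines.append(r + "\t" + "\t".join(cells))
    return "\n".join(lines)


table = cross_tabulate(cognate_rows, "LangA", "LangB")
print(format_table(table))
```

A cell with a high count suggests a regular correspondence; a low-count cell in the same row, such as LangA p aligning with LangB h here, is the kind of exception a researcher would want to inspect.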
A Combined Phonetic-Phonological Approach to Estimating Cross-Language Phoneme Similarity in an ASR Environment
Pub Date : 2006-06-08 DOI: 10.3115/1622165.1622166
L. Melnar, Chen Liu
This paper presents a fully automated linguistic approach to measuring distance between phonemes across languages. In this approach, a phoneme is represented by a feature matrix where feature categories are fixed, hierarchically related and binary-valued; feature categorization explicitly addresses allophonic variation and feature values are weighted based on their relative prominence derived from lexical frequency measurements. The relative weight of feature values is factored into phonetic distance calculation. Two phonological distances are statistically derived from lexical frequency measurements. The phonetic distance is combined with the phonological distances to produce a single metric that quantifies cross-language phoneme distance. The performances of target-language phoneme HMMs constructed solely with source language HMMs, first selected by the combined phonetic and phonological metric and then by a data-driven, acoustics distance-based method, are compared in context-independent automatic speech recognition (ASR) experiments. Results show that this approach consistently performs equivalently to the acoustics-based approach, confirming its effectiveness in estimating cross-language similarity between phonemes in an ASR environment.
Citations: 4
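As a rough illustration of the phonetic-distance component described above, the sketch below computes a weighted distance over binary feature vectors. The feature set, the integer weights, and the phoneme entries are all invented placeholders; the paper derives its weights from lexical frequency measurements and uses hierarchically related feature categories, neither of which is modelled here. How the phonetic distance is combined with the two phonological distances is not stated in the abstract, so that step is left out.

```python
# Hypothetical feature inventory and weights (placeholders, not from the paper).
FEATURES = ["voiced", "nasal", "labial", "continuant"]
WEIGHTS = {"voiced": 4, "nasal": 3, "labial": 2, "continuant": 1}

# Hypothetical binary feature matrices for a few phonemes.
PHONEMES = {
    "p": {"voiced": 0, "nasal": 0, "labial": 1, "continuant": 0},
    "b": {"voiced": 1, "nasal": 0, "labial": 1, "continuant": 0},
    "m": {"voiced": 1, "nasal": 1, "labial": 1, "continuant": 0},
    "s": {"voiced": 0, "nasal": 0, "labial": 0, "continuant": 1},
}


def phonetic_distance(a, b):
    """Sum the weights of every feature on which phonemes a and b disagree."""
    fa, fb = PHONEMES[a], PHONEMES[b]
    return sum(WEIGHTS[f] for f in FEATURES if fa[f] != fb[f])


for pair in [("p", "b"), ("p", "m"), ("p", "s")]:
    print(pair, phonetic_distance(*pair))
```

With these placeholder weights, a voicing mismatch costs more than a place mismatch, which is the general effect of weighting feature values by prominence.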
Richness of the Base and Probabilistic Unsupervised Learning in Optimality Theory
Pub Date : 2006-06-08 DOI: 10.3115/1622165.1622172
G. Jarosz
This paper proposes an unsupervised learning algorithm for Optimality Theoretic grammars, which learns a complete constraint ranking and a lexicon given only unstructured surface forms and morphological relations. The learning algorithm, which is based on the Expectation-Maximization algorithm, gradually maximizes the likelihood of the observed forms by adjusting the parameters of a probabilistic constraint grammar and a probabilistic lexicon. The paper presents the algorithm's results on three constructed language systems with different types of hidden structure: voicing neutralization, stress, and abstract vowels. In all cases the algorithm learns the correct constraint ranking and lexicon. The paper argues that the algorithm's ability to identify correct, restrictive grammars is due in part to its explicit reliance on the Optimality Theoretic notion of Richness of the Base.
Citations: 24
Learning Probabilistic Paradigms for Morphology in a Latent Class Model
Pub Date : 2006-06-08 DOI: 10.3115/1622165.1622174
Erwin Chan
This paper introduces the probabilistic paradigm, a probabilistic, declarative model of morphological structure. We describe an algorithm that recursively applies Latent Dirichlet Allocation with an orthogonality constraint to discover morphological paradigms as the latent classes within a suffix-stem matrix. We apply the algorithm to data preprocessed in several different ways, and show that when suffixes are distinguished for part of speech and allomorphs or gender/conjugational variants are merged, the model is able to correctly learn morphological paradigms for English and Spanish. We compare our system with Linguistica (Goldsmith 2001), and discuss the advantages of the probabilistic paradigm over Linguistica's signature representation.
Citations: 32
Learning Quantity Insensitive Stress Systems via Local Inference
Pub Date : 2006-06-08 DOI: 10.3115/1622165.1622168
Jeffrey Heinz
This paper presents an unsupervised batch learner for the quantity-insensitive stress systems described in Gordon (2002). Unlike previous stress learning models, the learner presented here is neither cue-based (Dresher and Kaye, 1990) nor reliant on a priori Optimality-theoretic constraints (Tesar, 1998). Instead, our learner exploits a property called neighborhood-distinctness, which is shared by all of the target patterns. Some consequences of this approach include a natural explanation for the occurrence of binary and ternary rhythmic patterns, the lack of higher n-ary rhythms, and the fact that, in these systems, stress always falls within a certain window of word edges.
Citations: 12
Improved morpho-phonological sequence processing with constraint satisfaction inference
Pub Date : 2006-06-08 DOI: 10.3115/1622165.1622171
Antal van den Bosch, S. Canisius
In performing morpho-phonological sequence processing tasks, such as letter-phoneme conversion or morphological analysis, it is typically not enough to base the output sequence on local decisions that map local-context input windows to single output tokens. We present a global sequence-processing method that repairs inconsistent local decisions. The approach is based on local predictions of overlapping trigrams of output tokens, which open up a space of possible sequences; a data-driven constraint satisfaction inference step then searches for the optimal output sequence. We demonstrate significant improvements in terms of word accuracy on English and Dutch letter-phoneme conversion and morphological segmentation, and we provide qualitative analyses of error types prevented by the constraint satisfaction inference method.
Citations: 19
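The overlapping-trigram idea above can be sketched in a few lines. This is a toy stand-in, not the authors' method: it assumes each input position already comes with one predicted output trigram (previous, current, next token), and it replaces the paper's data-driven constraint satisfaction step with an exhaustive search that maximizes plain agreement counts. The names `PAD`, `candidates`, `score`, and `best_sequence`, and the example predictions, are hypothetical.

```python
from itertools import product

PAD = "_"  # assumed boundary symbol for positions outside the sequence


def candidates(predictions):
    """Collect, for each output position, every token some trigram proposes for it."""
    n = len(predictions)
    slots = [set() for _ in range(n)]
    for i, (left, mid, right) in enumerate(predictions):
        if i > 0:
            slots[i - 1].add(left)
        slots[i].add(mid)
        if i < n - 1:
            slots[i + 1].add(right)
    return [sorted(s) for s in slots]


def score(sequence, predictions):
    """Count how many trigram slots agree with the candidate output sequence."""
    padded = [PAD] + list(sequence) + [PAD]
    total = 0
    for i, trigram in enumerate(predictions):
        window = padded[i:i + 3]
        total += sum(1 for got, want in zip(window, trigram) if got == want)
    return total


def best_sequence(predictions):
    """Search all combinations of candidates (exponential; fine for toy input only)."""
    return max(product(*candidates(predictions)),
               key=lambda seq: score(seq, predictions))


# Position 2 wrongly predicts "x" as its left neighbour; the overlap repairs it.
predicted = [("_", "a", "b"), ("a", "b", "c"), ("x", "c", "d"), ("c", "d", "_")]
print("".join(best_sequence(predicted)))
```

The point of the example is that a single inconsistent local prediction is outvoted by its neighbours, which is the repair effect the abstract describes. A realistic system would weight the constraints and use a proper search procedure instead of enumerating every combination.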
A Naive Theory of Affixation and an Algorithm for Extraction
Pub Date : 2006-06-08 DOI: 10.3115/1622165.1622175
H. Hammarström
We present a novel approach to the unsupervised detection of affixes, that is, to extracting a set of salient prefixes and suffixes from an unlabeled corpus of a language. The underlying theory makes no assumptions about whether the language uses a lot of morphology, whether it is prefixing or suffixing, or whether affixes are long or short. It does, however, assume that (1) salient affixes have to be frequent, i.e., occur much more often than random segments of the same length, and (2) words are essentially variable-length sequences of random characters, e.g., a character should not occur in far more words than chance would predict without a reason, such as being part of a very frequent affix. The affix extraction algorithm uses only information from the fluctuation of frequencies, runs in linear time, and is free from thresholds and opaque iterations. We demonstrate the usefulness of the approach with example case studies on typologically distant languages.
Citations: 16
Exploring variant definitions of pointer length in MDL
Pub Date : 2006-06-08 DOI: 10.3115/1622165.1622170
Aris Xanthos, Yu Hu, J. Goldsmith
Within the information-theoretic framework described by Rissanen (1989), de Marcken (1996), and Goldsmith (2001), pointers are used to avoid repetition of phonological material. Work with which we are familiar has assumed that there is only one way in which items could be pointed to. The purpose of this paper is to describe and compare several different methods, each of which satisfies MDL's basic requirements, but which have different consequences for the treatment of linguistic phenomena. In particular, we assess the conditions under which these different ways of pointing yield more compact descriptions of the data, from both a theoretical and an empirical perspective.
Citations: 5
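For readers unfamiliar with pointer lengths, the sketch below shows one commonly used definition, assumed here for illustration: the cost in bits of pointing to an item is the negative base-2 logarithm of its relative frequency. The abstract does not list the variant definitions the paper compares, so none of them are reproduced; the function name `pointer_lengths` and the toy token list are hypothetical.

```python
import math
from collections import Counter


def pointer_lengths(tokens):
    """Return, for each distinct item, -log2 of its relative frequency (in bits)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {item: -math.log2(count / total) for item, count in counts.items()}


# Toy list of suffix occurrences (invented data).
suffix_tokens = ["ing", "ing", "ed", "s"]
lengths = pointer_lengths(suffix_tokens)
for suffix, bits in sorted(lengths.items()):
    print(f"{suffix}: {bits:.1f} bits")
```

Under this definition frequent items get short pointers and rare items get long ones, so the total description length depends on what is being counted and over which set, which is where alternative definitions can diverge.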
Morphology Induction from Limited Noisy Data Using Approximate String Matching
Pub Date : 2006-06-08 DOI: 10.3115/1622165.1622173
Burcu Karagol-Ayan, D. Doermann, A. Weinberg
For a language with limited resources, a dictionary may be one of the few available electronic resources. To make effective use of the dictionary for translation, however, users must be able to access it using the root form of the morphologically deformed variants found in the text. Stemming and data-driven methods, however, are not suitable when data is sparse. We present algorithms for discovering morphemes from limited, noisy data obtained by scanning a hard-copy dictionary. Our approach is based on a novel application of the longest common substring and string edit distance metrics. Results show that these algorithms can in fact segment words into roots and affixes from the limited data contained in a dictionary, and extract affixes. This in turn allows non-native speakers to perform multilingual tasks in applications where response must be rapid and their knowledge is limited. In addition, this analysis can feed other NLP tools requiring lexicons.
Citations: 5
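The two string metrics named in the abstract above are standard, and a minimal sketch of both follows. These are textbook dynamic-programming versions, not the authors' implementation, and the abstract does not say how the paper combines them to segment words; the English word pair is only an illustration.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous substring shared by a and b (first one on ties)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


root = longest_common_substring("walking", "walked")
print(root, edit_distance("walking", "walked"))
```

Here the shared substring is a candidate root and the leftover material ("ing", "ed") is a candidate affix, while the edit distance gives a tolerance for noise of the kind introduced by scanning.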
Unsupervised Learning of Morphology for Building Lexicon for a Highly Inflectional Language
Pub Date : 2002-07-11 DOI: 10.3115/1118647.1118648
U. Sharma, J. Kalita, R. Das
Words play a crucial role in aspects of natural language understanding such as syntactic and semantic processing. Usually, a natural language understanding system either already knows the words that appear in the text, or is able to automatically learn relevant information about a word upon encountering it. A capable system, human or machine, usually knows a subset of the entire vocabulary of a language, along with morphological rules for determining the attributes of words not seen before. Developing a knowledge base of legal words and morphological rules is an important task in computational linguistics. In this paper, we describe initial experiments following an approach based on unsupervised learning of morphology from a text corpus developed especially for this purpose. It is a method for conveniently creating a dictionary and a morphology rule base, and is especially suitable for highly inflectional languages like Assamese. Assamese is a major Indian language of the Indic branch of the Indo-European family of languages. It is used by around 15 million people.
Citations: 28