A Rule-Based Method for Thai Elementary Discourse Unit Segmentation (TED-Seg)

2012 Seventh International Conference on Knowledge, Information and Creativity Support Systems. Pub Date: 2012-11-08. DOI: 10.1109/KICSS.2012.33

Nongnuch Ketui, T. Theeramunkong, C. Onsuwan


Citations: 13

Abstract

Discovering discourse units in Thai, a language written without explicit word or sentence boundaries, is not straightforward because of its high part-of-speech (POS) ambiguity and serial verb constituents. This paper introduces definitions of Thai elementary discourse units (T-EDUs), grammar rules for T-EDU segmentation, and a longest-matching-based chart parser. The T-EDU definitions are used to construct a set of context-free grammar (CFG) rules: 446 CFG rules are constructed from 1,340 T-EDUs extracted from Thai-NEST, an NE- and POS-tagged corpus. These T-EDUs were evaluated by two linguists, with a kappa score of 0.68. Separately, a two-level evaluation is applied: one level uses an arranged setting in which the text is pre-chunked, and the other uses a normal setting in which the original running text is tested. By specifying one grammar rule per T-EDU instance, perfect recall (100%) is achievable in a closed test, where the testing corpus and the training corpus are the same, but recall of approximately 36.16% and 31.69% is obtained for the chunked and running texts, respectively. For an open test with 3-fold cross-validation, recall is around 67% while precision is only 25-28%. To improve precision, two alternative strategies are applied: left-to-right longest matching (L2R-LM) and maximal longest matching (M-LM). The results show that L2R-LM and M-LM improve precision to 93.97% and 94.03%, respectively, for the running text in the closed test, while recall drops slightly to 94.18% and 92.91%. For the running text in the open test, the F-score improves to 57.70% and 54.14% for L2R-LM and M-LM, respectively.
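The left-to-right longest matching idea named in the abstract can be illustrated with a minimal Python sketch. This is not the paper's chart parser: it assumes, purely for illustration, that each rule is a flat sequence of POS tags rather than a CFG production, and the tag names, the toy rule set, and the function name `l2r_longest_match` are all hypothetical. At each position the sketch picks the longest rule that matches, emits that span as one unit, and continues after it.

```python
from typing import List, Set, Tuple

# A rule is a flat POS-tag sequence (an assumption; the paper uses CFG rules).
Rule = Tuple[str, ...]


def l2r_longest_match(tags: List[str], rules: Set[Rule]) -> List[Tuple[int, int]]:
    """Greedy left-to-right segmentation.

    Returns (start, end) spans with end exclusive. Tokens that no rule
    covers are skipped one at a time.
    """
    max_len = max((len(r) for r in rules), default=0)
    spans: List[Tuple[int, int]] = []
    i = 0
    while i < len(tags):
        matched = 0
        # Try the longest candidate first, so the first hit is the longest match.
        for n in range(min(max_len, len(tags) - i), 0, -1):
            if tuple(tags[i:i + n]) in rules:
                matched = n
                break
        if matched:
            spans.append((i, i + matched))
            i += matched
        else:
            i += 1  # no rule starts here; move on to the next token
    return spans


if __name__ == "__main__":
    # Toy rule set and tag sequence, invented for this example.
    toy_rules = {
        ("NOUN", "VERB"),
        ("NOUN", "VERB", "NOUN"),
        ("CONJ", "NOUN", "VERB"),
    }
    toy_tags = ["NOUN", "VERB", "NOUN", "CONJ", "NOUN", "VERB"]
    print(l2r_longest_match(toy_tags, toy_rules))  # [(0, 3), (3, 6)]
```

Preferring the longest match at each position means one span is emitted where several overlapping rules could fire. That is one plausible reading of why such a strategy would raise precision at a small cost in recall, as the abstract reports.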
