
Latest publications from the 2013 International Conference on Asian Language Processing

Tibetan Text Classification Based on the Feature of Position Weight
Pub Date : 2013-08-17 DOI: 10.1109/IALP.2013.63
Hui Cao, Huiqiang Jia
Building on a study of Tibetan script and grammar, this paper investigates term-weighting algorithms for Tibetan text categorization under the vector space model. Taking into account the position at which a term appears in a document, it proposes an improved TF-IDF weighting algorithm. In the process, the χ2 (CHI) statistic is used to select features from Tibetan word documents, and cosine similarity is used to distinguish similar Tibetan documents. Tibetan texts are then classified with a linearly separable support vector machine, and the standard TF-IDF algorithm is compared with the improved one on the Tibetan text classification task. The results show that the improved TF-IDF algorithm yields better classification performance.
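The abstract does not spell out the exact weighting formula; the sketch below illustrates the general idea of position-weighted TF-IDF, with hypothetical title/body multipliers standing in for the paper's position weights.

```python
import math
from collections import Counter

# Hypothetical position multipliers (not from the paper): terms in the
# title count more toward term frequency than terms in the body.
POSITION_WEIGHT = {"title": 3.0, "body": 1.0}

def position_weighted_tfidf(docs):
    """docs: list of {"title": [tokens], "body": [tokens]}.
    Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency over all tokens, regardless of position.
    df = Counter()
    for d in docs:
        df.update(set(d["title"]) | set(d["body"]))
    vectors = []
    for d in docs:
        tf = Counter()
        for pos, toks in d.items():
            for t in toks:
                tf[t] += POSITION_WEIGHT[pos]  # position-weighted term frequency
        vectors.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return vectors
```

Documents would then be compared with cosine similarity over these vectors, as in the paper.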
Citations: 12
Categorization and Identification of Fragments with Shi Plus Punctuation
Pub Date : 2013-08-17 DOI: 10.1109/IALP.2013.16
Guonian Wang, Lin He
Studies of Chinese sentences with shi (是) as the predicate have been highly fruitful from the perspectives of syntax, semantics and pragmatics. In practice, however, a large number of sentences use shi in other syntactic roles (adverb, conjunction, auxiliary and even interjection), and these stand as barriers to natural language processing (NLP) and machine translation (MT). The special fragments consisting of shi plus punctuation are divided into "shi plus comma" and "comma plus shi", which are examined and discussed using corpora, illustrations and comparison. Two exceptional fragment types are also described to improve the precision of computer identification of these shi-plus-punctuation fragments.
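As a rough illustration (not the authors' method), the two fragment types can be located with a simple pattern match over a sentence:

```python
import re

def classify_shi_fragments(text):
    """Label each occurrence of 是 adjacent to a Chinese comma as one of
    the two fragment types discussed in the paper."""
    labels = []
    for m in re.finditer("是,|,是", text):
        labels.append("shi+comma" if m.group().startswith("是") else "comma+shi")
    return labels
```

The paper's actual identification step then has to decide, per fragment, which syntactic role shi plays.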
Citations: 0
Dependency Parsing for Traditional Mongolian
Pub Date : 2013-08-17 DOI: 10.1109/IALP.2013.55
Xiangdong Su, Guanglai Gao, Xueliang Yan
Dependency parsing has become increasingly popular in natural language processing in recent years. Nevertheless, dependency parsing for Traditional Mongolian has not attracted much attention. We investigate it with a Maximum Spanning Tree (MST) based model on the Traditional Mongolian dependency treebank (TMDT). This paper briefly introduces Traditional Mongolian along with TMDT and discusses the details of the MST model. Emphasis is placed on performance comparisons among eight kinds of features and their combinations in order to find a suitable feature representation. Evaluation results show that the combination of basic unigram features, basic bigram features and C-C sibling features obtains the best performance. Our work establishes a baseline for dependency parsing of Traditional Mongolian.
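For intuition, MST-based parsing selects the highest-scoring head assignment that forms a tree rooted at an artificial ROOT node. The brute-force sketch below works only for tiny sentences (real parsers use the Chu-Liu/Edmonds algorithm); the score matrix in the test is made up.

```python
from itertools import product

def best_dependency_tree(score):
    """score[h][d]: arc score for head h -> dependent d; index 0 is ROOT.
    Exhaustively searches head assignments and keeps the highest-scoring
    one that forms a tree (no self-loops, every word reaches ROOT)."""
    n = len(score)  # number of tokens including ROOT
    best, best_heads = float("-inf"), None
    for heads in product(range(n), repeat=n - 1):  # heads[i] = head of word i+1
        if any(h == i + 1 for i, h in enumerate(heads)):
            continue  # self-loop
        ok = True
        for d in range(1, n):
            seen, cur = set(), d
            while cur != 0:      # follow head chain up to ROOT
                if cur in seen:  # cycle detected
                    ok = False
                    break
                seen.add(cur)
                cur = heads[cur - 1]
            if not ok:
                break
        if not ok:
            continue
        total = sum(score[h][d] for d, h in enumerate(heads, start=1))
        if total > best:
            best, best_heads = total, heads
    return best_heads, best
```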
Citations: 1
An Empirical Evaluation of Dimensionality Reduction Using Latent Semantic Analysis on Hindi Text
Pub Date : 2013-08-17 DOI: 10.1109/IALP.2013.11
Karthik Krishnamurthi, Ravi Kumar Sudi, Vijayapal Reddy Panuganti, Vishnu Vardhan Bulusu
Dimensionality reduction is the process of deriving an approximate representation of a dataset that reflects most of the correlations underlying it. In text processing, dimensionality reduction transforms a text into a compact representation that efficiently captures the main insights of the original. LSA (Latent Semantic Analysis) is a technique for finding correlations between words and sentences based on word usage within the text. This paper addresses dimensionality reduction for representing relevant data from Hindi text using LSA. An empirical evaluation is performed to determine the influence of language complexity and of various weighting schemes on dimensionality reduction. Results are reported using the standard measures of recall, precision and F-score.
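The core LSA operation is a truncated SVD of the term-document matrix; a minimal sketch (the matrix in the test is illustrative, not Hindi data):

```python
import numpy as np

def lsa_reduce(term_doc, k):
    """Reduce a term-document matrix to k latent dimensions via truncated
    SVD and return the rank-k approximation, the core step of LSA."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]
```

The paper's weighting schemes (e.g. TF-IDF variants) would be applied to `term_doc` before the decomposition.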
Citations: 4
Varying or Unvarying-Logarithmic Quotient Model of Vowel Formants
Pub Date : 2013-08-17 DOI: 10.1109/IALP.2013.71
Xuewen Zhou
This paper studies the relations of F1, F2 and F3 of vowels spoken at reading speed by three speakers of two languages (Yi and Mandarin Chinese). The results show that vowel formants keep a stable logarithmic-quotient relation (Z value: Z1 = log F2 / log F1, Z2 = log F3 / log F2). The ratio of standard deviation to average stays below 3% for most vowels, and the variation across speakers also stays below 3%. This paper argues that the logarithmic quotient is an ideal vowel-normalization model with potential applications in speech recognition and speech comparison.
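The Z values are straightforward to compute from measured formant frequencies (the frequencies in the test are illustrative, not the paper's data):

```python
import math

def log_quotients(f1, f2, f3):
    """Z1 = log F2 / log F1 and Z2 = log F3 / log F2,
    for formant frequencies in Hz."""
    return math.log(f2) / math.log(f1), math.log(f3) / math.log(f2)
```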
Citations: 0
Improving the Accuracy of Large Vocabulary Continuous Speech Recognizer Using Dependency Parse Tree and Chomsky Hierarchy in Lattice Rescoring
Pub Date : 2013-08-17 DOI: 10.1109/IALP.2013.53
Kai Sze Hong, T. Tan, E. Tang
This work describes our approach to using dependency parse tree information to derive useful hidden word statistics that improve a baseline Malay large-vocabulary automatic speech recognition system. Traditional approaches to language modeling are mainly based on Chomsky hierarchy type 3, which approximates natural language as a regular language and thereby ignores many of its characteristics. We attempt to overcome these limitations by extending the approach to consider Chomsky hierarchy types 1 and 2. We extract dependency-tree-based lexical information and incorporate it into the language model. A second-pass lattice rescoring is then performed to produce better hypotheses for the Malay large-vocabulary continuous speech recognition system. The absolute WER reductions were 2.2% and 3.8% on the MASS and MASS-NEWS corpora, respectively.
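Second-pass rescoring can be illustrated on an n-best list (a simplification of full lattice rescoring; the weight and scores below are made up, and the external language model is supplied as a callable):

```python
def rescore_nbest(hypotheses, lm_score, lm_weight=0.5):
    """hypotheses: list of (sentence, first_pass_score).
    Re-rank with a combined score that adds a weighted contribution
    from a stronger second-pass language model."""
    rescored = [(sent, sc + lm_weight * lm_score(sent)) for sent, sc in hypotheses]
    return max(rescored, key=lambda x: x[1])[0]
```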
Citations: 1
Research of the Modern Uyghur Data Analysis Technology
Pub Date : 2013-08-17 DOI: 10.1109/IALP.2013.39
Mengchen Pan, Xiangwei Qi, Weimin Pan
As society develops, languages evolve continually. To characterize word usage in modern Uyghur, this study applies modern Uyghur data analysis technology to the word-frequency lists of standard Uyghur textbooks for elementary and junior high school. The paper first introduces the theme types and theme sources of the corpus used. It then describes the algorithms of the modern Uyghur data analysis system, followed by the functions of the analysis software and the working principle of each module. Finally, using the textbook frequency lists as the test object, it validates the reliability and validity of the system's frequency-range analysis, coverage-rate analysis and text-number distribution functions. The experiments yielded the expected results, providing tools and techniques for further in-depth analysis of modern Uyghur.
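Coverage-rate analysis, one of the functions being validated, can be sketched as the share of running tokens accounted for by the most frequent word types (the tokens in the test are illustrative):

```python
from collections import Counter

def coverage_rate(tokens, top_k):
    """Fraction of all running tokens covered by the top_k most
    frequent word types."""
    counts = Counter(tokens)
    covered = sum(c for _, c in counts.most_common(top_k))
    return covered / len(tokens)
```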
Citations: 0
Research of Modern Uyghur Word Frequency Statistical Technology
Pub Date : 2013-08-17 DOI: 10.1109/IALP.2013.20
Azragul, Nianmei, Yasen Yimin
As society develops, languages evolve continually. The word is the smallest meaningful unit of language able to function independently; it is an important carrier of knowledge and the basic unit of operation in natural language processing systems. Uyghur word-frequency statistics technology is the process by which a computer automatically identifies term boundaries in texts, and it is the most important preprocessing step in information processing. However, no truly mature Uyghur word-frequency statistics system yet exists, which has become one of the bottlenecks seriously hampering the development of Uyghur information processing. This paper discusses the design and algorithms of a Uyghur word-frequency statistics system in detail, then introduces the functional design of the system, describes its methods and techniques, and finally reports test results.
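A minimal sketch of the statistics such a system reports, assuming already-tokenized input (the hard part, Uyghur term-boundary identification, is not shown):

```python
from collections import Counter

def frequency_stats(tokens):
    """Frequency list plus summary statistics a word-frequency system
    typically reports: type count, token count, and hapax legomena
    (word types seen only once)."""
    counts = Counter(tokens)
    return {
        "freq_list": counts.most_common(),
        "types": len(counts),
        "tokens": len(tokens),
        "hapaxes": sum(1 for c in counts.values() if c == 1),
    }
```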
Citations: 0
Recognizing Chinese Elementary Discourse Unit on Comma
Pub Date : 2013-08-17 DOI: 10.1109/IALP.2013.8
Shengqin Xu, Peifeng Li
Elementary discourse unit (EDU) recognition is the primary task of discourse analysis, and Chinese punctuation is viewed as a delimiter of elementary discourse units in Chinese. In this paper, we treat the Chinese comma as the boundary of discourse units and as the anchor of discourse relations between the units it separates. We divide the comma into seven major types based on syntactic patterns and propose three different machine learning methods to automatically disambiguate the type of a Chinese comma. Experimental results on Chinese Treebank 6.0 show that our method outperforms the baseline.
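A sketch of the candidate-generation step such a system starts from, splitting at Chinese commas (the syntactic features the paper's classifiers use are not specified here):

```python
def comma_candidates(sentence):
    """For each Chinese comma, emit (left_context, right_context): the
    candidate EDU pair a classifier would label as boundary / non-boundary."""
    parts = sentence.split(",")
    return [(",".join(parts[:i + 1]), ",".join(parts[i + 1:]))
            for i in range(len(parts) - 1)]
```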
Citations: 6
Findings and Considerations in Active Learning Based Framework for Resource-Poor SMT
Pub Date : 2013-08-17 DOI: 10.1109/IALP.2013.28
Jinhua Du, Meng Zhang
Active learning (AL) for resource-poor SMT is an efficient and feasible way to acquire high-quality parallel data and thereby improve translation quality. This paper first studies two mainstream sentence selection algorithms, Geom-phrase and Geom n-gram, and then proposes a sentence-perplexity-based selection method. Some important findings, such as the impact of sentence length on AL performance, are observed in comparison experiments conducted on Chinese-English NIST data. Accordingly, a preprocessing strategy is presented to filter the original monolingual corpus so as to obtain higher-information sentences. Experimental results on the preprocessed data show that the performance of all three selection algorithms is significantly improved compared to the results on the original data.
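Perplexity-based selection can be sketched with an add-one-smoothed unigram model standing in for a real language model (all counts in the test are illustrative; selecting high-perplexity sentences is one plausible criterion, not necessarily the paper's):

```python
import math
from collections import Counter

def unigram_perplexity(sentence, counts, total, vocab):
    """Per-word perplexity of a tokenized sentence under an
    add-one-smoothed unigram model."""
    logp = sum(math.log((counts[w] + 1) / (total + vocab)) for w in sentence)
    return math.exp(-logp / len(sentence))

def select_sentences(pool, counts, total, vocab, k):
    """Pick the k highest-perplexity sentences, i.e. those the current
    model knows least about, as AL candidates."""
    return sorted(pool,
                  key=lambda s: unigram_perplexity(s, counts, total, vocab),
                  reverse=True)[:k]
```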
Citations: 1