
Latest articles in Int. J. Comput. Linguistics Chin. Lang. Process.

Characteristics of Independent Claim: A Corpus-Linguistic Approach to Contemporary English Patents
Pub Date : 2011-12-01 DOI: 10.30019/IJCLCLP.201112.0005
D. Lin, Shelley Ching-Yu Hsieh
This paper presents a corpus-driven linguistic approach to embodiment in modern patent language as a contribution to the growing needs in intellectual property rights. While existing work appears to fill a niche in English for Specific Purposes (ESP), the present study suggests that a statistical retrieval approach is necessary for compiling a patent technical word list to expand learner vocabulary size. Since a significant percentage of technical vocabulary appears within the scope of the independent claim among claim lexis, this study examines the essential features of the independent claim and shows how it is characterized with respect to the linguistic specificity of patent style. It is further demonstrated that the proposed treatment of the term independent claim, as contained in the patent specification, is reliable for patent application at an international level. For example, clausal types that specify how clauses are used in U.S. patent documents under co-occurrence relations show potential for patent writing, while verb-noun collocations allow learners to grasp hidden semantic-prosodic associations. In short, the research content and statistical investigations of our approach highlight the pedagogical value of Patent English for ESP teachers, applied linguists, and the development of interdisciplinary research.
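The statistical retrieval approach to compiling a technical word list can be illustrated with a keyness measure. A minimal sketch, assuming Dunning's log-likelihood statistic over a target (patent) corpus and a general reference corpus; the toy corpora below are invented for illustration and are not the paper's data:

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Dunning's log-likelihood keyness for a word occurring a times in a
    target corpus of size c and b times in a reference corpus of size d."""
    e1 = c * (a + b) / (c + d)  # expected count in target
    e2 = d * (a + b) / (c + d)  # expected count in reference
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(target_tokens, reference_tokens, top_n=5):
    """Rank words by how strongly they are over-represented in the target corpus."""
    tf, rf = Counter(target_tokens), Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    scored = {w: log_likelihood(tf[w], rf.get(w, 0), c, d) for w in tf}
    return sorted(scored, key=scored.get, reverse=True)[:top_n]
```

Words like "claim", "comprising", and "wherein" would surface near the top when patent text is compared against general English, which is the intuition behind a statistically retrieved technical word list.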
Cited: 2
Some Chances and Challenges in Applying Language Technologies to Historical Studies in Chinese
Pub Date : 2011-06-01 DOI: 10.30019/IJCLCLP.201106.0003
Chao-Lin Liu, Guantao Jin, Qingfeng Liu, W. Chiu, Yih-Soong Yu
We report applications of language technology to analyzing historical documents in the Database for the Study of Modern Chinese Thoughts and Literature (DSMCTL). We studied two historical issues with the reported techniques: the conceptualization of "huaren" (Chinese people) and the attempt to institute constitutional monarchy in the late Qing dynasty. We also discuss research challenges for supporting sophisticated issues using our experience with DSMCTL, the Database of Government Officials of the Republic of China, and the Dream of the Red Chamber. Advanced techniques and tools for lexical, syntactic, semantic, and pragmatic processing of language information, along with more thorough data collection, are needed to strengthen the collaboration between historians and computer scientists.
Cited: 5
Performance Evaluation of Speaker-Identification Systems for Singing Voice Data
Pub Date : 2011-06-01 DOI: 10.30019/IJCLCLP.201106.0001
Wei-Ho Tsai, Hsin-Chieh Lee
Automatic speaker identification (SID) has long been an important research topic. It aims to identify who, among a set of enrolled persons, spoke a given utterance. This study extends the conventional SID problem to examine whether an SID system trained on speech data can identify the singing voices of the enrolled persons. Our experiments found that a standard SID system fails to identify most singing data, due to the significant differences between singing and speaking for a majority of people. In order for an SID system to handle both speech and singing data, we examine the feasibility of using a model-adaptation strategy to enhance the generalization of a standard SID system. Our experiments show that a majority of the singing clips can be correctly identified after adapting speech-derived voice models with some singing data.
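The model-adaptation idea can be sketched with a toy centroid-based speaker model, where adaptation interpolates a speech-trained model toward a few singing frames. This is a rough stand-in for MAP adaptation of GMM voice models, not the paper's actual system; the feature vectors are invented:

```python
import math

def centroid(frames):
    """Mean feature vector over a speaker's training frames."""
    dim = len(frames[0])
    return tuple(sum(f[d] for f in frames) / len(frames) for d in range(dim))

def identify(frame, models):
    """Return the enrolled speaker whose voice model is closest to the frame."""
    return min(models, key=lambda spk: math.dist(frame, models[spk]))

def adapt(model, new_frames, alpha=0.5):
    """Shift a speech-trained model toward new-domain (singing) data."""
    new_c = centroid(new_frames)
    return tuple((1 - alpha) * m + alpha * n for m, n in zip(model, new_c))
```

In the sketch, a singing frame that has drifted away from the speaker's speech region is misidentified until the speaker's model is adapted with a little singing data, mirroring the paper's finding.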
Cited: 0
Using Linguistic Features to Predict Readability of Short Essays for Senior High School Students in Taiwan
Pub Date : 2010-09-01 DOI: 10.30019/IJCLCLP.201009.0003
Wei-Ti Kuo, Chao-Shainn Huang, Chao-Lin Liu
We investigated the problem of classifying short essays used in comprehension tests for senior high school students in Taiwan. The tests were for first and second year students, so the answers included only four categories, each for one semester of the first two years. A random-guess approach would achieve only 25% in accuracy for our problem. We analyzed three publicly available scores for readability, but did not find them directly applicable. By considering a wide array of features at the levels of word, sentence, and essay, we gradually improved the F measure achieved by our classifiers from 0.381 to 0.536.
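The word- and sentence-level features the study draws on can be sketched with a small extractor. The three features below (average sentence length, average word length, type-token ratio) are common readability features standing in for the paper's full feature set, which is not reproduced here:

```python
import re

def essay_features(text):
    """Extract simple word- and sentence-level readability features."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "avg_sentence_length": len(words) / len(sentences),    # words per sentence
        "avg_word_length": sum(map(len, words)) / len(words),  # characters per word
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
    }
```

Feature vectors like these, computed per essay, are what a classifier would consume to predict the semester level.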
Cited: 4
Discovering Correction Rules for Auto Editing
Pub Date : 2010-09-01 DOI: 10.30019/IJCLCLP.201009.0004
Anta Huang, Tsung-Ting Kuo, Ying-Chun Lai, Shou-de Lin
This paper describes a framework that extracts effective correction rules from a sentence-aligned corpus and shows a practical application: auto-editing using the discovered rules. The framework exploits the Levenshtein distance between sentences to identify the key parts of the rules and uses the editing corpus to filter, condense, and refine the rules. We produce rule candidates of the form A → B, where A stands for the erroneous pattern and B for the correct pattern. The developed framework is language independent; therefore, it can be applied to other languages. An evaluation of the discovered rules reveals that 67.2% of the top 1500 ranked rules are annotated as correct or mostly correct by experts. Based on the rules, we have developed an online auto-editing system for demonstration at http://ppt.cc/02yY.
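The rule-extraction step can be sketched with the standard library's `difflib`, whose sequence alignment plays the role of the Levenshtein alignment described above: "replace" opcodes between a wrong/right sentence pair yield A → B candidates, with counts available for later filtering. The sample pairs are invented:

```python
import difflib
from collections import Counter

def extract_rules(pairs):
    """Collect A -> B correction-rule candidates from (wrong, right) sentence pairs."""
    rules = Counter()
    for wrong, right in pairs:
        a, b = wrong.split(), right.split()
        for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
            if tag == "replace":  # aligned span that differs between the two sentences
                rules[(" ".join(a[i1:i2]), " ".join(b[j1:j2]))] += 1
    return rules
```

A rule observed across many aligned pairs accumulates a higher count, which is the kind of evidence the framework's filtering stage would rely on.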
Cited: 5
Word Sense Disambiguation Using Multiple Contextual Features
Pub Date : 2010-09-01 DOI: 10.30019/IJCLCLP.201009.0002
Liang-Chih Yu, Chung-Hsien Wu, Jui-Feng Yeh
Word sense disambiguation (WSD) is a technique used to identify the correct sense of polysemous words, and it is useful for many applications, such as machine translation (MT), lexical substitution, information retrieval (IR), and biomedical applications. In this paper, we propose the use of multiple contextual features, including the predicate-argument structure and named entities, to train two commonly used classifiers, Naive Bayes (NB) and Maximum Entropy (ME), for word sense disambiguation. Experiments are conducted to evaluate the classifiers' performance on the OntoNotes corpus and are compared with classifiers trained using a set of baseline features, such as the bag-of-words, n-grams, and part-of-speech (POS) tags. Experimental results show that incorporating both predicate-argument structure and named entities yields higher classification accuracy for both classifiers than does the use of the baseline features, resulting in accuracy as high as 81.6% and 87.4%, respectively, for NB and ME.
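One of the two classifiers, Naive Bayes over bag-of-context features, can be sketched as follows. The "bank" training examples are invented, and the paper's richer predicate-argument and named-entity features are reduced here to plain context words:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    """Naive Bayes word sense disambiguation with add-one smoothing."""

    def __init__(self):
        self.sense_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):
        for context, sense in examples:
            self.sense_counts[sense] += 1
            for w in context:
                self.word_counts[sense][w] += 1
                self.vocab.add(w)

    def predict(self, context):
        total = sum(self.sense_counts.values())
        best, best_lp = None, float("-inf")
        for sense, n in self.sense_counts.items():
            lp = math.log(n / total)  # log prior
            denom = sum(self.word_counts[sense].values()) + len(self.vocab)
            for w in context:
                # add-one smoothed log likelihood of each context word
                lp += math.log((self.word_counts[sense][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = sense, lp
        return best
```

Swapping plain context words for predicate-argument and named-entity features changes only what goes into `context`, which is why richer features can lift accuracy without changing the classifier.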
Cited: 1
Improving the Template Generation for Chinese Character Error Detection with Confusion Sets
Pub Date : 2010-06-01 DOI: 10.30019/IJCLCLP.201006.0003
Yong-Zhi Chen, Shih-Hung Wu, Ping-Che Yang, Tsun Ku
In this paper, we propose a system that automatically generates templates for detecting Chinese character errors. We first collect the confusion sets for each high-frequency Chinese character. Error types include pronunciation-related errors and radical-related errors. With the help of the confusion sets, our system generates possible error patterns in context, which are then used as detection templates. Combined with a word segmentation module, our system generates more accurate templates. The experimental results show that precision approaches 95%. Such a system should not only help teachers grade and check student essays, but also effectively help students learn how to write.
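The template-generation step can be sketched directly: for each character of a correct word, substitute its confusable characters to produce (error pattern, correction) templates. The two-entry confusion set below is a toy example, not the paper's collected sets:

```python
def generate_templates(word, confusion_sets):
    """For each character of a correct word, substitute confusable characters
    to build (error_pattern, correction) detection templates."""
    templates = []
    for i, ch in enumerate(word):
        for confusable in confusion_sets.get(ch, []):
            templates.append((word[:i] + confusable + word[i + 1:], word))
    return templates
```

Matching the generated error patterns against running text flags likely character errors along with their corrections; a word segmentation module then prunes patterns that cross word boundaries.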
Cited: 1
A Posteriori Individual Word Language Models for Vietnamese Language
Pub Date : 2010-06-01 DOI: 10.30019/IJCLCLP.201006.0002
Le Quan Ha, Trần Thị Thanh Vân, Hoang Tien Long, N. H. Tinh, N. Tham, Le Trong Ngoc
It is shown that the enormous growth of disk storage space in recent years can be exploited to build individual word-domain statistical language models, one for each significant word of a language that contributes to the context of the text. Each of these word-domain language models is a precise domain model for the relevant significant word; when combined appropriately, they provide a highly specific domain language model for the language following a cache, even a short cache. Our individual word probability and frequency models were constructed and tested on Vietnamese and English. For English, we employed the Wall Street Journal corpus of 40 million English word tokens; for Vietnamese, we used the QUB corpus of 6.5 million tokens. Our testing methods used a priori and a posteriori approaches. Finally, we explain an adjustment of a previously exaggerated prediction of the potential power of a posteriori models. Accurate improvements in perplexity for 14 kinds of individual word language models were obtained in tests, (i) between 33.9% and 53.34% for Vietnamese and (ii) between 30.78% and 44.5% for English, over a baseline global trigram weighted-average model. For both languages, the best a posteriori model is the a posteriori weighted frequency model, with a 44.5% English perplexity improvement and a 53.34% Vietnamese perplexity improvement. In addition, five Vietnamese a posteriori models were tested, obtaining a 9.9% to 16.8% word-error-rate (WER) reduction over a Katz trigram model with the same Vietnamese speech decoder.
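Perplexity, the figure of merit quoted above, can be sketched for an add-one-smoothed unigram model. The paper's word-domain models are far richer; the toy counts here only illustrate the computation itself:

```python
import math
from collections import Counter

def perplexity(train_tokens, test_tokens):
    """Add-one-smoothed unigram perplexity of test_tokens under a model
    estimated from train_tokens."""
    counts = Counter(train_tokens)
    total = len(train_tokens)
    vocab = len(set(train_tokens) | set(test_tokens))
    log_prob = sum(
        math.log((counts[w] + 1) / (total + vocab)) for w in test_tokens
    )
    return math.exp(-log_prob / len(test_tokens))
```

A reported improvement such as 44.5% is then simply `(baseline - new) / baseline` over two such perplexity figures, with the baseline coming from the global trigram weighted-average model.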
Cited: 1
A Discrete-cepstrum Based Spectrum-envelope Estimation Scheme and Its Example Application of Voice Transformation
Pub Date : 2009-12-01 DOI: 10.30019/IJCLCLP.200912.0002
H. Gu, Sung-Feng Tsai
Approximating a spectral envelope via regularized discrete cepstrum coefficients has been proposed by previous researchers. In this paper, we study two problems encountered in practice when adopting this approach to estimate the spectral envelope: first, which spectral peaks should be selected, and second, which frequency-axis scaling function should be adopted. After experimentation, we propose feasible solutions to both problems. We then combine these solutions with the methods for regularizing and computing discrete cepstrum coefficients to form a spectral-envelope estimation scheme. This scheme has been verified, by measuring spectral-envelope approximation error, to be much better than the original scheme. Furthermore, we have applied the scheme to building a system for voice timbre transformation, whose performance demonstrates the effectiveness of the proposed spectral-envelope estimation scheme.
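The cepstral view of envelope estimation can be illustrated with the plain real cepstrum: take the log-magnitude spectrum, transform to the quefrency domain, keep only low quefrencies (liftering), and transform back to obtain a smoothed envelope. This sketch uses a naive O(N²) DFT to stay dependency-free and is not the regularized discrete-cepstrum method of the paper:

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(x):
    n = len(x)
    return [sum(x[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def cepstral_envelope(frame, keep):
    """Smooth the log-magnitude spectrum by zeroing high quefrencies."""
    log_mag = [math.log(abs(s) + 1e-12) for s in dft(frame)]
    ceps = idft(log_mag)  # real cepstrum (log_mag is real and symmetric)
    n = len(ceps)
    # lifter: keep the low quefrencies at both symmetric ends of the cepstrum
    liftered = [c if (q < keep or q >= n - keep) else 0.0
                for q, c in enumerate(ceps)]
    return [e.real for e in dft(liftered)]
```

With all quefrencies kept the envelope reproduces the log-magnitude spectrum exactly; shrinking `keep` discards fine harmonic structure and leaves the smooth envelope.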
Cited: 12
Corpus, Lexicon, and Construction: A Quantitative Corpus Approach to Mandarin Possessive Construction
Pub Date : 2009-09-01 DOI: 10.30019/IJCLCLP.200909.0004
Cheng-Hsien Chen
Taking Mandarin Possessive Construction (MPC) as an example, the present study investigates the relation between lexicon and constructional schemas in a quantitative corpus linguistic approach. We argue that the wide use of raw frequency distribution in traditional corpus linguistic studies may undermine the validity of the results and reduce the possibility for interdisciplinary communication. Furthermore, several methodological issues in traditional corpus linguistics are discussed. To mitigate the impact of these issues, we utilize phylogenic hierarchical clustering to identify semantic classes of the possessor NPs, thereby reducing the subjectivity in categorization that most traditional corpus linguistic studies suffer from. It is hoped that our rigorous endeavor in methodology may have far-reaching implications for theory in usage-based approaches to language and cognition.
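The clustering step used to identify semantic classes of possessor NPs can be sketched with single-linkage agglomerative clustering over NP feature vectors. A library routine such as scipy's linkage is replaced here by a dependency-free implementation, and the vectors are invented:

```python
import math

def single_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    closest members are nearest, until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def dist(c1, c2):
        return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Grouping possessor NPs this way, rather than assigning classes by hand, is what reduces the annotator subjectivity the paper criticizes in traditional corpus studies.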
Cited: 3