
Latest Publications in Int. J. Comput. Linguistics Chin. Lang. Process.

Enriching Cold Start Personalized Language Model Using Social Network Information
Pub Date : 2014-06-01 DOI: 10.3115/v1/P14-2100
Yu-Yang Huang, Rui Yan, Tsung-Ting Kuo, Shou-de Lin
We introduce a generalized framework to enrich personalized language models for cold start users. The cold start problem is addressed with content written by friends on social network services. Our framework consists of a mixture language model whose mixture weights are estimated with a factor graph. The factor graph incorporates prior knowledge and heuristics to identify the most appropriate weights. Intrinsic and extrinsic experiments show significant improvements for cold start users.
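The sketch below illustrates the mixture idea with smoothed unigram models. It is a minimal illustration, not the paper's system: the weight vector is supplied by hand, standing in for the factor-graph estimation the abstract describes, and all data are hypothetical.

```python
from collections import Counter

def unigram_lm(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def mixture_lm(user_tokens, friends_tokens, weights, vocab):
    """Interpolate the user's LM with each friend's LM.

    `weights` must sum to 1; the paper estimates them with a factor
    graph, here they are simply supplied by the caller.
    """
    components = [unigram_lm(user_tokens, vocab)] + [
        unigram_lm(t, vocab) for t in friends_tokens
    ]
    assert abs(sum(weights) - 1.0) < 1e-9
    return {w: sum(lam * lm[w] for lam, lm in zip(weights, components))
            for w in vocab}

user = "good morning".split()                      # sparse cold-start data
friends = ["good game tonight".split(), "morning run today".split()]
vocab = set(user) | {w for f in friends for w in f}
lm = mixture_lm(user, friends, [0.5, 0.25, 0.25], vocab)
print(sorted(lm.items(), key=lambda kv: -kv[1])[:3])
```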
Citations: 16
TQDL: Integrated Models for Cross-Language Document Retrieval
Pub Date : 2012-12-01 DOI: 10.30019/IJCLCLP.201212.0002
Longyue Wang, Derek F. Wong, Lidia S. Chao
This paper proposes an integrated approach to Cross-Language Information Retrieval (CLIR) that combines four statistical models: a translation model, a query generation model, a document retrieval model, and a length filter model. Given a document in the source language, it is first translated into the target language by a statistical machine translation model. The query generation model then selects the most relevant words in the translated version of the document as a query. Instead of retrieving all target documents with the query, the length-based model filters out a large number of irrelevant candidates according to their length information. Finally, the remaining documents in the target language are scored by the document retrieval model, which computes the similarity between the query and each document. Unlike traditional parallel-corpus-based models that rely on the IBM alignment algorithms, we divide our CLIR model into four independent parts that work together to handle term disambiguation, query generation, and document retrieval. The TQDL method also efficiently addresses translation ambiguity and query expansion for disambiguation, two major issues in Cross-Language Information Retrieval. Another contribution is the length filter, which is trained from a parallel corpus according to the length ratio between the two languages. This not only improves recall by dynamically filtering out many useless documents, but also increases efficiency through a smaller search space; precision is thus improved, and not at the cost of recall. To evaluate the retrieval performance of the proposed model on cross-language document retrieval, a number of experiments were conducted under different settings. The Europarl corpus, a collection of parallel texts in 11 languages from the proceedings of the European Parliament, was used for evaluation. The models were tested extensively on cases where text lengths are uneven and some documents have similar contents under the same topic, which are hard to distinguish. After comparing different strategies, the experimental results show strong performance: precision is normally above 90% when a larger query size is used. The length-based filter plays an important role in improving the F-measure and optimizing efficiency, which illustrates the discriminative power of the proposed method. This is significant both for cross-language search on the Internet and for producing parallel corpora for statistical machine translation systems. In future work, the TQDL system will be evaluated on Chinese, a substantial change that is even more meaningful for CLIR.
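Of the four models, the length filter is the most self-contained: it learns a target/source length-ratio band from a parallel corpus and prunes candidates outside it. A minimal sketch of that idea follows; the function names, thresholds, and numbers are illustrative, not taken from the paper.

```python
import statistics

def train_length_ratio(parallel_pairs):
    """Estimate mean and stdev of target/source length ratios
    from a list of (source_len, target_len) pairs."""
    ratios = [t / s for s, t in parallel_pairs if s > 0]
    return statistics.mean(ratios), statistics.stdev(ratios)

def length_filter(source_len, candidates, mean, std, k=2.0):
    """Keep candidate documents whose token length lies within
    (mean +/- k*std) * source_len."""
    lo = (mean - k * std) * source_len
    hi = (mean + k * std) * source_len
    return [doc for doc in candidates if lo <= len(doc) <= hi]

mean, std = train_length_ratio([(100, 120), (80, 100), (50, 55), (200, 230)])
docs = [["w"] * n for n in (30, 110, 115, 400)]   # candidate target documents
print([len(d) for d in length_filter(100, docs, mean, std)])  # keeps 110, 115
```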
Citations: 3
Detecting and Correcting Syntactic Errors in Machine Translation Using Feature-Based Lexicalized Tree Adjoining Grammars
Pub Date : 2012-12-01 DOI: 10.30019/IJCLCLP.201212.0001
Wei-Yun Ma, K. McKeown
Statistical machine translation has made tremendous progress over the past ten years. The output of even the best systems, however, is often ungrammatical because of the lack of sufficient linguistic knowledge. Even when systems incorporate syntax in the translation process, syntactic errors still result. To address this issue, we present a novel approach for detecting and correcting ungrammatical translations. In order to simultaneously detect multiple errors and their corresponding words in a formal framework, we use feature-based lexicalized tree adjoining grammars, where each lexical item is associated with a syntactic elementary tree in which each node carries a set of feature-value pairs defining the lexical item's syntactic usage. Our syntactic error detection works by checking the feature values of all lexical items within a sentence using a unification framework. To simultaneously detect multiple error types and track their corresponding words, we propose a new unification method that allows the unification procedure to continue when unification fails and propagates the failure information to the relevant words. Once error types and their corresponding words are detected, errors can be corrected based on a unified consideration of all related words under the same error types. In this paper, we present simple mechanisms to handle some of the detected situations. We use our approach to detect and correct the output of six statistical machine translation systems. The results show that most of the corrected translations are improved.
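The key mechanism is unification that records clashes instead of aborting, so several error types can be reported per sentence. A minimal sketch over flat feature structures follows; full FB-LTAG elementary trees are not reproduced, and the example features are hypothetical.

```python
def unify(fs1, fs2):
    """Unify two flat feature structures (dicts of feature -> value).

    Instead of aborting on the first clash, collect every clashing
    feature so that all error types of a word pair are reported at
    once, mirroring the paper's 'continue on failure' idea.
    """
    result, failures = {}, []
    for feat in set(fs1) | set(fs2):
        v1, v2 = fs1.get(feat), fs2.get(feat)
        if v1 is None or v2 is None or v1 == v2:
            result[feat] = v1 if v2 is None else v2
        else:
            failures.append((feat, v1, v2))  # record the clash, keep going
    return result, failures

# A hypothetical agreement check: singular subject vs. plural verb.
subj = {"num": "sg", "pers": "3"}
verb = {"num": "pl", "pers": "3", "tense": "pres"}
merged, errors = unify(subj, verb)
print(errors)  # [('num', 'sg', 'pl')] -> flags a number-agreement error
```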
Citations: 7
Evaluation of TTS Systems in Intelligibility and Comprehension Tasks: a Case Study of HTS-2008 and Multisyn Synthesizers
Pub Date : 2012-09-01 DOI: 10.30019/IJCLCLP.201209.0005
Yu-Yun Chang
This paper explores the relationship between intelligibility and comprehensibility in speech synthesizers and designs an appropriate comprehension task for evaluating synthesizers' comprehensibility. Previous studies have predicted that a speech synthesizer with higher intelligibility will also perform better in comprehension. Since the two most popular speech synthesis methods are HMM-based synthesis and unit selection, this study compares whether the HTS-2008 (HMM-based) or the Multisyn (unit selection) synthesizer performs better in application. Natural speech is used in the experiment as a control condition for the synthesizers. The results of the intelligibility test show that natural speech is better than HTS-2008, which, in turn, is much better than the Multisyn system. In the comprehension task, however, all three speech systems display minimal differences in the speech comprehension process. This is because both synthesizers have reached the threshold of intelligibility needed to provide high speech comprehension quality. Therefore, although the HTS-2008 and Multisyn systems yield equally comprehensible speech, the HTS-2008 synthesizer is recommended due to its higher intelligibility.
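The abstract does not specify how the intelligibility test is scored; a standard way to score a transcription-based intelligibility test is word error rate, sketched below purely as an illustration.

```python
def word_error_rate(reference, hypothesis):
    """Levenshtein distance over words, normalized by reference length.
    A common way to score transcription-based intelligibility tests."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j]: edits needed to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(r)][len(h)] / max(len(r), 1)

print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # ~0.167
```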
Citations: 7
Strategies of Processing Japanese Names and Character Variants in Traditional Chinese Text
Pub Date : 2012-09-01 DOI: 10.30019/IJCLCLP.201209.0004
Chuan-Jie Lin, Jia-Cheng Zhan, Yen-Heng Chen, Chien-Wei Pao
This paper proposes an approach to identifying word candidates that are not Traditional Chinese, including Japanese names (written in Japanese Kanji or Traditional Chinese characters) and word variants, when performing word segmentation on Traditional Chinese text. For personal names, a probability model over name formats is introduced. We also propose a method to map Japanese Kanji to the corresponding Traditional Chinese characters; the same method can be used to detect words written in character variants. After integrating generation rules for various types of special words, together with their probability models, the F-measure of our word segmentation system rises from 94.16% to 96.06%. A further experiment shows that 83.18% of the 862 Japanese names in a set of 109 human-annotated documents can be successfully detected.
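A minimal sketch of the Kanji-to-Traditional-Chinese mapping idea: normalize characters through a variant table before lexicon lookup. The table entries below are a few well-known correspondences chosen for illustration; the paper's actual mapping resource and name-format probability model are not reproduced.

```python
# Hypothetical variant table: Japanese Kanji / variant form -> Traditional Chinese.
VARIANT_MAP = {
    "国": "國",
    "桜": "櫻",
    "沢": "澤",
}

def normalize(text):
    """Map every character through the variant table before lexicon lookup."""
    return "".join(VARIANT_MAP.get(ch, ch) for ch in text)

def in_lexicon(word, lexicon):
    """Check a candidate against a Traditional Chinese lexicon
    after variant normalization."""
    return normalize(word) in lexicon

lexicon = {"國家", "櫻花"}
print(in_lexicon("桜花", lexicon))  # True: 桜花 normalizes to 櫻花
```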
Citations: 1
Effects of Combining Bilingual and Collocational Information on Translation of English and Chinese Verb-Noun Pairs
Pub Date : 2012-09-01 DOI: 10.30019/IJCLCLP.201209.0001
Yi-Hsuan Chuang, Chao-Lin Liu, Jing-Shin Chang
We studied a special case of the translation of English verbs in verb-object pairs. Researchers have studied the effects of the linguistic information of the verbs being translated, and many have reported how considering the objects of the verbs facilitates translation quality. In this study, we took an extreme approach: assuming the availability of the Chinese translation of the English object. In a related exploration, we examined, with analogous procedures, how the availability of the Chinese translation of the English verb influences the translation quality of English nouns in verb phrases. We explored the issue with 35 thousand VN pairs extracted from the training data of the 2011 NTCIR PatentMT workshop and with 4.8 thousand VN pairs extracted from a bilingual version of Scientific American magazine. The results indicated that, when the English verbs and objects were known, the additional information about the Chinese translations of the English verbs (or nouns) could improve the translation quality of the English nouns (or verbs), though not significantly. Further experiments compared the quality of translation achieved by our programs and by human subjects. Given the same set of information for translation decisions, human subjects did not outperform our programs, reconfirming that good translations depend heavily on contextual information of wider ranges.
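To make the setup concrete, the sketch below scores candidate Chinese verb translations by combining a translation probability with a collocation count against the known Chinese object. Both tables are tiny hypothetical stand-ins, not the models trained in the paper.

```python
def best_verb_translation(en_verb, zh_object, trans_probs, colloc_counts):
    """Score each candidate Chinese verb by its translation probability
    times a smoothed collocation count with the (known) Chinese object.

    `trans_probs[en_verb]` maps candidate zh verbs to p(zh|en);
    `colloc_counts` maps (zh_verb, zh_noun) pairs to corpus counts.
    """
    candidates = trans_probs.get(en_verb, {})
    def score(zh_verb):
        return candidates[zh_verb] * (1 + colloc_counts.get((zh_verb, zh_object), 0))
    return max(candidates, key=score) if candidates else None

trans_probs = {"play": {"玩": 0.5, "打": 0.3, "演奏": 0.2}}
colloc_counts = {("打", "籃球"): 120, ("玩", "遊戲"): 80}
# The known object translation 籃球 (basketball) steers the verb choice to 打.
print(best_verb_translation("play", "籃球", trans_probs, colloc_counts))  # 打
```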
Citations: 1
Enhancement of Feature Engineering for Conditional Random Field Learning in Chinese Word Segmentation Using Unlabeled Data
Pub Date : 2012-09-01 DOI: 10.30019/IJCLCLP.201209.0003
Mike Tian-Jian Jiang, Cheng-Wei Shih, Ting-Hao Yang, Chan-Hung Kuo, Richard Tzong-Han Tsai, W. Hsu
This work proposes a unified view of several features based on frequent strings extracted from unlabeled data that improve the conditional random field (CRF) model for Chinese word segmentation (CWS). These features include character-based n-grams (CNG), accessor-variety-based strings (AVS) and their left-right co-existence variation (LRAVS), term-contributed frequency (TCF), and term-contributed boundary (TCB) with a specific manner of boundary overlapping. For the experiments, the baseline is the 6-tag scheme, a state-of-the-art labeling scheme for CRF-based CWS, and the data sets come from the 2005 CWS Bakeoff of the Special Interest Group on Chinese Language Processing (SIGHAN) of the Association for Computational Linguistics (ACL) and the SIGHAN CWS Bakeoff 2010. The experimental results show that all of these features improve the performance of the baseline system in terms of recall, precision, and their harmonic mean (F1), for both overall accuracy (F) and out-of-vocabulary recognition (FOOV). In particular, this work presents compound features involving LRAVS/AVS and TCF/TCB that are competitive with other types of features for CRF-based CWS in terms of F and FOOV, respectively.
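Among the listed features, accessor variety (AVS) is the easiest to illustrate: a candidate string that is preceded and followed by many distinct characters in unlabeled text is likely a word. A minimal sketch follows, with simplified sentence-boundary handling and an invented toy corpus.

```python
def accessor_variety(corpus, s):
    """Count distinct characters immediately left and right of string s
    across unlabeled lines; the usual AV score takes the minimum.
    (Boundary occurrences are simply skipped here for brevity.)"""
    left, right = set(), set()
    for line in corpus:
        start = line.find(s)
        while start != -1:
            if start > 0:
                left.add(line[start - 1])
            end = start + len(s)
            if end < len(line):
                right.add(line[end])
            start = line.find(s, start + 1)
    return min(len(left), len(right))

corpus = ["我喜歡吃蘋果", "他買了蘋果汁", "蘋果很好吃"]
print(accessor_variety(corpus, "蘋果"))  # 2: left {吃, 了}, right {汁, 很}
```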
Citations: 3
Frequency, Collocation, and Statistical Modeling of Lexical Items: A Case Study of Temporal Expressions in Two Conversational Corpora
Pub Date : 2012-06-01 DOI: 10.30019/IJCLCLP.201206.0003
Sheng-Fu Wang, Jing-Chen Yang, Yu-Yun Chang, Yu-Wen Liu, S. Hsieh
This study examines how different dimensions of corpus frequency data may affect the outcome of statistical modeling of lexical items. Our analysis mainly focuses on a recently constructed elderly-speaker corpus used to reveal patterns in aging people's language use. A conversational corpus contributed by speakers in their 20s serves as complementary material. The target words examined are temporal expressions, which might reveal how speech produced by the elderly is organized. We conduct divisive hierarchical clustering analyses based on two different dimensions of corpus data, namely raw frequency distributions and collocation-based vectors. When different dimensions of data were used as input, the target terms were clustered in different ways: analyses based on frequency distributions and on collocational patterns are distinct from each other. Specifically, statistically based collocational analysis generally produces more distinct clustering results that differentiate temporal terms more finely than those based on raw frequency.
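As an illustration of the collocation-based representation, the sketch below builds invented collocation vectors for a few temporal expressions and clusters them. Note the swap: scipy provides agglomerative (bottom-up) hierarchical clustering, used here as a stand-in for the paper's divisive (top-down) procedure; all counts are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical collocation vectors for four temporal expressions:
# each row counts co-occurrence with some context words.
terms = ["現在", "以前", "後來", "最近"]
colloc = np.array([[30, 2, 5],
                   [3, 25, 6],
                   [4, 22, 8],
                   [28, 1, 7]], dtype=float)

# Row-normalize so clustering compares co-occurrence profiles,
# not raw magnitudes.
profiles = colloc / colloc.sum(axis=1, keepdims=True)

Z = linkage(profiles, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(terms, labels)))  # expect {現在, 最近} vs. {以前, 後來}
```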
Citations: 0
Development and Testing of Transcription Software for a Southern Min Spoken Corpus
Pub Date : 2012-03-01 DOI: 10.30019/IJCLCLP.201203.0001
Jia-Cing Ruan, Chiung-Wen Hsu, J. Myers, Jane S. Tsay
The usual challenges of transcribing spoken language are compounded for Southern Min (Taiwanese) because it lacks a generally accepted orthography. This study reports the development and testing of software tools for assisting such transcription. Three tools are compared, each representing a different type of interface with our corpus-based Southern Min lexicon (Tsay, 2007): our original Chinese character-based tool (Segmentor), the first version of a romanization-based lexicon entry tool called Adult-Corpus Romanization Input Program (ACRIP 1.0), and a revised version of ACRIP that accepts both character and romanization inputs and integrates them with sound files (ACRIP 2.0). In two experiments, naive native speakers of Southern Min were asked to transcribe passages from our corpus of adult spoken Southern Min (Tsay and Myers, in progress), using one or more of these tools. Experiment 1 showed no disadvantage for romanization-based compared with character-based transcription even for untrained transcribers. Experiment 2 showed significant advantages of the new mixed-system tool (ACRIP 2.0) over both Segmentor and ACRIP 1.0, in both speed and accuracy of transcription. Experiment 2 also showed that only minimal additional training brought dramatic improvements in both speed and accuracy. These results suggest that the transcription of non-Mandarin Sinitic languages benefits from flexible, integrated software tools.
Citations: 1
Histogram Equalization on Statistical Approaches for Chinese Unknown Word Extraction
Pub Date : 2011-12-01 DOI: 10.30019/IJCLCLP.201112.0003
Bor-Shen Lin, Yi-Cong Chen
With the evolution of human lives and the spread of information, new things emerge quickly and new terms are created every day. It is therefore important for natural language processing systems to extract new words as they appear over time. Due to the broad range of application areas, however, there may be a mismatch of statistical characteristics between the training domain and the testing domain, which inevitably degrades the performance of word extraction. This paper proposes a word extraction scheme in which histogram equalization is used for feature normalization. Through this scheme, the mismatch of feature distributions due to different corpus sizes or changes of domain can be compensated for appropriately, such that unknown word extraction becomes more reliable and applicable to new domains. The scheme was initially evaluated on the corpora announced in SIGHAN2. Using four combined features with equalization, word identification F-measures of 68.43% and 71.40% were achieved on the CKIP and CUHK test sets, respectively, corresponding to IV/OOV recall rates of 66.72%/32.94% and 75.99%/58.39%. When applied to unknown word extraction in a new domain, the scheme can identify such proper nouns as ”海角七號” (Cape No. 7, the name of a film), ”蠟筆小新” (Crayon Shinchan, the name of a cartoon figure), ”金融海嘯” (Financial Tsunami), and so on, which cannot be extracted reliably with rule-based approaches, although the approach appears less effective at identifying terms such as the names of people, places, or organizations, for which the semantic structure is prominent. The scheme is complementary to the outcomes of two word segmentation systems and is promising if other rule-based approaches can be further integrated.
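Histogram equalization here means mapping a feature's empirical distribution in the test domain onto a reference (training-domain) distribution. A minimal numpy sketch of that CDF-matching step follows, with invented feature values; it is an illustration of the general recipe, not the paper's exact implementation.

```python
import numpy as np

def equalize(values, reference):
    """Map `values` onto the distribution of `reference` by matching
    empirical CDFs (rank -> quantile), the usual histogram-equalization
    recipe for normalizing a feature across domains."""
    values = np.asarray(values, dtype=float)
    ref_sorted = np.sort(np.asarray(reference, dtype=float))
    # empirical CDF position of each value within its own sample
    ranks = np.argsort(np.argsort(values))
    cdf = (ranks + 0.5) / len(values)
    # read off the reference quantile at the same CDF position
    return np.interp(cdf, np.linspace(0, 1, len(ref_sorted)), ref_sorted)

test_feat = [0.1, 0.5, 0.9, 0.2]        # e.g. a statistical feature in a new domain
train_feat = [1.0, 2.0, 3.0, 4.0, 5.0]  # the same feature in the training domain
print(equalize(test_feat, train_feat))  # [1.5, 3.5, 4.5, 2.5]
```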
Citations: 0