
Latest Publications: Int. J. Comput. Linguistics Chin. Lang. Process.

The Sinica Sense Management System: Design and Implementation
Pub Date : 2005-12-01 DOI: 10.30019/IJCLCLP.200512.0001
Chu-Ren Huang, Chun-Ling Chen, Cui-Xia Weng, Hsiang-Ping Lee, Yong-Xiang Chen, Keh-Jiann Chen
A sense-based lexical knowledgebase is a core foundation for language engineering. Two important criteria must be satisfied when constructing a knowledgebase: linguistic felicity and data cohesion. In this paper, we discuss how data cohesion of the sense information collected using the Sinica Sense Management System (SSMS) can be achieved. SSMS manages both lexical entries and word senses, and was designed and implemented by the Chinese Wordnet Team at Academia Sinica. SSMS contains all the basic information that can be merged into the future Chinese Wordnet. In addition to senses and meaning facets, SSMS also includes the following information: POS, example sentences, corresponding English synset(s) from Princeton WordNet, and lexical semantic relations such as synonym/antonym and hypernym/hyponym. Moreover, the overarching structure of the system is managed by means of a sense serial number, and an inter-entry structure is established through cross-references among synsets and homographs. SSMS is not only a versatile development tool and management system for a sense-based lexical knowledgebase; it can also serve as the database backend for both Chinese Wordnet and any sense-based application for Chinese language processing.
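As a concrete illustration of the per-sense information the abstract lists, the sketch below models one SSMS-style sense record in Python. All field names are hypothetical (the actual SSMS schema is not given in the abstract); the point is that a sense serial number serves as the primary key and that cross-references among synsets and homographs link records to one another.

```python
from dataclasses import dataclass, field
from typing import List

# A minimal sketch (not the actual SSMS schema) of a per-sense record:
# a sense serial number as the primary key, plus the POS, examples,
# Princeton WordNet synset links, and lexical semantic relations
# described in the abstract.
@dataclass
class SenseRecord:
    sense_id: str                 # sense serial number (hypothetical format)
    lemma: str                    # head word of the lexical entry
    pos: str                      # part of speech
    definition: str
    examples: List[str] = field(default_factory=list)
    english_synsets: List[str] = field(default_factory=list)  # Princeton WordNet synset IDs
    synonyms: List[str] = field(default_factory=list)         # sense_ids of synonyms
    antonyms: List[str] = field(default_factory=list)
    hypernyms: List[str] = field(default_factory=list)
    hyponyms: List[str] = field(default_factory=list)
    cross_refs: List[str] = field(default_factory=list)       # synset/homograph cross-references
```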
Citations: 14
An Unsupervised Approach to Chinese Word Sense Disambiguation Based on Hownet
Pub Date : 2005-12-01 DOI: 10.30019/IJCLCLP.200512.0005
Hao Chen, Tingting He, D. Ji, Changqin Quan
Research on word sense disambiguation (WSD) has great theoretical and practical significance in many fields of natural language processing (NLP). This paper presents an unsupervised approach to Chinese word sense disambiguation based on Hownet (an electronic Chinese lexical resource). In our approach, contexts that include ambiguous words are converted into vectors by means of a second-order context method, and these context vectors are then clustered by the k-means clustering algorithm. Lastly, the ambiguous words are disambiguated after a similarity calculation is completed. Our experiments involved term extraction, and an average accuracy rate of 82.62% was achieved.
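The pipeline the abstract describes (second-order context vectors followed by k-means clustering) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `cooc` is assumed to be a precomputed first-order co-occurrence matrix, and scikit-learn's `KMeans` stands in for whatever clustering code the authors used. The final mapping of clusters to Hownet senses via a similarity calculation is left out.

```python
import numpy as np
from sklearn.cluster import KMeans

def second_order_vector(context_words, cooc, vocab_index):
    """Second-order context vector: average the first-order co-occurrence
    rows of the words surrounding the ambiguous target."""
    rows = [cooc[vocab_index[w]] for w in context_words if w in vocab_index]
    return np.mean(rows, axis=0) if rows else np.zeros(cooc.shape[1])

def cluster_contexts(contexts, cooc, vocab_index, n_senses=2):
    """Cluster the occurrences of an ambiguous word by their
    second-order context vectors. `contexts` is a list of token lists,
    one per occurrence; `cooc` is a (V x V) co-occurrence matrix."""
    X = np.vstack([second_order_vector(c, cooc, vocab_index) for c in contexts])
    return KMeans(n_clusters=n_senses, n_init=10).fit_predict(X)
```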
Citations: 5
Modeling Pronunciation Variation for Bi-Lingual Mandarin/Taiwanese Speech Recognition
Pub Date : 2005-09-01 DOI: 10.30019/IJCLCLP.200509.0005
Dau-Cheng Lyu, Ren-Yuan Lyu, Yuang-Chin Chiang, Chun-Nan Hsu
In this paper, a bi-lingual large-vocabulary speech recognition experiment based on the idea of modeling pronunciation variations is described. The two languages under study are Mandarin Chinese and Taiwanese (Min-nan). These two languages are basically mutually unintelligible, yet they share many words that are written with the same Chinese characters and have the same meanings, although they are pronounced differently. Observing the bi-lingual corpus, we found five types of pronunciation variations for Chinese characters. A one-pass, three-layer recognizer was developed that combines bi-lingual acoustic models, an integrated pronunciation model, and a tree-structure-based searching net. The recognizer's performance was evaluated under three different pronunciation models. The results showed that the character error rate with the integrated pronunciation model was better than that with pronunciation models built using either the knowledge-based or the data-driven approach alone. The relative frequency ratio was also used as a measure to choose the best number of pronunciation variations for each Chinese character. Finally, the best character error rates on the Mandarin and Taiwanese testing sets were found to be 16.2% and 15.0%, respectively, when the average number of pronunciations per Chinese character was 3.9.
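The relative frequency ratio mentioned at the end of the abstract can be illustrated with a short sketch: for each character, variants are kept only if they are frequent enough relative to the most common variant. The threshold and the counts below are hypothetical; the paper selects the best number of variations empirically rather than with a fixed cutoff.

```python
from collections import Counter

def select_variants(pron_counts, threshold=0.1):
    """Keep pronunciation variants of one Chinese character whose
    relative frequency ratio (count / count of the most frequent
    variant) meets a threshold. Sketch only."""
    if not pron_counts:
        return []
    top = max(pron_counts.values())
    return [p for p, c in pron_counts.items() if c / top >= threshold]

# Hypothetical counts for one character observed in a bi-lingual corpus:
counts = Counter({"da2": 120, "taN2": 45, "da4": 3})
print(select_variants(counts, threshold=0.1))  # ['da2', 'taN2']
```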
Citations: 7
Chinese Word Segmentation by Classification of Characters
Pub Date : 2005-09-01 DOI: 10.30019/IJCLCLP.200509.0006
Chooi-Ling Goh, Masayuki Asahara, Yuji Matsumoto
During the process of Chinese word segmentation, two main problems occur: segmentation ambiguities and unknown word occurrences. This paper describes a method for solving the segmentation problem. First, we use a dictionary-based approach to segment the text: we apply the Maximum Matching algorithm to segment the text forwards (FMM) and backwards (BMM). Based on the differences between FMM and BMM, and on the context, we apply a classification method based on Support Vector Machines to re-assign the word boundaries. In so doing, we use the output of a dictionary-based approach and then apply a machine-learning-based approach to solve the segmentation problem. Experimental results show that our model can achieve an F-measure of 99.0 for overall segmentation when there are no unknown words in the text, and an F-measure of 95.1 when unknown words exist.
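A minimal sketch of the two dictionary-based passes follows; `lexicon` is assumed to be a set of known words and `max_len` the length of the longest dictionary entry. Positions where the FMM and BMM outputs disagree mark the ambiguous regions that the paper then re-classifies with the SVM.

```python
def fmm(text, lexicon, max_len=6):
    """Forward maximum matching: greedily take the longest dictionary
    word starting at the current position (single-character fallback)."""
    i, out = 0, []
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 0, -1):
            if L == 1 or text[i:i + L] in lexicon:
                out.append(text[i:i + L])
                i += L
                break
    return out

def bmm(text, lexicon, max_len=6):
    """Backward maximum matching: the same idea, scanning from the end."""
    j, out = len(text), []
    while j > 0:
        for L in range(min(max_len, j), 0, -1):
            if L == 1 or text[j - L:j] in lexicon:
                out.insert(0, text[j - L:j])
                j -= L
                break
    return out
```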
Citations: 47
MATBN: A Mandarin Chinese Broadcast News Corpus
Pub Date : 2005-07-01 DOI: 10.30019/IJCLCLP.200507.0004
H. Wang, Berlin Chen, Jen-wei Kuo, Shih-Sian Cheng
The MATBN Mandarin Chinese broadcast news corpus contains a total of 198 hours of broadcast news from the Public Television Service Foundation (Taiwan), with corresponding transcripts. The primary purpose of this collection is to provide training and testing data for continuous speech recognition evaluation in the broadcast news domain. In this paper, we briefly introduce the speech corpus and report some preliminary statistical analyses and speech recognition evaluation results.
Citations: 128
Mandarin Topic-oriented Conversations
Pub Date : 2005-07-01 DOI: 10.30019/IJCLCLP.200507.0003
S. Tseng
This paper describes the collection and processing of a pilot speech corpus annotated with dialogue acts. The Mandarin Topic-oriented Conversational Corpus (MTCC) consists of annotated transcripts and sound files of conversations between two speakers who are familiar with each other. Particular features of spoken Mandarin, such as discourse particles and paralinguistic sounds, are taken into account in the orthographic transcription. In addition, the dialogue structure is annotated using an annotation scheme developed for topic-specific conversations. Using the annotated materials, we present the results of a preliminary analysis of dialogue structure and dialogue acts. Related transcription tools and web query applications are also introduced in this paper.
Citations: 4
The Formosan Language Archive: Linguistic Analysis and Language Processing
Pub Date : 2005-07-01 DOI: 10.30019/IJCLCLP.200507.0002
Elizabeth Zeitoun, Ching-Hua Yu
In this paper, we deal with the linguistic analysis approach adopted in the Formosan Language Corpora, one of the three main information databases included in the Formosan Language Archive, and the language processing programs that have been built upon it. We first discuss problems related to the transcription of different language corpora. We then deal with annotation rules and standards. We go on to explain the linguistic identification of clauses, sentences and paragraphs, and the computer programs used to obtain an alignment of words, glosses and sentences in Chinese and English. We finally show how we try to cope with analytic inconsistencies through programming. This paper is a complement to Zeitoun et al. [2003] in which we provided an overview of the whole architecture of the Formosan Language Archive.
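A toy version of the word-gloss alignment step mentioned above might look like the following. The function and its error handling are illustrative only, not the archive's actual programs, but they show the kind of consistency check that surfaces the analytic inconsistencies the authors describe.

```python
def align_words_glosses(word_line, gloss_line):
    """Pair space-delimited words with their glosses one-to-one, and
    flag mismatched counts so the inconsistency can be fixed in the
    source transcription. Sketch only."""
    words, glosses = word_line.split(), gloss_line.split()
    if len(words) != len(glosses):
        raise ValueError(f"{len(words)} words vs {len(glosses)} glosses")
    return list(zip(words, glosses))
```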
Citations: 15
TAICAR-The Collection and Annotation of an In-Car Speech Database Created in Taiwan
Pub Date : 2005-07-01 DOI: 10.30019/IJCLCLP.200507.0005
Hsien-Chang Wang, Chung-Hsien Yang, Jhing-Fa Wang, Chung-Hsien Wu, Jen-Tzung Chien
This paper describes a project that aims to create a Mandarin speech database for the automobile setting (TAICAR). A group of researchers from several universities and research institutes in Taiwan have participated in the project. The goal is to generate a corpus for the development and testing of various speech-processing techniques. There are six recording sites in this project. Various words, sentences, and spontaneous queries uttered in the vehicular navigation setting have been collected. A preliminary corpus was created from utterances produced by 192 speakers in different vehicles. The database contains more than 163,000 files, occupying 16.8 gigabytes of disk space.
Citations: 8
Automatic Segmentation and Labeling for Mandarin Chinese Speech Corpora for Concatenation-based TTS
Pub Date : 2005-06-01 DOI: 10.30019/IJCLCLP.200507.0001
Chengyuan Lin, J. Jang, Kuan-Ting Chen
Precise phone/syllable boundary labeling of the utterances in a speech corpus plays an important role in constructing a corpus-based TTS (text-to-speech) system. However, automatic labeling based on Viterbi forced alignment does not always produce satisfactory results. Moreover, a suitable labeling method for one language does not necessarily produce desirable results for another language. Hence, in this paper, we propose a new procedure for refining the boundaries of utterances in a Mandarin speech corpus. This procedure employs different sets of acoustic features for four different phonetic categories. In addition, a new scheme is proposed to deal with the "periodic voiced + periodic voiced" case, which produced most of the segmentation errors in our experiment. Several experiments were conducted to demonstrate the feasibility of the proposed approach.
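The boundary-refinement idea can be sketched as a local search around the initial Viterbi boundary for the frame where a category-appropriate acoustic feature changes most sharply. The feature choice and window size below are assumptions for illustration; the paper selects different feature sets for each of its four phonetic categories.

```python
import numpy as np

def refine_boundary(feature, init_frame, radius=5):
    """Move an initial forced-alignment boundary to the frame within
    +/- radius frames where the chosen per-frame acoustic feature
    (e.g. log energy, a 1-D array) changes most sharply.
    A sketch, not the paper's exact refinement rule."""
    lo = max(1, init_frame - radius)
    hi = min(len(feature) - 1, init_frame + radius)
    deltas = np.abs(np.diff(feature[lo - 1:hi + 1]))  # frame-to-frame change
    return lo + int(np.argmax(deltas))
```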
Citations: 25
Aligning Parallel Bilingual Corpora Statistically with Punctuation Criteria
Pub Date : 2005-03-01 DOI: 10.30019/IJCLCLP.200503.0005
Thomas C. Chuang, K. Yeh
We present a new approach to aligning sentences in bilingual parallel corpora based on punctuation, especially for English and Chinese. Although the length-based approach produces high accuracy rates of sentence alignment for clean parallel corpora written in two Western languages, such as French-English or German-English, it does not work as well for parallel corpora that are noisy or written in two disparate languages, such as Chinese-English. It is possible to use cognates on top of the length-based approach to increase alignment accuracy. However, cognates do not exist between two disparate languages, which limits the applicability of the cognate-based approach. In this paper, we examine the feasibility of exploiting the statistically ordered matching of punctuation marks in two languages to achieve highly accurate sentence alignment. We have experimented with an implementation of the proposed method on parallel corpora, the Chinese-English Sinorama Magazine Corpus and Scientific American magazine articles, with satisfactory results. Compared with the length-based method, the proposed method exhibits better precision rates in our experimental results. Highly promising improvement was observed when both the punctuation-based and length-based methods were adopted within a common statistical framework. We also demonstrate that the method can be applied to other language pairs, such as English-Japanese, with minimal additional effort.
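To make the punctuation-matching idea concrete, here is a simplified dynamic-programming aligner in the spirit of the paper (not the authors' exact formulation): each candidate bead (1-1, 1-2, 2-1, plus 1-0/0-1 deletions with a flat penalty) is scored by the number of order-preserving punctuation matches between the Chinese and English sides, computed as a longest common subsequence over normalized punctuation sequences.

```python
# Full-width Chinese punctuation normalized to ASCII equivalents (partial map)
PUNCT_MAP = {"。": ".", "，": ",", "？": "?", "！": "!", "：": ":", "；": ";"}
MARKS = set(".,?!:;")

def punct_signature(text):
    """Sequence of normalized punctuation marks appearing in the text."""
    return [PUNCT_MAP.get(ch, ch) for ch in text if PUNCT_MAP.get(ch, ch) in MARKS]

def punct_score(zh_text, en_text):
    """Order-preserving punctuation matches: an LCS over the two
    punctuation sequences."""
    a, b = punct_signature(zh_text), punct_signature(en_text)
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1]
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[len(a)][len(b)]

BEADS = ((1, 1), (1, 2), (2, 1), (1, 0), (0, 1))  # 1-0/0-1 handle deletions

def align(zh_sents, en_sents):
    """DP over sentence beads, maximizing the total punctuation score."""
    n, m = len(zh_sents), len(en_sents)
    NEG = float("-inf")
    best = [[NEG] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    back = {}
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == NEG:
                continue
            for di, dj in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                # deletion beads incur a flat penalty instead of a match score
                gain = (punct_score("".join(zh_sents[i:ni]), " ".join(en_sents[j:nj]))
                        if di and dj else -1.0)
                if best[i][j] + gain > best[ni][nj]:
                    best[ni][nj] = best[i][j] + gain
                    back[(ni, nj)] = (i, j)
    beads, ij = [], (n, m)        # trace the best path back to (0, 0)
    while ij != (0, 0):
        pi, pj = back[ij]
        beads.append((zh_sents[pi:ij[0]], en_sents[pj:ij[1]]))
        ij = (pi, pj)
    return beads[::-1]
```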
Citations: 21