Chinese Main Verb Identification: From Specification to Realization
Binggong Ding, C. Huang, Degen Huang
Pub Date: 2005-03-01 | DOI: 10.30019/IJCLCLP.200503.0004
Main verb identification is the task of automatically identifying the predicate verb in a sentence. It is useful for many applications in Chinese Natural Language Processing. Although most studies have focused on the model used to identify the main verb, the definition of the main verb should not be overlooked. In our specification design, we found many complicated issues that still need to be resolved, since they have not been well discussed in previous work. Thus, the first novel aspect of our work is that we carefully design a specification for annotating the main verb and investigate various complicated cases. We hope this discussion will help to uncover the difficulties involved in this problem. Secondly, we present an approach to main verb identification based on chunk information, which leads to better results than an approach based on part-of-speech. Finally, based on careful observation of the studied corpus, we propose new local and contextual features for main verb identification. Following our specification, we annotate a corpus and then use a Support Vector Machine (SVM) to integrate all the features we propose. Our model, trained on the annotated corpus, achieved a promising F-score of 92.8%. Furthermore, we show that main verb identification can improve the performance of the Chinese Sentence Breaker, one of its applications, by 2.4%.
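As a hedged illustration of the approach described above, here is a minimal Python sketch of main verb identification framed as per-candidate classification: every verb chunk in a sentence is scored with local and contextual chunk features, and an SVM decides whether it is the main verb. The feature names, the toy chunked sentences, and the use of scikit-learn's LinearSVC are illustrative assumptions, not the authors' actual feature set or corpus.

# Per-verb binary classification with chunk features (illustrative only).
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def verb_features(chunks, i):
    """Local and contextual features for the verb heading chunk i,
    given a sentence as a list of (chunk_type, head_word) pairs."""
    chunk_type, head = chunks[i]
    return {
        "head": head,
        "chunk_type": chunk_type,
        "prev_chunk": chunks[i - 1][0] if i > 0 else "BOS",
        "next_chunk": chunks[i + 1][0] if i + 1 < len(chunks) else "EOS",
        "rel_position": i / len(chunks),   # where the candidate sits in the sentence
    }

# Toy training data: chunked sentences with the main-verb chunk index marked.
train = [
    ([("NP", "他"), ("VP", "认为"), ("NP", "问题"), ("VP", "解决")], 1),
    ([("NP", "公司"), ("VP", "宣布"), ("NP", "计划")], 1),
]
X, y = [], []
for chunks, main_idx in train:
    for i, (ctype, _) in enumerate(chunks):
        if ctype == "VP":                  # only verb chunks are candidates
            X.append(verb_features(chunks, i))
            y.append(1 if i == main_idx else 0)

vec = DictVectorizer()
clf = LinearSVC().fit(vec.fit_transform(X), y)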
{"title":"Chinese Main Verb Identification: From Specification to Realization","authors":"Binggong Ding, C. Huang, Degen Huang","doi":"10.30019/IJCLCLP.200503.0004","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200503.0004","url":null,"abstract":"Main verb identification is the task of automatically identifying the predicate-verb in a sentence. It is useful for many applications in Chinese Natural Language Processing. Although most studies have focused on the model used to identify the main verb, the definition of the main verb should not be overlooked. In our specification design, we have found many complicated issues that still need to be resolved since they haven't been well discussed in previous works. Thus, the first novel aspect of our work is that we carefully design a specification for annotating the main verb and investigate various complicated cases. We hope this discussion will help to uncover the difficulties involved in this problem. Secondly, we present an approach to realizing main verb identification based on the use of chunk information, which leads to better results than the approach based on part-of-speech. Finally, based on careful observation of the studied corpus, we propose new local and contextual features for main verb identification. According to our specification, we annotate a corpus and then use a Support Vector Machine (SVM) to integrate all the features we propose. Our model, which was trained on our annotated corpus, achieved a promising F score of 92.8%. Furthermore, we show that main verb identification can improve the performance of the Chinese Sentence Breaker, one of the applications of main verb identification, by 2.4%.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130912052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Similarity Based Chinese Synonym Collocation Extraction
Wanyin Li, Q. Lu, Ruifeng Xu
Pub Date: 2005-03-01 | DOI: 10.30019/IJCLCLP.200503.0006
Collocation extraction systems based on pure statistical methods suffer from two major problems. The first problem is their relatively low precision and recall rates. The second problem is their difficulty in dealing with sparse collocations. In order to improve performance, both statistical and lexicographic approaches should be considered. This paper presents a new method to extract synonymous collocations using semantic information. The semantic information is obtained by calculating similarities from HowNet. We have successfully extracted synonymous collocations which normally cannot be extracted using lexical statistics. Our evaluation conducted on a 60MB tagged corpus shows that we can extract synonymous collocations that occur with very low frequency and that the improvement in the recall rate is close to 100%. In addition, compared with a collocation extraction system based on the Xtract system for English, our algorithm can improve the precision rate by about 44%.
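To make the substitution idea concrete, the sketch below generates candidate synonymous collocations by replacing one word of an extracted collocation with a sufficiently similar word. The paper computes word similarity from HowNet; here the similarity function is a small stand-in lookup table so the example stays self-contained, and the words and threshold are invented.

# Candidate synonymous collocations via word substitution (similarity stubbed).
SIM = {("提高", "提升"): 0.9, ("水平", "水准"): 0.85}

def similarity(w1, w2):
    """Stand-in for a HowNet-based similarity; symmetric lookup."""
    if w1 == w2:
        return 1.0
    return SIM.get((w1, w2), SIM.get((w2, w1), 0.0))

def synonym_collocations(colloc, vocab, threshold=0.8):
    """Expand a (w1, w2) collocation into candidates whose substituted
    word clears the similarity threshold."""
    w1, w2 = colloc
    out = []
    for v in vocab:
        if v != w1 and similarity(w1, v) >= threshold:
            out.append((v, w2))
        if v != w2 and similarity(w2, v) >= threshold:
            out.append((w1, v))
    return out

print(synonym_collocations(("提高", "水平"), ["提升", "水准", "降低"]))
# -> [('提升', '水平'), ('提高', '水准')]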
{"title":"Similarity Based Chinese Synonym Collocation Extraction","authors":"Wanyin Li, Q. Lu, Ruifeng Xu","doi":"10.30019/IJCLCLP.200503.0006","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200503.0006","url":null,"abstract":"Collocation extraction systems based on pure statistical methods suffer from two major problems. The first problem is their relatively low precision and recall rates. The second problem is their difficulty in dealing with sparse collocations. In order to improve performance, both statistical and lexicographic approaches should be considered. This paper presents a new method to extract synonymous collocations using semantic information. The semantic information is obtained by calculating similarities from HowNet. We have successfully extracted synonymous collocations which normally cannot be extracted using lexical statistics. Our evaluation conducted on a 60MB tagged corpus shows that we can extract synonymous collocations that occur with very low frequency and that the improvement in the recall rate is close to 100%. In addition, compared with a collocation extraction system based on the Xtract system for English, our algorithm can improve the precision rate by about 44%.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131436098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detecting Emotions in Mandarin Speech
T. Pao, Yu-Te Chen, Jun-Heng Yeh, Wen-Yuan Liao
Pub Date: 2004-09-01 | DOI: 10.30019/IJCLCLP.200509.0004
The importance of automatically recognizing emotions in human speech has grown with the increasing role of spoken language interfaces in human-computer interaction applications. In this paper, an emotion classification method for Mandarin speech is presented. Five primary human emotions, including anger, boredom, happiness, neutral and sadness, are investigated. Combining different feature streams to obtain a more accurate result is a well-known statistical technique. For speech emotion recognition, we combined 16 LPC coefficients, 12 LPCC components, 16 LFPC components, 16 PLP coefficients, 20 MFCC components and jitter as the basic features to form the feature vector. Two corpora were employed. The recognizer presented in this paper is based on three classification techniques: LDA, K-NN and HMMs. Results show that the selected features are robust and effective for emotion recognition in the valence and arousal dimensions of the two corpora. Using the HMM emotion classification method, an average accuracy of 88.7% was achieved.
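A minimal sketch of the classification stage follows, assuming the acoustic front-end has already produced the per-utterance feature streams listed above (LPC, LPCC, LFPC, PLP, MFCC, jitter). The streams are stubbed with random vectors and the labels are placeholders; the point is the concatenation into a single 81-dimensional feature vector and a K-NN decision, one of the three classifiers the paper compares.

# Feature-stream concatenation and K-NN classification (front-end stubbed).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

EMOTIONS = ["anger", "boredom", "happiness", "neutral", "sadness"]

def utterance_vector(rng):
    # 16 LPC + 12 LPCC + 16 LFPC + 16 PLP + 20 MFCC + jitter = 81 dims
    streams = [rng.normal(size=n) for n in (16, 12, 16, 16, 20, 1)]
    return np.concatenate(streams)

rng = np.random.default_rng(0)
X = np.stack([utterance_vector(rng) for _ in range(100)])
y = rng.integers(len(EMOTIONS), size=100)   # placeholder emotion labels

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(EMOTIONS[knn.predict(X[:1])[0]])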
{"title":"Detecting Emotions in Mandarin Speech","authors":"T. Pao, Yu-Te Chen, Jun-Heng Yeh, Wen-Yuan Liao","doi":"10.30019/IJCLCLP.200509.0004","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200509.0004","url":null,"abstract":"The importance of automatically recognizing emotions in human speech has grown with the increasing role of spoken language interfaces in human-computer interaction applications. In this paper, a Mandarin speech based emotion classification method is presented. Five primary human emotions, including anger, boredom, happiness, neutral and sadness, are investigated. Combining different feature streams to obtain a more accurate result is a well-known statistical technique. For speech emotion recognition, we combined 16 LPC coefficients, 12 LPCC components, 16 LFPC components, 16 PLP coefficients, 20 MFCC components and jitter as the basic features to form the feature vector. Two corpora were employed. The recognizer presented in this paper is based on three classification techniques: LDA, K-NN and HMMs. Results show that the selected features are robust and effective for the emotion recognition in the valence and arousal dimensions of the two corpora. Using the HMMs emotion classification method, an average accuracy of 88.7% was achieved.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132346482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automated Alignment and Extraction of a Bilingual Ontology for Cross-Language Domain-Specific Applications
Jui-Feng Yeh, Chung-Hsien Wu, Ming-Jun Chen, Liang-Chih Yu
Pub Date: 2004-08-23 | DOI: 10.3115/1220355.1220519
In this paper we propose a novel approach for ontology alignment and domain ontology extraction from two existing knowledge bases, WordNet and HowNet. These two knowledge bases are aligned to construct a bilingual ontology based on the co-occurrence of words in the sentence pairs of a parallel corpus. The bilingual ontology has the merit that it covers more structural and semantic information from these two complementary knowledge bases. For domain-specific applications, a domain-specific ontology is further extracted from the bilingual ontology using the island-driven algorithm and a domain-specific corpus. Finally, domain-dependent terminology and some axioms between domain terminologies are integrated into the ontology. For ontology evaluation, experiments were conducted by comparison against a benchmark constructed by ontology engineers or experts. The experimental results show that the proposed approach can extract an aligned bilingual domain-specific ontology.
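The sketch below illustrates the co-occurrence alignment idea: a WordNet sense and a HowNet concept are linked when their lexicalizations co-occur often across the sentence pairs of a parallel corpus. The two-sentence corpus, the toy sense and concept inventories, and the overlap-product score are all illustrative stand-ins for the paper's actual resources and statistics.

# Aligning WordNet senses to HowNet concepts by parallel co-occurrence (toy).
from collections import Counter

parallel = [
    ("the bank approved the loan", "银行 批准 了 贷款"),
    ("we sat on the river bank", "我们 坐 在 河岸 上"),
]
wn_senses = {"bank#finance": {"bank", "loan"}, "bank#river": {"bank", "river"}}
hn_concepts = {"银行": {"银行"}, "河岸": {"河岸"}}

counts = Counter()
for en, zh in parallel:
    en_w, zh_w = set(en.split()), set(zh.split())
    for sense, words in wn_senses.items():
        for concept, cwords in hn_concepts.items():
            # weight by how many words of each side actually occur
            counts[(sense, concept)] += len(words & en_w) * len(cwords & zh_w)

# Align each sense with its most frequently co-occurring concept.
for sense in wn_senses:
    best = max(hn_concepts, key=lambda c: counts[(sense, c)])
    print(sense, "->", best)
# bank#finance -> 银行 ; bank#river -> 河岸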
{"title":"Automated Alignment and Extraction of a Bilingual Ontology for Cross-Language Domain-Specific Applications","authors":"Jui-Feng Yeh, Chung-Hsien Wu, Ming-Jun Chen, Liang-Chih Yu","doi":"10.3115/1220355.1220519","DOIUrl":"https://doi.org/10.3115/1220355.1220519","url":null,"abstract":"In this paper we propose a novel approach for ontology alignment and domain ontology extraction from the existing knowledge bases, WordNet and HowNet. These two knowledge bases are aligned to construct a bilingual ontology based on the cooccurrence of the words in the sentence pairs of a parallel corpus. The bilingual ontology has the merit that it contains more structural and semantic information coverage from these two complementary knowledge bases. For domainspecific applications, the domain specific ontology is further extracted from the bilingual ontology by the island-driven algorithm and the domain-specific corpus. Finally, the domain-dependent terminologies and some axioms between domain terminologies are integrated into the ontology. For ontology evaluation, experiments were conducted by comparing the benchmark constructed by the ontology engineers or experts. The experimental results show that the proposed approach can extract an aligned bilingual domain-specific ontology.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129906307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Toward Constructing A Multilingual Speech Corpus for Taiwanese (Min-nan), Hakka, and Mandarin Chinese
Ren-Yuan Lyu, Min-Siong Liang, Yuang-Chin Chiang
Pub Date: 2004-08-01 | DOI: 10.30019/IJCLCLP.200408.0001
The Formosa speech database (ForSDat) is a multilingual speech corpus collected at Chang Gung University and sponsored by the National Science Council of Taiwan. It is expected that a multilingual speech corpus will be collected, covering the three most frequently used languages in Taiwan: Taiwanese (Min-nan), Hakka, and Mandarin. This 3-year project has the goal of collecting a phonetically abundant speech corpus of more than 1,800 speakers and hundreds of hours of speech. Recently, the first version of this corpus containing speech of 600 speakers of Taiwanese and Mandarin was finished and is ready to be released. It contains about 49 hours of speech and 247,000 utterances.
{"title":"Toward Constructing A Multilingual Speech Corpus for Taiwanese (Min-nan), Hakka, and Mandarin Chinese","authors":"Ren-Yuan Lyu, Min-Siong Liang, Yuang-Chin Chiang","doi":"10.30019/IJCLCLP.200408.0001","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200408.0001","url":null,"abstract":"The Formosa speech database (ForSDat) is a multilingual speech corpus collected at Chang Gung University and sponsored by the National Science Council of Taiwan. It is expected that a multilingual speech corpus will be collected, covering the three most frequently used languages in Taiwan: Taiwanese (Min-nan), Hakka, and Mandarin. This 3-year project has the goal of collecting a phonetically abundant speech corpus of more than 1,800 speakers and hundreds of hours of speech. Recently, the first version of this corpus containing speech of 600 speakers of Taiwanese and Mandarin was finished and is ready to be released. It contains about 49 hours of speech and 247,000 utterances.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127234042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multiple-Translation Spotting for Mandarin-Taiwanese Speech-to-Speech Translation
Jhing-Fa Wang, Shun-Chieh Lin, Hsueh-Wei Yang, Fan-Min Li
Pub Date: 2004-08-01 | DOI: 10.30019/IJCLCLP.200408.0002
The critical issues involved in speech-to-speech translation are obtaining proper source segments and synthesizing accurate target speech. Therefore, this article develops a novel multiple-translation spotting method to deal with these issues efficiently. The term multiple-translation spotting refers to the task of extracting target-language synthesis patterns that correspond to a given set of spotted source-language patterns, given multiple pairs of speech patterns known to be translation patterns. From the extracted synthesis patterns, the target speech can be properly synthesized using a waveform-segment concatenation-based synthesis method. Experiments were conducted with Mandarin and Taiwanese. The results reveal that the proposed approach can achieve translation understanding rates of 80% and 76% on average for Mandarin-to-Taiwanese and Taiwanese-to-Mandarin translation, respectively.
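The method itself operates on speech patterns and synthesizes by concatenating waveform segments; purely as a structural illustration, the Python sketch below replays the same control flow at the symbol level: spot known source-language patterns in the input, look up the paired target-language synthesis patterns, and concatenate them. The pattern table and tokens are invented, and the greedy longest-match spotting is an assumption, not the paper's actual spotting procedure.

# Symbol-level analogue of translation spotting and concatenative synthesis.
PATTERN_PAIRS = {
    "几点": "啥物时阵",      # Mandarin pattern -> Taiwanese pattern (toy)
    "去台北": "去台北",
}

def spot_and_synthesize(source_tokens):
    """Greedy left-to-right spotting of known source patterns; returns
    the concatenated target-side synthesis patterns."""
    out, i = [], 0
    while i < len(source_tokens):
        for span in range(len(source_tokens), i, -1):   # longest match first
            pat = "".join(source_tokens[i:span])
            if pat in PATTERN_PAIRS:
                out.append(PATTERN_PAIRS[pat])
                i = span
                break
        else:
            i += 1                                       # no pattern starts here
    return " ".join(out)

print(spot_and_synthesize(["去台北", "几点"]))  # -> 去台北 啥物时阵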
{"title":"Multiple-Translation Spotting for Mandarin-Taiwanese Speech-to-Speech Translation","authors":"Jhing-Fa Wang, Shun-Chieh Lin, Hsueh-Wei Yang, Fan-Min Li","doi":"10.30019/IJCLCLP.200408.0002","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200408.0002","url":null,"abstract":"The critical issues involved in speech-to-spe ech translation are obtaining proper source segments and synthesizing accurate target speech. Therefore, this article develops a novel multiple-translation spotting method to deal with these issues efficiently. Term multiple-translation spotting refers to the task of extracting target-language synthesis patterns that correspond to a given set of source-language spotted patterns in conditional multiple pairs of speech patterns known to be translation patterns. According to the extracted synthesis patterns, the target speech can be properly synthesized by using a waveform segment concatenation-based synthesis method. Experiments were conducted with the languages of Mandarin and Taiwanese. The results reveal that the proposed approach can achieve translation understanding rates of 80% and 76% on average for Mandarin/Taiwanese translation and Taiwanese/Mandarin translation, respectively.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134486451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Properties and Further Applications of Chinese Frequent Strings
Yih-Jeng Lin, Ming-Shing Yu
Pub Date: 2004-02-01 | DOI: 10.30019/IJCLCLP.200402.0007
This paper reveals some important properties of Chinese frequent strings (CFSs) and their applications in Chinese natural language processing (NLP). We previously proposed a method for extracting Chinese frequent strings that contain unknown words from a Chinese corpus [Lin and Yu 2001]. We found that CFSs contain many 4-character strings, 3-word strings, and longer n-grams. Such information can only be derived from an extremely large corpus using a traditional language model (LM). In contrast to a traditional LM, CFSs let us achieve high precision and efficiency in Chinese toneless phoneme-to-character conversion and in correcting Chinese spelling errors with a small training corpus. An accuracy rate of 92.86% was achieved for Chinese toneless phoneme-to-character conversion, and an accuracy rate of 87.32% was achieved for Chinese spelling error correction. We also attempted to assign syntactic categories to CFSs; the accuracy rate was 88.53% in outside testing when the syntactic categories of the highest level were used.
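As an illustration of how a CFS inventory can drive toneless phoneme-to-character conversion, the sketch below greedily converts a toneless-pinyin syllable sequence by longest match against a frequent-string dictionary. The tiny CFS table is made up, and the paper's actual decoding procedure may well differ; the sketch only shows why long frequent strings reduce ambiguity relative to syllable-by-syllable conversion.

# Greedy longest-match toneless phoneme-to-character conversion (toy CFS table).
CFS = {
    "zhongguo": "中国",
    "ren": "人",
    "zhong": "中",
    "guo": "国",
}

def phonemes_to_chars(syllables):
    """Convert a toneless syllable sequence, preferring the longest CFS."""
    out, i = [], 0
    while i < len(syllables):
        for span in range(len(syllables), i, -1):
            key = "".join(syllables[i:span])
            if key in CFS:
                out.append(CFS[key])
                i = span
                break
        else:
            out.append(syllables[i])        # back off: keep the raw syllable
            i += 1
    return "".join(out)

print(phonemes_to_chars(["zhong", "guo", "ren"]))  # -> 中国人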
Mencius: A Chinese Named Entity Recognizer Using the Maximum Entropy-based Hybrid Model
Richard Tzong-Han Tsai, Shih-Hung Wu, Cheng-Wei Lee, Cheng-Wei Shih, W. Hsu
Pub Date: 2004-02-01 | DOI: 10.30019/IJCLCLP.200402.0004
This paper presents a Chinese named entity recognizer (NER): Mencius. It aims to address Chinese NER problems by combining the advantages of rule-based and machine learning (ML) based NER systems. Rule-based NER systems can explicitly encode human comprehension and can be tuned conveniently, while ML-based systems are robust, portable and inexpensive to develop. Our hybrid system incorporates a rule-based knowledge representation and template-matching tool, called InfoMap [Wu et al. 2002], into a maximum entropy (ME) framework. Named entities are represented in InfoMap as templates, which serve as ME features in Mencius. These features are edited manually, and their weights are estimated by the ME framework according to the training data. To understand how word segmentation might influence Chinese NER and how a pure template-based method differs from our hybrid method, we configure Mencius using four distinct settings. In our experiment, the F-measures of the best configuration for person names (PER), location names (LOC) and organization names (ORG) were 94.3%, 77.8% and 75.3%, respectively. Comparing the results obtained with these configurations reveals that hybrid NER systems consistently perform better in identifying person names but have some difficulty identifying location and organization names. Furthermore, using a word segmentation module improves the performance of pure template-based NER systems, but it has little effect on hybrid NER systems.
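A toy rendering of the hybrid idea follows: template matches become binary features in a maximum-entropy classifier whose weights are estimated from training data. Here naive lexicon checks stand in for InfoMap template matching, and scikit-learn's LogisticRegression stands in for the ME framework; the templates and training tokens are invented for illustration.

# Template-style binary features inside a maxent classifier (illustrative).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

SURNAMES = {"王", "李", "陈"}
LOC_SUFFIX = {"市", "县", "省"}

def token_features(tok):
    return {
        "starts_with_surname": tok[0] in SURNAMES,   # template-style match
        "has_loc_suffix": tok[-1] in LOC_SUFFIX,
        "length": len(tok),
    }

train = [("王小明", "PER"), ("李大同", "PER"), ("台北市", "LOC"),
         ("新竹县", "LOC"), ("桌子", "O"), ("喜欢", "O")]
vec = DictVectorizer()
X = vec.fit_transform(token_features(t) for t, _ in train)
maxent = LogisticRegression(max_iter=1000).fit(X, [lab for _, lab in train])

print(maxent.predict(vec.transform([token_features("陈大文")])))  # -> ['PER']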
Automatic Pronominal Anaphora Resolution in English Texts
Tyne Liang, Dian-Song Wu
Pub Date: 2004-02-01 | DOI: 10.30019/IJCLCLP.200402.0002
Anaphora is a common phenomenon in discourse as well as an important research issue in applications of natural language processing. In this paper, anaphora resolution is achieved by employing the WordNet ontology and heuristic rules. The proposed system identifies both intra-sentential and inter-sentential antecedents of anaphors. Information about animacy is obtained by analyzing the hierarchical relations of nouns and verbs in the surrounding context. The identification of animate entities and of pleonastic-it usage in English discourse is employed to improve resolution accuracy. Traditionally, anaphora resolution systems have relied on syntactic, semantic or pragmatic clues to identify the antecedent of an anaphor. Our proposed method makes use of the WordNet ontology to identify animate entities as well as essential gender information. In the animacy agreement module, the property is identified through the hypernym relation between entities and their unique beginners defined in WordNet. In addition, the verb of the entity is an important clue used to reduce uncertainty. An experiment was conducted using a balanced corpus to resolve the pronominal anaphora phenomenon. The methods proposed in [Lappin and Leass 1994] and [Mitkov 2001] focus on corpora with only inanimate pronouns such as "it" or "its", so their intra-sentential and inter-sentential anaphora distributions are different. In an experiment using the Brown corpus, we found that the proportion of intra-sentential anaphora is about 60%. Seven heuristic rules are applied in our system; five of them are preference rules, and two are constraint rules. They are derived from syntactic, semantic and pragmatic conventions and from analysis of the training data. A relative measurement indicates that about 30% of the errors can be eliminated by applying the heuristic module.
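A minimal sketch of the animacy test, assuming NLTK's WordNet data is installed (nltk.download("wordnet")): a noun counts as animate when any of its senses reaches a living-entity root through the hypernym relation, mirroring the paper's use of unique beginners. The particular choice of root synsets below is a simplification, not the paper's exact inventory.

# Animacy check via WordNet hypernym closure (requires NLTK WordNet data).
from nltk.corpus import wordnet as wn

ANIMATE_ROOTS = {wn.synset("person.n.01"), wn.synset("animal.n.01")}

def is_animate(noun):
    """True if any noun sense has an animate unique beginner as ancestor."""
    for sense in wn.synsets(noun, pos=wn.NOUN):
        ancestors = set(sense.closure(lambda s: s.hypernyms()))
        if ancestors & ANIMATE_ROOTS:
            return True
    return False

print(is_animate("teacher"), is_animate("table"))  # -> True False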
Bilingual Collocation Extraction Based on Syntactic and Statistical Analyses
Chien-Cheng Wu, Jason J. S. Chang
Pub Date: 2003-09-01 | DOI: 10.30019/IJCLCLP.200402.0001
In this paper, we describe an algorithm that employs syntactic and statistical analysis to extract bilingual collocations from a parallel corpus. Collocations are pervasive in all types of writing and can be found in phrases, chunks, proper names, idioms, and terminology. Therefore, automatic extraction of monolingual and bilingual collocations is important for many applications, including natural language generation, word sense disambiguation, machine translation, lexicography, and cross-language information retrieval. Collocations can be classified as lexical or grammatical collocations. Lexical collocations exist between content words, while a grammatical collocation exists between a content word and function words or a syntactic structure. In addition, bilingual collocations can be rigid or flexible in both languages. In a rigid collocation, the words must appear next to each other; otherwise, the collocation is flexible (elastic). We focus in this paper on extracting rigid lexical bilingual collocations. In our method, the preferred syntactic patterns are obtained from idioms and collocations in a machine-readable dictionary. Collocations matching the patterns are extracted from aligned sentences in a parallel corpus. We use a new alignment method based on punctuation statistics for sentence alignment; this punctuation-based approach outperforms the length-based approach, with precision rates approaching 98%. The obtained collocations are subsequently matched up on the basis of cross-linguistic statistical association: statistical association between whole collocations, as well as between words in collocations, is used to link a collocation with its counterpart in the other language. We implemented the proposed method on a very large Chinese-English parallel corpus and obtained satisfactory results.
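To illustrate the punctuation cue used for sentence alignment, the sketch below scores a candidate Chinese/English sentence pair by how well their punctuation sequences agree after mapping full-width marks to their ASCII counterparts. The real method models punctuation correspondence statistically; this mapping table and the matched-over-total score are simplifications for illustration.

# Punctuation-agreement score for a candidate sentence pair (simplified).
PUNCT_MAP = {"，": ",", "。": ".", "？": "?", "！": "!", "；": ";"}

def punct_seq(text, table=None):
    """Extract the in-order punctuation sequence, optionally normalized."""
    marks = set(PUNCT_MAP) | set(PUNCT_MAP.values())
    seq = [c for c in text if c in marks]
    return [table.get(c, c) for c in seq] if table else seq

def punct_score(zh, en):
    """Fraction of punctuation marks that match in order."""
    z = punct_seq(zh, PUNCT_MAP)
    e = punct_seq(en)
    matches = sum(a == b for a, b in zip(z, e))
    return matches / max(len(z), len(e), 1)

print(punct_score("他来了，我们走。", "He came, so we left."))  # -> 1.0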
{"title":"Bilingual Collocation Extraction Based on Syntactic and Statistical Analyses","authors":"Chien-Cheng Wu, Jason J. S. Chang","doi":"10.30019/IJCLCLP.200402.0001","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200402.0001","url":null,"abstract":"In this paper, we describe an algorithm that employs syntactic and statistical analysis to extract bilingual collocations from a parallel corpus. Collocations are pervasive in all types of writing and can be found in phrases, chunks, proper names, idioms, and terminology. Therefore, automatic extraction of monolingual and bilingual collocations is important for many applications, including natural language generation, word sense disambiguation, machine translation, lexicography, and cross language information retrieval. Collocations can be classified as lexical or grammatical collocations. Lexical collocations exist between content words, while a grammatical collocation exists between a content word and function words or a syntactic structure. In addition, bilingual collocations can be rigid or flexible in both languages. Rigid collocation refers to words in a collocation must appear next to each other, or otherwise (flexible/elastic). We focus in this paper on extracting rigid lexical bilingual collocations. In our method, the preferred syntactic patterns are obtained from idioms and collocations in a machine-readable dictionary. Collocations matching the patterns are extracted from aligned sentences in a parallel corpus. We use a new alignment method based on punctuation statistics for sentence alignment. The punctuation-based approach is found to outperform the length-based approach with precision rates approaching 98%. The obtained collocations are subsequently matched up based on cross-linguistic statistical association. Statistical association between the whole collocations as well as words in collocations is used to link a collocation with its counterpart collocation in the other language. We implemented the proposed method on a very large Chinese-English parallel corpus and obtained satisfactory results.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115553036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}