Pub Date: 2003-02-01 | DOI: 10.30019/IJCLCLP.200302.0003
Eiji Nishimoto
The present study attempts to measure and compare the morphological productivity of five Mandarin Chinese suffixes: the verbal suffix -hua, the plural suffix -men, and the nominal suffixes -r, -zi, and -tou. These suffixes are predicted to differ in their degree of productivity: -hua and -men appear to be productive, systematically forming words with a variety of base words, whereas -zi and -tou (and perhaps also -r) may be limited in productivity. Baayen [1989, 1992] proposes the use of corpus data to measure productivity in word formation. Based on word-token frequencies in a large corpus of texts, his token-based measure expresses productivity as the probability that a new word form of an affix will be encountered in a corpus. We first use the token-based measure to examine the productivity of the Mandarin suffixes. The present study then proposes a type-based measure of productivity that employs the deleted estimation method [Jelinek & Mercer, 1985] to define the unseen words of a corpus and expresses productivity as the ratio of unseen word types to all word types. The proposed type-based measure yields the productivity ranking "-men, -hua, -r, -zi, -tou," where -men is the most productive and -tou the least productive. The effects of corpus-data variability on a productivity measure are also examined; the proposed measure is found to yield a consistent productivity ranking despite variability in the corpus data.
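The core of the type-based measure can be sketched in a few lines. This is a toy illustration with hypothetical suffixed forms, using a simple two-way corpus split in place of the full Jelinek & Mercer deleted-estimation procedure: a type seen in only one half stands in for a word the other half has "not yet seen."

```python
def type_based_productivity(tokens):
    """Estimate productivity as the ratio of unseen word types to all
    word types: types occurring in only one half of the corpus
    approximate the unseen types a held-out portion would contribute."""
    half = len(tokens) // 2
    part_a, part_b = set(tokens[:half]), set(tokens[half:])
    all_types = part_a | part_b
    unseen = part_a ^ part_b  # types found in exactly one half
    return len(unseen) / len(all_types)

# Hypothetical toy corpora of suffixed forms (-men vs. -tou):
men_tokens = ["ren-men", "haizi-men", "laoshi-men", "pengyou-men",
              "xuesheng-men", "ren-men", "tongzhi-men", "gongren-men"]
tou_tokens = ["mu-tou", "shi-tou", "mu-tou", "shi-tou",
              "mu-tou", "shi-tou", "mu-tou", "shi-tou"]

print(type_based_productivity(men_tokens))  # 6/7: many half-only types
print(type_based_productivity(tou_tokens))  # 0.0: both types recur in both halves
```

A productive suffix keeps producing new types as more text is sampled, so its unseen-type ratio stays high; an unproductive one exhausts its few types early.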
{"title":"Measuring and Comparing the Productivity of Mandarin Chinese Suffixes","authors":"Eiji Nishimoto","doi":"10.30019/IJCLCLP.200302.0003","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200302.0003","url":null,"abstract":"The present study attempts to measure and compare the morphological productivity of five Mandarin Chinese suffixes: the verbal suffix -hua, the plural suffix -men, and the nominal suffixes -r, -zi, and -tou. These suffixes are predicted to differ in their degree of productivity : -hua and -men appear to be productive, being able to systematically form a word with a variety of base words, whereas -zi and -tou (and perhaps also -r) may be limited in productivity. Baayen [1989, 1992] proposes the use of corpus data in measuring productivity in word formation. Based on word-token frequencies in a large corpus of texts, his token-based measure of productivity expresses productivity as the probability that a new word form of an affix will be encountered in a corpus. We first use the token-based measure to examine the productivity of the Mandarin suffixes. The present study, then, proposes a type-based measure of productivity that employs the deleted estimation method [Jelinek & Mercer, 1985] in defining unseen words of a corpus and expresses productivity by the ratio of unseen word types to all word types. The proposed type-based measure yields the productivity ranking “-men, -hua, -r, -zi, -tou,” where -men is the most productive and -tou is the least productive. The effects of corpus-data variability on a productivity measure are also examined. The proposed measure is found to obtain a consistent productivity ranking despite variability in corpus data.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. 
Process.","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129925553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2002-08-01 | DOI: 10.30019/IJCLCLP.200208.0002
Keh-Jiann Chen, Jia-Ming You
There is a need to measure word similarity when processing natural language, especially when using generalization, classification, or example-based approaches. Usually, the similarity between two words is defined by the distance between their semantic classes in a semantic taxonomy. Such taxonomy-based approaches are essentially semantic and do not consider syntactic similarity. In real applications, however, both semantic and syntactic similarities are required and are weighted differently. Word similarity based on context vectors is a mixture of syntactic and semantic similarity. In this paper, we propose using only syntactically related co-occurrences as context vectors, and we adopt information-theoretic models to address the problems of data sparseness and characteristic precision. The probability distribution of co-occurrence context features is derived by parsing the contextual environment of each word, and all context features are weighted according to their IDF (inverse document frequency) values. An agglomerative clustering algorithm is applied to group similar words according to their similarity values. It turns out that words with similar syntactic categories and semantic classes are grouped together.
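The IDF-weighted context-vector idea can be sketched as follows. The co-occurrence counts and document frequencies below are hypothetical toy data, not figures from the paper, and the parsing step is assumed to have already produced the syntactic co-occurrence counts:

```python
import math

def idf_weighted_vectors(cooc, n_docs, df):
    """Weight raw syntactic co-occurrence counts by IDF so that
    features shared with many words contribute less to similarity."""
    return {w: {f: c * math.log(n_docs / df[f]) for f, c in feats.items()}
            for w, feats in cooc.items()}

def cosine(u, v):
    """Cosine similarity between two sparse feature vectors (dicts)."""
    dot = sum(u[f] * v[f] for f in set(u) & set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical syntactically related co-occurrence counts:
cooc = {
    "apple":  {"eat": 4, "red": 2, "the": 9},
    "orange": {"eat": 3, "red": 1, "the": 8},
    "run":    {"fast": 5, "the": 7},
}
df = {"eat": 2, "red": 2, "the": 10, "fast": 1}
vecs = idf_weighted_vectors(cooc, n_docs=10, df=df)
# "the" occurs everywhere, so its IDF is log(10/10) = 0 and it drops out.
print(cosine(vecs["apple"], vecs["orange"]))  # high (about 0.99)
print(cosine(vecs["apple"], vecs["run"]))     # 0.0: no weighted overlap
```

The similarity values computed this way would then feed the agglomerative clustering step, merging the closest word pairs first.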
{"title":"A Study on Word Similarity using Context Vector Models","authors":"Keh-Jiann Chen, Jia-Ming You","doi":"10.30019/IJCLCLP.200208.0002","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200208.0002","url":null,"abstract":"There is a need to measure word similarity when processing natural languages, especially when using generalization, classification, or example-based approaches. Usually, measures of similarity between two words are defined according to the distance between their semantic classes in a semantic taxonomy. The taxonomy approaches are more or less semantic-based that do not consider syntactic similarities. However, in real applications, both semantic and syntactic similarities are required and weighted differently. Word similarity based on context vectors is a mixture of syntactic and semantic similarities. In this paper, we propose using only syntactic related co-occurrences as context vectors and adopt information theoretic models to solve the problems of data sparseness and characteristic precision. The probabilistic distribution of co-occurrence context features is derived by parsing the contextual environment of each word, and all the context features are adjusted according to their IDF (inverse document frequency) values. The agglomerative clustering algorithm is applied to group similar words according to their similarity values. It turns out that words with similar syntactic categories and semantic classes are grouped together.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. 
Process.","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133921487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2001-08-01 | DOI: 10.30019/IJCLCLP.200108.0001
Louis Wei-lun Lu
In this synchronic study, I adopt a corpus-based approach to investigate the semantic change of V-diao in Mandarin. Semantically, V-diao constructions fall into three categories: A) physical disappearance from an original position, with the V slot filled by physical verbs such as tao-diao "escape" and diu-diao "throw away"; B) disappearance from a conceptual domain rather than from physical space, with the V slot filled by less physically perceivable verbs such as jie-diao "quit" and wang-diao "forget"; and C) constructions that involve the speaker's subjective, always negative, attitude toward the result, such as lan-diao "rot," ruan-diao "soften," and huang-diao "yellow." This paper claims that the polysemy between types A and B is motivated by metaphorical transfer [Sweetser, 1990; Bybee, Perkins and Pagliuca, 1994; Heine, Claudi and Hunnemeyer, 1991]. Building roughly on Huang and Chang [1996], I demonstrate that a cognitive restriction on verb selection causes the further repetitive occurrence of negative verbs in the V slot. Finally, I claim that pragmatic strengthening [Hopper and Traugott, 1993; Bybee, Perkins and Pagliuca, 1994] contributes to the emergence of the unfavourable meaning in type C. This research can hopefully serve as a valid argument for the interaction of language use and grammar, and for the conceptual basis of human language.
{"title":"Metaphorical Transfer and Pragmatic Strengthening: On the Development of V-diao in Mandarin","authors":"Louis Wei-lun Lu","doi":"10.30019/IJCLCLP.200108.0001","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200108.0001","url":null,"abstract":"In this synchronic study, I shall adopt a corpus-based approach to investigate the semantic change of V-diao in Mandarin. Semantically, V-diao constructions fall into three categories: A) Physical disappearance from its original position, with the V slot filled by physical verbs, such as tao-diao ”escape,” diu-diao ”throw away,” and so on. B) Disappearance from a certain conceptual domain, rather than from the physical space, with the V slot filled by less physically perceivable verbs, such as jie-diao ”quit,” wang-diao ”forget,” and the like. C) The third category of V-diao involves the speaker's subjective, always negative, attitude toward the result. Examples include: lan-diao ”rot,” ruan-diao ”soften,” huang-diao ”yellow,” and so forth. It is claimed in this paper that the polysemy between types A and B is motivated by metaphorical transfer [Sweetser, 1990; Bybee, Perkins and Pagliuca, 1994; Heine, Claudi and Hunnemeyer, 1991]. Based roughly on Huang and Chang [1996], I demonstrate that a cognitive restriction on selection of the verb will cause further repetitive occurrence of negative verbs in the V slot. Finally, I shall claim that pragmatic strengthening [Hopper and Traugott, 1993; Bybee, Perkins and Pagliuca, 1994] contributes to the emergence of unfavourable meaning in Type C. Hopefully, this research can serve as a valid argument for the interaction of language use and grammar, and the conceptual basis of human language.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. 
Process.","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125216569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2001-02-01 | DOI: 10.30019/IJCLCLP.200102.0002
Jianfeng Gao, Joshua Goodman, J. Miao
Cluster-based n-gram modeling is a variant of normal word-based n-gram modeling that attempts to make use of the similarities between words. In this paper, we present an empirical study of clustering techniques for Asian language modeling. Clustering is used both to improve the performance (i.e., perplexity) of language models and to compress them. Experimental tests are presented for cluster-based trigram models on a Japanese newspaper corpus and on a heterogeneous Chinese corpus. While the majority of previous research on word clustering has focused on how to get the best clusters, we have concentrated on the best way to use the clusters. Experimental results show that some of the novel techniques we present work much better than previous methods, achieving more than a 40% size reduction at the same level of perplexity.
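The compression idea behind cluster-based n-gram models can be illustrated with a class-based bigram: instead of word-word counts, the model stores class-class transitions plus word-within-class emissions, so P(w | prev) is approximated by P(class(w) | class(prev)) * P(w | class(w)). The word-to-class map and sentences below are a hypothetical toy setup, not the paper's clustering output:

```python
from collections import Counter

def train_class_bigram(sentences, word2class):
    """Class-based bigram LM: storing class-class and word-class counts
    instead of word-word counts is what shrinks the model."""
    cc = Counter()      # (prev_class, class) transition counts
    cw = Counter()      # (class, word) emission counts
    c_tot = Counter()   # class totals
    for sent in sentences:
        classes = [word2class[w] for w in sent]
        for w, c in zip(sent, classes):
            cw[c, w] += 1
            c_tot[c] += 1
        for c1, c2 in zip(classes, classes[1:]):
            cc[c1, c2] += 1
    def prob(prev, w):
        c1, c2 = word2class[prev], word2class[w]
        p_cc = cc[c1, c2] / sum(v for (a, _), v in cc.items() if a == c1)
        p_wc = cw[c2, w] / c_tot[c2]
        return p_cc * p_wc
    return prob

word2class = {"the": "DET", "a": "DET", "cat": "N",
              "dog": "N", "runs": "V", "sleeps": "V"}
sents = [["the", "cat", "runs"], ["a", "dog", "sleeps"], ["the", "dog", "runs"]]
p = train_class_bigram(sents, word2class)
print(p("the", "cat"))  # P(N | DET) * P(cat | N) = 1 * 1/3
```

With C classes and V words, the table sizes drop from O(V^2) toward O(C^2 + V), which is the trade-off the size-reduction results exploit.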
{"title":"The Use of Clustering Techniques for Language Modeling-Application to Asian Language","authors":"Jianfeng Gao, Joshua Goodman, J. Miao","doi":"10.30019/IJCLCLP.200102.0002","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200102.0002","url":null,"abstract":"Cluster-based n-gram modeling is a variant of normal word-based n-gram modeling. It attempts to make use of the similarities between words. In this paper, we present an empirical study of clustering techniques for Asian language modeling. Clustering is used to improve the performance (i.e. perplexity) of language models as well as to compress language models. Experimental tests are presented for cluster-based trigram models on a Japanese newspaper corpus and on a Chinese heterogeneous corpus. While the majority of previous research on word clustering has focused on how to get the best clusters, we have concentrated our research on the best way to use the clusters. Experimental results show that some novel techniques we present work much better than previous methods, and achieve more than 40% size reduction at the same level of perplexity.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129963074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2001-02-01 | DOI: 10.30019/IJCLCLP.200102.0003
Min Chu, Yao Qian
This paper proposes a three-tier prosodic hierarchy for Mandarin, comprising prosodic word, intermediate phrase, and intonational phrase tiers, that emphasizes the use of the prosodic word rather than the lexical word as the basic prosodic unit. Both surface and perceptual differences show that this helps achieve high naturalness in text-to-speech conversion. Three approaches are presented for locating the boundaries of the three prosodic constituents in unrestricted Mandarin texts: a basic CART approach, a bottom-up hierarchical approach, and a modified hierarchical approach. Two feature sets are used in the basic CART method, one with syntactic phrasal information and one without; the one with syntactic phrasal information yields about a 1% increase in accuracy and an 11% decrease in error cost. The modified hierarchical method produces the highest accuracy, 83%, and the lowest error cost when no syntactic phrasal information is provided. It shows advantages in detecting intonational-phrase boundaries at locations without breaking punctuation, achieving 71.1% precision and 52.4% recall. Experiments on acceptability reveal that only 26% of the mis-assigned break indices are genuinely infelicitous errors, and that the perceptual difference between automatically assigned and manually annotated break indices is small.
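The precision and recall figures for boundary detection are computed by comparing predicted break positions against manually annotated ones. A minimal sketch, with hypothetical token indices standing in for annotated boundary locations:

```python
def boundary_prf(gold, predicted):
    """Precision and recall for predicted prosodic-boundary positions
    against a manually annotated reference (positions as token indices)."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # boundaries found in both
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical gold vs. predicted intonational-phrase boundaries:
gold_breaks = {4, 9, 15, 22}
pred_breaks = {4, 9, 17}
print(boundary_prf(gold_breaks, pred_breaks))  # (2/3, 0.5)
```

Precision penalizes spurious breaks (which disrupt naturalness) while recall penalizes missed ones, which is why both are reported for the punctuation-free locations.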
{"title":"Locating Boundaries for Prosodic Constituents in Unrestricted Mandarin Texts","authors":"Min Chu, Yao Qian","doi":"10.30019/IJCLCLP.200102.0003","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200102.0003","url":null,"abstract":"This paper proposes a three-tier prosodic hierarchy, including prosodic word, intermediate phrase and intonational phrase tiers, for Mandarin that emphasizes the use of the prosodic word instead of the lexical word as the basic prosodic unit. Both the surface difference and perceptual difference show that this is helpful for achieving high naturalness in text-to-speech conversion. Three approaches, the basic CART approach, the bottom-up hierarchical approach and the modified hierarchical approach, are presented for locating the boundaries of three prosodic constituents in unrestricted Mandarin texts. Two sets of features are used in the basic CART method: one contains syntactic phrasal information and the other does not. The one with syntactic phrasal information results in about a 1% increase in accuracy and an 11% decrease in error-cost. The performance of the modified hierarchical method produces the highest accuracy, 83%, and lowest error cost when no syntactic phrasal information is provided. It shows advantages in detecting the boundaries of intonational phrases at locations without breaking punctuation. 71.1% precision and 52.4% recall are achieved. Experiments on acceptability reveal that only 26% of the mis-assigned break indices are real infelicitous errors, and that the perceptual difference between the automatically assigned break indices and the manually annotated break indices are small.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. 
Process.","volume":"147 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131593557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2000-08-01 | DOI: 10.30019/IJCLCLP.200008.0003
M. Hasan, Yuji Matsumoto
Electronically available multilingual information can be divided into two major categories: (1) alphabetic-language information (English-like alphabetic languages) and (2) ideographic-language information (Chinese-like ideographic languages). The information available in non-English alphabetic languages and in ideographic languages (especially Japanese and Chinese) has been growing at an incredibly high rate in recent years. Due to the ideographic nature of Japanese and Chinese, complicated by the several encoding standards in use, efficient processing (representation, indexing, retrieval, etc.) of such information becomes a tedious task. In this paper, we propose a Han character (Kanji) oriented Interlingua model for indexing and retrieving Japanese and Chinese information. We report the results of mono- and cross-language information retrieval in a Kanji space where documents and queries are represented as Kanji-oriented vectors. We also employ a dimensionality reduction technique to compute a Kanji Conceptual Space (KCS) from the initial Kanji space, which can facilitate conceptual retrieval of both mono- and cross-language information for these languages. Similar indexing approaches for multiple European languages, through term association (e.g., latent semantic indexing) or through conceptual mapping (using a lexical ontology such as WordNet), are being intensively explored. The Interlingua approach investigated here for Japanese and Chinese and the term (or concept) association models investigated for European languages are similar, and these approaches can easily be integrated. Therefore, the proposed Interlingua model can pave the way for handling multilingual information access and retrieval efficiently and uniformly.
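The KCS construction is in the spirit of latent semantic indexing: a document-by-Kanji count matrix is factored with a truncated SVD, and documents (and queries) are compared in the reduced space. The matrix below is hypothetical toy data, and the specific reduction used by the paper may differ; this sketch just shows the LSI-style mechanics:

```python
import numpy as np

def kanji_conceptual_space(doc_kanji_matrix, k):
    """Project a document-by-Kanji count matrix onto k latent
    dimensions via truncated SVD, yielding a 'conceptual' space in
    which documents and queries from either language are comparable."""
    U, S, Vt = np.linalg.svd(doc_kanji_matrix, full_matrices=False)
    return U[:, :k] * S[:k]  # document coordinates in the reduced space

# Hypothetical shared-Han-character counts for 4 documents x 4 characters:
X = np.array([[3, 0, 1, 0],
              [2, 0, 1, 0],
              [0, 4, 0, 2],
              [0, 3, 0, 1]], dtype=float)
docs_2d = kanji_conceptual_space(X, k=2)
print(docs_2d.shape)  # (4, 2): each document as a 2-d conceptual vector
```

Documents 0 and 1 (which share characters) end up nearly collinear in the reduced space, while documents drawing on disjoint character sets stay orthogonal, which is what enables conceptual rather than literal matching.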
{"title":"Japanese-Chinese Cross-Language Information Retrieval: An Interlingua Apporach","authors":"M. Hasan, Yuji Matsumoto","doi":"10.30019/IJCLCLP.200008.0003","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200008.0003","url":null,"abstract":"Electronically available multilingual information can be divided into two major categories: (1) alphabetic language information (English-like alphabetic languages) and (2) ideographic language information (Chinese-like ideographic languages). The information available in non-English alphabetic languages as well as in ideographic languages (especially, in Japanese and Chinese) is growing at an incredibly high rate in recent years. Due to the ideographic nature of Japanese and Chinese, complicated with the existence of several encoding standards in use, efficient processing (representation, indexing, retrieval, etc.) of such information became a tedious task. In this paper, we propose a Han Character (Kanji) oriented Interlingua model of indexing and retrieving Japanese and Chinese information. We report the results of mono- and cross- language information retrieval on a Kanji space where documents and queries are represented in terms of Kanji oriented vectors. We also employ a dimensionality reduction technique to compute a Kanji Conceptual Space (KCS) from the initial Kanji space, which can facilitate conceptual retrieval of both mono- and cross- language information for these languages. Similar indexing approaches for multiple European languages through term association (e.g., latent semantic indexing) or through conceptual mapping (using lexical ontology such as, WordNet) are being intensively explored. The Interlingua approach investigated here with Japanese and Chinese languages, and the term (or concept) association model investigated with the European languages are similar; and these approaches can be easily integrated. 
Therefore, the proposed Interlingua model can pave the way for handling multilingual information access and retrieval efficiently and uniformly.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123728241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2000-08-01 | DOI: 10.30019/IJCLCLP.200008.0004
R. H. Shih
This paper presents the mechanisms of and criteria for compiling a new learner corpus of English, the quantitative characteristics of the corpus, and a practical example of its pedagogical application. The Taiwanese Learner Corpus of English (TLCE), probably the largest annotated learner corpus of English in Taiwan so far, contains 2,105 pieces of English writing (around 730,000 words) by Taiwanese college students majoring in English. It is a useful resource for scholars in Second Language Acquisition (SLA) and English Language Teaching (ELT) who wish to find out how people in Taiwan learn English and how to help them learn better. The quantitative information presented here reflects the characteristics of learner English in terms of part-of-speech distribution, lexical density, and trigram distribution. The usefulness of the corpus is demonstrated by means of a corpus-based investigation of learners' lack of adverbial collocation knowledge.
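Of the quantitative measures mentioned, lexical density is the simplest to compute from a POS-tagged corpus: the proportion of content words among all tokens. A minimal sketch on a hypothetical tagged learner sentence (the tag set here is an assumption, not the corpus's actual annotation scheme):

```python
def lexical_density(tagged_tokens,
                    content_tags=frozenset({"NOUN", "VERB", "ADJ", "ADV"})):
    """Lexical density: fraction of content-word tokens among all tokens."""
    content = sum(1 for _, tag in tagged_tokens if tag in content_tags)
    return content / len(tagged_tokens)

# Hypothetical POS-tagged learner sentence:
sent = [("I", "PRON"), ("really", "ADV"), ("enjoy", "VERB"),
        ("learning", "VERB"), ("English", "NOUN"), (".", "PUNCT")]
print(lexical_density(sent))  # 4/6: four content words out of six tokens
```

Comparing this figure against native-speaker corpora is one way such a measure characterizes learner English.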
{"title":"Compiling Taiwanese Learner Corpus of English","authors":"R. H. Shih","doi":"10.30019/IJCLCLP.200008.0004","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200008.0004","url":null,"abstract":"This paper presents the mechanisms of and criteria for compiling a new learner corpus of English, the quantitative characteristics of the corpus and a practical example of its pedagogical application. The Taiwanese Learner Corpus of English (TLCE), probably the largest annotated learner corpus of English in Taiwan so far, contains 2105 pieces of English writing (around 730,000 words) from Taiwanese college students majoring in English. It is a useful resource for scholars in Second Language Acquisition (SLA) and English Language Teaching (ELT) areas who wish to find out how people in Taiwan learn English and how to help them learn better. The quantitative information shown in the work reflects the characteristics of learner English in terms of part-of-speech distribution, lexical density, and trigram distribution. The usefulness of the corpus is demonstrated by a means of corpus-based investigation of learners' lack of adverbial collocation knowledge.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121589574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2000-08-01 | DOI: 10.30019/IJCLCLP.200008.0002
Jyh-Jong Tsay, Jing-doo Wang
In this paper, we propose and evaluate approaches to categorizing Chinese texts that consist of term extraction, term selection, term clustering, and text classification. We propose a scalable approach that uses frequency counts to identify the left and right boundaries of possibly significant terms, and we use the combination of term selection and term clustering to reduce the dimension of the vector space to a practical level. While the huge number of possible Chinese terms makes most machine learning algorithms impractical, results obtained in an experiment on a CAN news collection show that our approach could dramatically reduce the dimension to 1,200 while maintaining approximately the same level of classification accuracy. We also studied and compared the performance of three well-known classifiers, the Rocchio linear classifier, the naive Bayes probabilistic classifier, and the k-nearest-neighbors (kNN) classifier, when applied to categorizing Chinese texts. Overall, kNN achieved the best accuracy, about 78.3%, but required large amounts of computation time and memory when classifying new texts. Rocchio was very time- and memory-efficient and achieved a high level of accuracy, about 75.4%. In practical implementations, Rocchio may be a good choice.
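The efficiency of Rocchio comes from reducing each class to a single centroid at training time, so classification only compares the new text against one vector per class. A minimal sketch with hypothetical segmented "documents" as term lists (raw counts here; the paper's actual term weighting may differ):

```python
import math
from collections import Counter, defaultdict

def train_rocchio(labeled_docs):
    """Rocchio linear classifier: each class is the centroid of its
    training-document term vectors; a new text gets the class whose
    centroid has the highest cosine similarity."""
    sums, counts = defaultdict(Counter), Counter()
    for terms, label in labeled_docs:
        sums[label].update(terms)
        counts[label] += 1
    centroids = {lab: {t: c / counts[lab] for t, c in vec.items()}
                 for lab, vec in sums.items()}
    def classify(terms):
        v = Counter(terms)
        nv = math.sqrt(sum(x * x for x in v.values()))
        def cos(u):
            dot = sum(v[t] * u.get(t, 0.0) for t in v)
            nu = math.sqrt(sum(x * x for x in u.values()))
            return dot / (nu * nv) if nu and nv else 0.0
        return max(centroids, key=lambda lab: cos(centroids[lab]))
    return classify

# Hypothetical training documents (already segmented into terms):
train = [(["stock", "market", "price"], "finance"),
         (["market", "trade", "price"], "finance"),
         (["game", "team", "score"], "sports"),
         (["team", "win", "score"], "sports")]
classify = train_rocchio(train)
print(classify(["price", "market"]))  # finance
```

By contrast, kNN must keep every training vector and score the new text against all of them, which explains the time and memory cost the abstract reports.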
{"title":"Design and Evaluation of Approaches for Automatic Chinese Text","authors":"Jyh-Jong Tsay, Jing-doo Wang","doi":"10.30019/IJCLCLP.200008.0002","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200008.0002","url":null,"abstract":"In this paper, we propose and evaluate approaches to categorizing Chinese texts, which consist of term extraction, term selection, term clustering and text classification. We propose a scalable approach which uses frequency counts to identify left and right boundaries of possibly significant terms. We used the combination of term selection and term clustering to reduce the dimension of the vector space to a practical level. While the huge number of possible Chinese terms makes most of the machine learning algorithms impractical, results obtained in an experiment on a CAN news collection show that the dimension could be dramatically reduced to 1200 while approximately the same level of classification accuracy was maintained using our approach. We also studied and compared the performance of three well known classifiers, the Rocchio linear classifier, naive Bayes probabilistic classifier and k-nearest neighbors (kNN) classifier, when they were applied to categorize Chinese texts. Overall, kNN achieved the best accuracy, about 78.3%, but required large amounts of computation time and memory when used to classify new texts. Rocchio was very time and memory efficient, and achieved a high level of accuracy, about 75.4%. In practical implementation, Rocchio may be a good choice.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. 
Process.","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125729162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2000-08-01 | DOI: 10.30019/IJCLCLP.200008.0001
Jen-Nan Chen
This paper describes a general framework for adaptive conceptual word sense disambiguation. The proposed system begins with knowledge acquisition from machine-readable dictionaries. Central to the approach is an adaptive step that enriches the initial knowledge base with knowledge gleaned from the partially disambiguated text. Once the knowledge base has been adjusted to suit the text at hand, it is applied to the text again to finalize the disambiguation decision. Definitions and example sentences from the Longman Dictionary of Contemporary English are employed as training material for word sense disambiguation, while passages from the Brown corpus and Wall Street Journal (WSJ) articles are used for testing. An experiment showed that adaptation significantly improved the success rate: for thirteen highly ambiguous words, the proposed method disambiguated with an average precision of 70.5% on the Brown corpus and 77.3% on the WSJ articles.
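The dictionary-based starting point of such a system can be illustrated with a simplified Lesk-style overlap between a word's context and each sense definition. This is a generic illustration of the idea, not the paper's full adaptive procedure, and the definitions below are hypothetical paraphrases rather than actual LDOCE entries:

```python
def lesk_overlap(context_words, sense_definitions):
    """Pick the sense whose definition shares the most words with the
    surrounding context: a dictionary-based first pass of the kind an
    adaptive framework could then refine."""
    context = set(w.lower() for w in context_words)
    scores = {sense: len(context & set(defn.lower().split()))
              for sense, defn in sense_definitions.items()}
    return max(scores, key=scores.get)

# Hypothetical dictionary-style definitions for "bank":
senses = {
    "bank/finance": "an organization where people keep their money",
    "bank/river": "land along the side of a river",
}
print(lesk_overlap(["deposit", "money", "account"], senses))  # bank/finance
```

The adaptive step described in the abstract would then feed confidently disambiguated contexts back into the knowledge base before a second pass over the text.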
{"title":"Adaptive Word Sense Disambiguation Using Lexical Knowledge in Machine-readable Dictionary","authors":"Jen-Nan Chen","doi":"10.30019/IJCLCLP.200008.0001","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200008.0001","url":null,"abstract":"This paper describes a general framework for adaptive conceptual word sense disambiguation. The proposed system begins with knowledge acquisition from machine-readable dictionaries. Central to the approach is the adaptive step that enriches the initial knowledge base with knowledge gleaned from the partial disambiguated text. Once the knowledge base is adjusted to suit the text at hand, it is applied to the text again to finalize the disambiguation decision. Definitions and example sentences from the Longman Dictionary of Contemporary English are employed as training materials for word sense disambiguation, while passages from the Brown corpus and Wall Street Journal (WSJ) articles are used for testing. An experiment showed that adaptation did significantly improve the success rate. For thirteen highly ambiguous words, the proposed method disambiguated with an average precision rate of 70.5% for the Brown corpus and 77.3% for the WSJ articles.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116544662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2000-02-01 | DOI: 10.30019/IJCLCLP.200002.0005
Mei-Chun Liu, Chu-Ren Huang, Charles Lee, Ching-Yi Lee
Since verbal semantics began to receive attention in linguistics research, many interesting findings have been presented regarding semantic structure and meaning contrasts in the Chinese lexicon [cf. Tsai, Huang & Chen, 1996; Tsai et al., 1997; Liu, 1999]. Adopting a corpus-based approach, this paper aims to further study and fine-tune Mandarin verbal semantics by exploring the lexical information specific to verbs of throwing, with four pivotal near-synonymous members: TOU (投), ZHI (擲), DIU (丟), and RENG (扔). To account for their semantic differences, two kinds of 'endpoint' are distinguished: the Path-endpoint (i.e., the Goal role) vs. the Event-endpoint (i.e., the resultative state). These two variables are crucial for cross-categorizing the four verbs. Although all four verbs describe a directed motion with a Path in their event structure, they differ in their lexical specifications of participant roles and aspectual composition. TOU and ZHI have a specified Path-endpoint, while DIU and RENG do not. Moreover, TOU and ZHI contrast in the spatial character of the Path-endpoint they take: TOU selects a spatially bounded Path-endpoint, whereas ZHI is unspecified in this regard, as manifested by the fact that TOU collocates most frequently with a CONTAINER-introducing locative. DIU and RENG, in turn, can be differentiated in terms of event composition: only DIU, not RENG, allows an aspectual focus on the endpoint of the event contour (the Event-endpoint), since it manifests a resultative use. The observed distinctions are then incorporated into a representational paradigm, the Module-Attribute Representation of Verbal Semantics (MARVS), proposed in Huang & Ahrens [1999]. Finally, conclusions are drawn as to the most effective approach to the lexical semantic study of Mandarin, as well as theoretical implications in general.
{"title":"When Endpoint Meets Endpoint: A Corpus-based Lexical Semantic Study of Mandarin Verbs of Throwing","authors":"Mei-Chun Liu, Chu-Ren Huang, Charles Lee, Ching-Yi Lee","doi":"10.30019/IJCLCLP.200002.0005","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200002.0005","url":null,"abstract":"Since verbal semantics began to receive much attention in linguistic research, many interesting findings have been presented regarding the semantic structure or meaning contrasts in the lexicon of Chinese [cf. Tsai, Huang & Chen, 1996; Tsai et al., 1997; Liu, 1999, etc.]. Adopting a corpus-based approach, this paper aims to further study and fine-tune Mandarin verbal semantics by exploring the lexical information specific to verbs of throwing, with four pivotal near-synonymous members: TOU(投), ZHI(擲), DIU(丟), RENG(扔). To account for their semantic differences, two kinds of 'endpoints' are distinguished: the Path-endpoint (i.e., the Goal role) vs. the Event-endpoint (i.e., the resultative state). These two variables are crucial for cross-categorizing the four verbs. Although the verbs all describe a directed motion with a Path in their event structure, they differ in their lexical specifications on participant roles and aspectual composition. TOU and ZHI have a specified Path-endpoint while DIU and RENG do not specify a Path-endpoint. Moreover, TOU and ZHI can be further contrasted in terms of the spatial character of the Path-endpoint they take: TOU selects a spatially bounded Path-endpoint while that of ZHI is unspecified in this regard, as manifested by the fact that TOU collocates most frequently with a CONTAINER-introducing locative. On the other hand, DIU and RENG can be further differentiated in terms of event composition: only DIU, not RENG, allows an aspectual focus on the endpoint of the event contour (the Event-endpoint) since it manifests a resultative use. 
The observed distinctions are then incorporated into a representational paradigm called the Module-Attribute Representation of Verbal Semantics (MARVS), proposed in Huang & Ahrens [1999]. Finally, conclusions are drawn as to the most effective approach to lexical semantic study of Mandarin as well as theoretical implications in general.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"131 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132250471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}