Pub Date: 2003-02-01 | DOI: 10.30019/IJCLCLP.200302.0003
Eiji Nishimoto
The present study attempts to measure and compare the morphological productivity of five Mandarin Chinese suffixes: the verbal suffix -hua, the plural suffix -men, and the nominal suffixes -r, -zi, and -tou. These suffixes are predicted to differ in their degree of productivity: -hua and -men appear to be productive, systematically forming words with a variety of base words, whereas -zi and -tou (and perhaps also -r) may be limited in productivity. Baayen [1989, 1992] proposes the use of corpus data to measure productivity in word formation. Based on word-token frequencies in a large corpus of texts, his token-based measure expresses productivity as the probability that a new word form of an affix will be encountered in a corpus. We first use the token-based measure to examine the productivity of the Mandarin suffixes. The present study then proposes a type-based measure of productivity that employs the deleted estimation method [Jelinek & Mercer, 1985] to define the unseen words of a corpus and expresses productivity as the ratio of unseen word types to all word types. The proposed type-based measure yields the productivity ranking "-men, -hua, -r, -zi, -tou," where -men is the most productive and -tou the least productive. The effects of corpus-data variability on a productivity measure are also examined; the proposed measure is found to yield a consistent productivity ranking despite variability in the corpus data.
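The core of the type-based measure can be sketched in a few lines. This is a toy illustration with hypothetical suffixed forms, using a simple two-way corpus split in place of the full Jelinek & Mercer deleted-estimation procedure: a type seen in only one half stands in for a word the other half has "not yet seen."

```python
def type_based_productivity(tokens):
    """Estimate productivity as the ratio of unseen word types to all
    word types: types occurring in only one half of the corpus
    approximate the unseen types a held-out portion would contribute."""
    half = len(tokens) // 2
    part_a, part_b = set(tokens[:half]), set(tokens[half:])
    all_types = part_a | part_b
    unseen = part_a ^ part_b  # types found in exactly one half
    return len(unseen) / len(all_types)

# Hypothetical toy corpora of suffixed forms (-men vs. -tou):
men_tokens = ["ren-men", "haizi-men", "laoshi-men", "pengyou-men",
              "xuesheng-men", "ren-men", "tongzhi-men", "gongren-men"]
tou_tokens = ["mu-tou", "shi-tou", "mu-tou", "shi-tou",
              "mu-tou", "shi-tou", "mu-tou", "shi-tou"]

print(type_based_productivity(men_tokens))  # 6/7: many half-only types
print(type_based_productivity(tou_tokens))  # 0.0: both types recur in both halves
```

A productive suffix keeps producing new types as more text is sampled, so its unseen-type ratio stays high; an unproductive one exhausts its few types early.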
{"title":"Measuring and Comparing the Productivity of Mandarin Chinese Suffixes","authors":"Eiji Nishimoto","doi":"10.30019/IJCLCLP.200302.0003","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200302.0003","url":null,"abstract":"The present study attempts to measure and compare the morphological productivity of five Mandarin Chinese suffixes: the verbal suffix -hua, the plural suffix -men, and the nominal suffixes -r, -zi, and -tou. These suffixes are predicted to differ in their degree of productivity : -hua and -men appear to be productive, being able to systematically form a word with a variety of base words, whereas -zi and -tou (and perhaps also -r) may be limited in productivity. Baayen [1989, 1992] proposes the use of corpus data in measuring productivity in word formation. Based on word-token frequencies in a large corpus of texts, his token-based measure of productivity expresses productivity as the probability that a new word form of an affix will be encountered in a corpus. We first use the token-based measure to examine the productivity of the Mandarin suffixes. The present study, then, proposes a type-based measure of productivity that employs the deleted estimation method [Jelinek & Mercer, 1985] in defining unseen words of a corpus and expresses productivity by the ratio of unseen word types to all word types. The proposed type-based measure yields the productivity ranking “-men, -hua, -r, -zi, -tou,” where -men is the most productive and -tou is the least productive. The effects of corpus-data variability on a productivity measure are also examined. The proposed measure is found to obtain a consistent productivity ranking despite variability in corpus data.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. 
Process.","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129925553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2002-08-01 | DOI: 10.30019/IJCLCLP.200208.0002
Keh-Jiann Chen, Jia-Ming You
There is a need to measure word similarity when processing natural language, especially when using generalization, classification, or example-based approaches. Usually, the similarity between two words is defined by the distance between their semantic classes in a semantic taxonomy. Such taxonomy-based approaches are essentially semantic and do not consider syntactic similarity. In real applications, however, both semantic and syntactic similarities are required and are weighted differently. Word similarity based on context vectors is a mixture of syntactic and semantic similarity. In this paper, we propose using only syntactically related co-occurrences as context vectors, and we adopt information-theoretic models to address the problems of data sparseness and characteristic precision. The probability distribution of co-occurrence context features is derived by parsing the contextual environment of each word, and all context features are weighted according to their IDF (inverse document frequency) values. An agglomerative clustering algorithm is applied to group similar words according to their similarity values. It turns out that words with similar syntactic categories and semantic classes are grouped together.
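The IDF-weighted context-vector idea can be sketched as follows. The co-occurrence counts and document frequencies below are hypothetical toy data, not figures from the paper, and the parsing step is assumed to have already produced the syntactic co-occurrence counts:

```python
import math

def idf_weighted_vectors(cooc, n_docs, df):
    """Weight raw syntactic co-occurrence counts by IDF so that
    features shared with many words contribute less to similarity."""
    return {w: {f: c * math.log(n_docs / df[f]) for f, c in feats.items()}
            for w, feats in cooc.items()}

def cosine(u, v):
    """Cosine similarity between two sparse feature vectors (dicts)."""
    dot = sum(u[f] * v[f] for f in set(u) & set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical syntactically related co-occurrence counts:
cooc = {
    "apple":  {"eat": 4, "red": 2, "the": 9},
    "orange": {"eat": 3, "red": 1, "the": 8},
    "run":    {"fast": 5, "the": 7},
}
df = {"eat": 2, "red": 2, "the": 10, "fast": 1}
vecs = idf_weighted_vectors(cooc, n_docs=10, df=df)
# "the" occurs everywhere, so its IDF is log(10/10) = 0 and it drops out.
print(cosine(vecs["apple"], vecs["orange"]))  # high (about 0.99)
print(cosine(vecs["apple"], vecs["run"]))     # 0.0: no weighted overlap
```

The similarity values computed this way would then feed the agglomerative clustering step, merging the closest word pairs first.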
{"title":"A Study on Word Similarity using Context Vector Models","authors":"Keh-Jiann Chen, Jia-Ming You","doi":"10.30019/IJCLCLP.200208.0002","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200208.0002","url":null,"abstract":"There is a need to measure word similarity when processing natural languages, especially when using generalization, classification, or example-based approaches. Usually, measures of similarity between two words are defined according to the distance between their semantic classes in a semantic taxonomy. The taxonomy approaches are more or less semantic-based that do not consider syntactic similarities. However, in real applications, both semantic and syntactic similarities are required and weighted differently. Word similarity based on context vectors is a mixture of syntactic and semantic similarities. In this paper, we propose using only syntactic related co-occurrences as context vectors and adopt information theoretic models to solve the problems of data sparseness and characteristic precision. The probabilistic distribution of co-occurrence context features is derived by parsing the contextual environment of each word, and all the context features are adjusted according to their IDF (inverse document frequency) values. The agglomerative clustering algorithm is applied to group similar words according to their similarity values. It turns out that words with similar syntactic categories and semantic classes are grouped together.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. 
Process.","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133921487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2001-08-01 | DOI: 10.30019/IJCLCLP.200108.0001
Louis Wei-lun Lu
In this synchronic study, I adopt a corpus-based approach to investigate the semantic change of V-diao in Mandarin. Semantically, V-diao constructions fall into three categories: A) physical disappearance from an original position, with the V slot filled by physical verbs such as tao-diao "escape" and diu-diao "throw away"; B) disappearance from a conceptual domain rather than from physical space, with the V slot filled by less physically perceivable verbs such as jie-diao "quit" and wang-diao "forget"; and C) constructions that involve the speaker's subjective, always negative, attitude toward the result, such as lan-diao "rot," ruan-diao "soften," and huang-diao "yellow." This paper claims that the polysemy between types A and B is motivated by metaphorical transfer [Sweetser, 1990; Bybee, Perkins and Pagliuca, 1994; Heine, Claudi and Hunnemeyer, 1991]. Building roughly on Huang and Chang [1996], I demonstrate that a cognitive restriction on verb selection causes the further repetitive occurrence of negative verbs in the V slot. Finally, I claim that pragmatic strengthening [Hopper and Traugott, 1993; Bybee, Perkins and Pagliuca, 1994] contributes to the emergence of the unfavourable meaning in type C. This research can hopefully serve as a valid argument for the interaction of language use and grammar, and for the conceptual basis of human language.
{"title":"Metaphorical Transfer and Pragmatic Strengthening: On the Development of V-diao in Mandarin","authors":"Louis Wei-lun Lu","doi":"10.30019/IJCLCLP.200108.0001","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200108.0001","url":null,"abstract":"In this synchronic study, I shall adopt a corpus-based approach to investigate the semantic change of V-diao in Mandarin. Semantically, V-diao constructions fall into three categories: A) Physical disappearance from its original position, with the V slot filled by physical verbs, such as tao-diao ”escape,” diu-diao ”throw away,” and so on. B) Disappearance from a certain conceptual domain, rather than from the physical space, with the V slot filled by less physically perceivable verbs, such as jie-diao ”quit,” wang-diao ”forget,” and the like. C) The third category of V-diao involves the speaker's subjective, always negative, attitude toward the result. Examples include: lan-diao ”rot,” ruan-diao ”soften,” huang-diao ”yellow,” and so forth. It is claimed in this paper that the polysemy between types A and B is motivated by metaphorical transfer [Sweetser, 1990; Bybee, Perkins and Pagliuca, 1994; Heine, Claudi and Hunnemeyer, 1991]. Based roughly on Huang and Chang [1996], I demonstrate that a cognitive restriction on selection of the verb will cause further repetitive occurrence of negative verbs in the V slot. Finally, I shall claim that pragmatic strengthening [Hopper and Traugott, 1993; Bybee, Perkins and Pagliuca, 1994] contributes to the emergence of unfavourable meaning in Type C. Hopefully, this research can serve as a valid argument for the interaction of language use and grammar, and the conceptual basis of human language.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. 
Process.","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125216569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2001-02-01 | DOI: 10.30019/IJCLCLP.200102.0002
Jianfeng Gao, Joshua Goodman, J. Miao
Cluster-based n-gram modeling is a variant of normal word-based n-gram modeling that attempts to make use of the similarities between words. In this paper, we present an empirical study of clustering techniques for Asian language modeling. Clustering is used both to improve the performance (i.e., perplexity) of language models and to compress them. Experimental tests are presented for cluster-based trigram models on a Japanese newspaper corpus and on a heterogeneous Chinese corpus. While the majority of previous research on word clustering has focused on how to get the best clusters, we have concentrated on the best way to use the clusters. Experimental results show that some of the novel techniques we present work much better than previous methods, achieving more than a 40% size reduction at the same level of perplexity.
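The compression idea behind cluster-based n-gram models can be illustrated with a class-based bigram: instead of word-word counts, the model stores class-class transitions plus word-within-class emissions, so P(w | prev) is approximated by P(class(w) | class(prev)) * P(w | class(w)). The word-to-class map and sentences below are a hypothetical toy setup, not the paper's clustering output:

```python
from collections import Counter

def train_class_bigram(sentences, word2class):
    """Class-based bigram LM: storing class-class and word-class counts
    instead of word-word counts is what shrinks the model."""
    cc = Counter()      # (prev_class, class) transition counts
    cw = Counter()      # (class, word) emission counts
    c_tot = Counter()   # class totals
    for sent in sentences:
        classes = [word2class[w] for w in sent]
        for w, c in zip(sent, classes):
            cw[c, w] += 1
            c_tot[c] += 1
        for c1, c2 in zip(classes, classes[1:]):
            cc[c1, c2] += 1
    def prob(prev, w):
        c1, c2 = word2class[prev], word2class[w]
        p_cc = cc[c1, c2] / sum(v for (a, _), v in cc.items() if a == c1)
        p_wc = cw[c2, w] / c_tot[c2]
        return p_cc * p_wc
    return prob

word2class = {"the": "DET", "a": "DET", "cat": "N",
              "dog": "N", "runs": "V", "sleeps": "V"}
sents = [["the", "cat", "runs"], ["a", "dog", "sleeps"], ["the", "dog", "runs"]]
p = train_class_bigram(sents, word2class)
print(p("the", "cat"))  # P(N | DET) * P(cat | N) = 1 * 1/3
```

With C classes and V words, the table sizes drop from O(V^2) toward O(C^2 + V), which is the trade-off the size-reduction results exploit.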
{"title":"The Use of Clustering Techniques for Language Modeling-Application to Asian Language","authors":"Jianfeng Gao, Joshua Goodman, J. Miao","doi":"10.30019/IJCLCLP.200102.0002","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200102.0002","url":null,"abstract":"Cluster-based n-gram modeling is a variant of normal word-based n-gram modeling. It attempts to make use of the similarities between words. In this paper, we present an empirical study of clustering techniques for Asian language modeling. Clustering is used to improve the performance (i.e. perplexity) of language models as well as to compress language models. Experimental tests are presented for cluster-based trigram models on a Japanese newspaper corpus and on a Chinese heterogeneous corpus. While the majority of previous research on word clustering has focused on how to get the best clusters, we have concentrated our research on the best way to use the clusters. Experimental results show that some novel techniques we present work much better than previous methods, and achieve more than 40% size reduction at the same level of perplexity.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129963074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2001-02-01 | DOI: 10.30019/IJCLCLP.200102.0003
Min Chu, Yao Qian
This paper proposes a three-tier prosodic hierarchy for Mandarin, comprising prosodic word, intermediate phrase, and intonational phrase tiers, that emphasizes the use of the prosodic word rather than the lexical word as the basic prosodic unit. Both surface and perceptual differences show that this helps achieve high naturalness in text-to-speech conversion. Three approaches are presented for locating the boundaries of the three prosodic constituents in unrestricted Mandarin texts: a basic CART approach, a bottom-up hierarchical approach, and a modified hierarchical approach. Two feature sets are used in the basic CART method, one with syntactic phrasal information and one without; the one with syntactic phrasal information yields about a 1% increase in accuracy and an 11% decrease in error cost. The modified hierarchical method produces the highest accuracy, 83%, and the lowest error cost when no syntactic phrasal information is provided. It shows advantages in detecting intonational-phrase boundaries at locations without breaking punctuation, achieving 71.1% precision and 52.4% recall. Experiments on acceptability reveal that only 26% of the mis-assigned break indices are genuinely infelicitous errors, and that the perceptual difference between automatically assigned and manually annotated break indices is small.
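The precision and recall figures for boundary detection are computed by comparing predicted break positions against manually annotated ones. A minimal sketch, with hypothetical token indices standing in for annotated boundary locations:

```python
def boundary_prf(gold, predicted):
    """Precision and recall for predicted prosodic-boundary positions
    against a manually annotated reference (positions as token indices)."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # boundaries found in both
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical gold vs. predicted intonational-phrase boundaries:
gold_breaks = {4, 9, 15, 22}
pred_breaks = {4, 9, 17}
print(boundary_prf(gold_breaks, pred_breaks))  # (2/3, 0.5)
```

Precision penalizes spurious breaks (which disrupt naturalness) while recall penalizes missed ones, which is why both are reported for the punctuation-free locations.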
{"title":"Locating Boundaries for Prosodic Constituents in Unrestricted Mandarin Texts","authors":"Min Chu, Yao Qian","doi":"10.30019/IJCLCLP.200102.0003","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200102.0003","url":null,"abstract":"This paper proposes a three-tier prosodic hierarchy, including prosodic word, intermediate phrase and intonational phrase tiers, for Mandarin that emphasizes the use of the prosodic word instead of the lexical word as the basic prosodic unit. Both the surface difference and perceptual difference show that this is helpful for achieving high naturalness in text-to-speech conversion. Three approaches, the basic CART approach, the bottom-up hierarchical approach and the modified hierarchical approach, are presented for locating the boundaries of three prosodic constituents in unrestricted Mandarin texts. Two sets of features are used in the basic CART method: one contains syntactic phrasal information and the other does not. The one with syntactic phrasal information results in about a 1% increase in accuracy and an 11% decrease in error-cost. The performance of the modified hierarchical method produces the highest accuracy, 83%, and lowest error cost when no syntactic phrasal information is provided. It shows advantages in detecting the boundaries of intonational phrases at locations without breaking punctuation. 71.1% precision and 52.4% recall are achieved. Experiments on acceptability reveal that only 26% of the mis-assigned break indices are real infelicitous errors, and that the perceptual difference between the automatically assigned break indices and the manually annotated break indices are small.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. 
Process.","volume":"147 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131593557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2000-08-01 | DOI: 10.30019/IJCLCLP.200008.0003
M. Hasan, Yuji Matsumoto
Electronically available multilingual information can be divided into two major categories: (1) alphabetic-language information (English-like alphabetic languages) and (2) ideographic-language information (Chinese-like ideographic languages). The information available in non-English alphabetic languages and in ideographic languages (especially Japanese and Chinese) has been growing at an incredibly high rate in recent years. Due to the ideographic nature of Japanese and Chinese, complicated by the several encoding standards in use, efficient processing (representation, indexing, retrieval, etc.) of such information becomes a tedious task. In this paper, we propose a Han character (Kanji) oriented Interlingua model for indexing and retrieving Japanese and Chinese information. We report the results of mono- and cross-language information retrieval in a Kanji space where documents and queries are represented as Kanji-oriented vectors. We also employ a dimensionality reduction technique to compute a Kanji Conceptual Space (KCS) from the initial Kanji space, which can facilitate conceptual retrieval of both mono- and cross-language information for these languages. Similar indexing approaches for multiple European languages, through term association (e.g., latent semantic indexing) or through conceptual mapping (using a lexical ontology such as WordNet), are being intensively explored. The Interlingua approach investigated here for Japanese and Chinese and the term (or concept) association models investigated for European languages are similar, and these approaches can easily be integrated. Therefore, the proposed Interlingua model can pave the way for handling multilingual information access and retrieval efficiently and uniformly.
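The KCS construction is in the spirit of latent semantic indexing: a document-by-Kanji count matrix is factored with a truncated SVD, and documents (and queries) are compared in the reduced space. The matrix below is hypothetical toy data, and the specific reduction used by the paper may differ; this sketch just shows the LSI-style mechanics:

```python
import numpy as np

def kanji_conceptual_space(doc_kanji_matrix, k):
    """Project a document-by-Kanji count matrix onto k latent
    dimensions via truncated SVD, yielding a 'conceptual' space in
    which documents and queries from either language are comparable."""
    U, S, Vt = np.linalg.svd(doc_kanji_matrix, full_matrices=False)
    return U[:, :k] * S[:k]  # document coordinates in the reduced space

# Hypothetical shared-Han-character counts for 4 documents x 4 characters:
X = np.array([[3, 0, 1, 0],
              [2, 0, 1, 0],
              [0, 4, 0, 2],
              [0, 3, 0, 1]], dtype=float)
docs_2d = kanji_conceptual_space(X, k=2)
print(docs_2d.shape)  # (4, 2): each document as a 2-d conceptual vector
```

Documents 0 and 1 (which share characters) end up nearly collinear in the reduced space, while documents drawing on disjoint character sets stay orthogonal, which is what enables conceptual rather than literal matching.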
{"title":"Japanese-Chinese Cross-Language Information Retrieval: An Interlingua Apporach","authors":"M. Hasan, Yuji Matsumoto","doi":"10.30019/IJCLCLP.200008.0003","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200008.0003","url":null,"abstract":"Electronically available multilingual information can be divided into two major categories: (1) alphabetic language information (English-like alphabetic languages) and (2) ideographic language information (Chinese-like ideographic languages). The information available in non-English alphabetic languages as well as in ideographic languages (especially, in Japanese and Chinese) is growing at an incredibly high rate in recent years. Due to the ideographic nature of Japanese and Chinese, complicated with the existence of several encoding standards in use, efficient processing (representation, indexing, retrieval, etc.) of such information became a tedious task. In this paper, we propose a Han Character (Kanji) oriented Interlingua model of indexing and retrieving Japanese and Chinese information. We report the results of mono- and cross- language information retrieval on a Kanji space where documents and queries are represented in terms of Kanji oriented vectors. We also employ a dimensionality reduction technique to compute a Kanji Conceptual Space (KCS) from the initial Kanji space, which can facilitate conceptual retrieval of both mono- and cross- language information for these languages. Similar indexing approaches for multiple European languages through term association (e.g., latent semantic indexing) or through conceptual mapping (using lexical ontology such as, WordNet) are being intensively explored. The Interlingua approach investigated here with Japanese and Chinese languages, and the term (or concept) association model investigated with the European languages are similar; and these approaches can be easily integrated. 
Therefore, the proposed Interlingua model can pave the way for handling multilingual information access and retrieval efficiently and uniformly.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123728241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2000-08-01 | DOI: 10.30019/IJCLCLP.200008.0004
R. H. Shih
This paper presents the mechanisms of and criteria for compiling a new learner corpus of English, the quantitative characteristics of the corpus, and a practical example of its pedagogical application. The Taiwanese Learner Corpus of English (TLCE), probably the largest annotated learner corpus of English in Taiwan so far, contains 2,105 pieces of English writing (around 730,000 words) by Taiwanese college students majoring in English. It is a useful resource for scholars in Second Language Acquisition (SLA) and English Language Teaching (ELT) who wish to find out how people in Taiwan learn English and how to help them learn better. The quantitative information presented here reflects the characteristics of learner English in terms of part-of-speech distribution, lexical density, and trigram distribution. The usefulness of the corpus is demonstrated by means of a corpus-based investigation of learners' lack of adverbial collocation knowledge.
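Of the quantitative measures mentioned, lexical density is the simplest to compute from a POS-tagged corpus: the proportion of content words among all tokens. A minimal sketch on a hypothetical tagged learner sentence (the tag set here is an assumption, not the corpus's actual annotation scheme):

```python
def lexical_density(tagged_tokens,
                    content_tags=frozenset({"NOUN", "VERB", "ADJ", "ADV"})):
    """Lexical density: fraction of content-word tokens among all tokens."""
    content = sum(1 for _, tag in tagged_tokens if tag in content_tags)
    return content / len(tagged_tokens)

# Hypothetical POS-tagged learner sentence:
sent = [("I", "PRON"), ("really", "ADV"), ("enjoy", "VERB"),
        ("learning", "VERB"), ("English", "NOUN"), (".", "PUNCT")]
print(lexical_density(sent))  # 4/6: four content words out of six tokens
```

Comparing this figure against native-speaker corpora is one way such a measure characterizes learner English.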
{"title":"Compiling Taiwanese Learner Corpus of English","authors":"R. H. Shih","doi":"10.30019/IJCLCLP.200008.0004","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200008.0004","url":null,"abstract":"This paper presents the mechanisms of and criteria for compiling a new learner corpus of English, the quantitative characteristics of the corpus and a practical example of its pedagogical application. The Taiwanese Learner Corpus of English (TLCE), probably the largest annotated learner corpus of English in Taiwan so far, contains 2105 pieces of English writing (around 730,000 words) from Taiwanese college students majoring in English. It is a useful resource for scholars in Second Language Acquisition (SLA) and English Language Teaching (ELT) areas who wish to find out how people in Taiwan learn English and how to help them learn better. The quantitative information shown in the work reflects the characteristics of learner English in terms of part-of-speech distribution, lexical density, and trigram distribution. The usefulness of the corpus is demonstrated by a means of corpus-based investigation of learners' lack of adverbial collocation knowledge.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121589574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2000-08-01 | DOI: 10.30019/IJCLCLP.200008.0002
Jyh-Jong Tsay, Jing-doo Wang
In this paper, we propose and evaluate approaches to categorizing Chinese texts that consist of term extraction, term selection, term clustering, and text classification. We propose a scalable approach that uses frequency counts to identify the left and right boundaries of possibly significant terms, and we use the combination of term selection and term clustering to reduce the dimension of the vector space to a practical level. While the huge number of possible Chinese terms makes most machine learning algorithms impractical, results obtained in an experiment on a CAN news collection show that our approach could dramatically reduce the dimension to 1,200 while maintaining approximately the same level of classification accuracy. We also studied and compared the performance of three well-known classifiers, the Rocchio linear classifier, the naive Bayes probabilistic classifier, and the k-nearest-neighbors (kNN) classifier, when applied to categorizing Chinese texts. Overall, kNN achieved the best accuracy, about 78.3%, but required large amounts of computation time and memory when classifying new texts. Rocchio was very time- and memory-efficient and achieved a high level of accuracy, about 75.4%. In practical implementations, Rocchio may be a good choice.
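The efficiency of Rocchio comes from reducing each class to a single centroid at training time, so classification only compares the new text against one vector per class. A minimal sketch with hypothetical segmented "documents" as term lists (raw counts here; the paper's actual term weighting may differ):

```python
import math
from collections import Counter, defaultdict

def train_rocchio(labeled_docs):
    """Rocchio linear classifier: each class is the centroid of its
    training-document term vectors; a new text gets the class whose
    centroid has the highest cosine similarity."""
    sums, counts = defaultdict(Counter), Counter()
    for terms, label in labeled_docs:
        sums[label].update(terms)
        counts[label] += 1
    centroids = {lab: {t: c / counts[lab] for t, c in vec.items()}
                 for lab, vec in sums.items()}
    def classify(terms):
        v = Counter(terms)
        nv = math.sqrt(sum(x * x for x in v.values()))
        def cos(u):
            dot = sum(v[t] * u.get(t, 0.0) for t in v)
            nu = math.sqrt(sum(x * x for x in u.values()))
            return dot / (nu * nv) if nu and nv else 0.0
        return max(centroids, key=lambda lab: cos(centroids[lab]))
    return classify

# Hypothetical training documents (already segmented into terms):
train = [(["stock", "market", "price"], "finance"),
         (["market", "trade", "price"], "finance"),
         (["game", "team", "score"], "sports"),
         (["team", "win", "score"], "sports")]
classify = train_rocchio(train)
print(classify(["price", "market"]))  # finance
```

By contrast, kNN must keep every training vector and score the new text against all of them, which explains the time and memory cost the abstract reports.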
{"title":"Design and Evaluation of Approaches for Automatic Chinese Text","authors":"Jyh-Jong Tsay, Jing-doo Wang","doi":"10.30019/IJCLCLP.200008.0002","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200008.0002","url":null,"abstract":"In this paper, we propose and evaluate approaches to categorizing Chinese texts, which consist of term extraction, term selection, term clustering and text classification. We propose a scalable approach which uses frequency counts to identify left and right boundaries of possibly significant terms. We used the combination of term selection and term clustering to reduce the dimension of the vector space to a practical level. While the huge number of possible Chinese terms makes most of the machine learning algorithms impractical, results obtained in an experiment on a CAN news collection show that the dimension could be dramatically reduced to 1200 while approximately the same level of classification accuracy was maintained using our approach. We also studied and compared the performance of three well known classifiers, the Rocchio linear classifier, naive Bayes probabilistic classifier and k-nearest neighbors (kNN) classifier, when they were applied to categorize Chinese texts. Overall, kNN achieved the best accuracy, about 78.3%, but required large amounts of computation time and memory when used to classify new texts. Rocchio was very time and memory efficient, and achieved a high level of accuracy, about 75.4%. In practical implementation, Rocchio may be a good choice.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. 
Process.","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125729162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2000-08-01 | DOI: 10.30019/IJCLCLP.200008.0001
Jen-Nan Chen
This paper describes a general framework for adaptive conceptual word sense disambiguation. The proposed system begins with knowledge acquisition from machine-readable dictionaries. Central to the approach is an adaptive step that enriches the initial knowledge base with knowledge gleaned from the partially disambiguated text. Once the knowledge base has been adjusted to suit the text at hand, it is applied to the text again to finalize the disambiguation decision. Definitions and example sentences from the Longman Dictionary of Contemporary English are employed as training material for word sense disambiguation, while passages from the Brown corpus and Wall Street Journal (WSJ) articles are used for testing. An experiment showed that adaptation significantly improved the success rate: for thirteen highly ambiguous words, the proposed method disambiguated with an average precision of 70.5% on the Brown corpus and 77.3% on the WSJ articles.
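The dictionary-based starting point of such a system can be illustrated with a simplified Lesk-style overlap between a word's context and each sense definition. This is a generic illustration of the idea, not the paper's full adaptive procedure, and the definitions below are hypothetical paraphrases rather than actual LDOCE entries:

```python
def lesk_overlap(context_words, sense_definitions):
    """Pick the sense whose definition shares the most words with the
    surrounding context: a dictionary-based first pass of the kind an
    adaptive framework could then refine."""
    context = set(w.lower() for w in context_words)
    scores = {sense: len(context & set(defn.lower().split()))
              for sense, defn in sense_definitions.items()}
    return max(scores, key=scores.get)

# Hypothetical dictionary-style definitions for "bank":
senses = {
    "bank/finance": "an organization where people keep their money",
    "bank/river": "land along the side of a river",
}
print(lesk_overlap(["deposit", "money", "account"], senses))  # bank/finance
```

The adaptive step described in the abstract would then feed confidently disambiguated contexts back into the knowledge base before a second pass over the text.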
{"title":"Adaptive Word Sense Disambiguation Using Lexical Knowledge in Machine-readable Dictionary","authors":"Jen-Nan Chen","doi":"10.30019/IJCLCLP.200008.0001","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200008.0001","url":null,"abstract":"This paper describes a general framework for adaptive conceptual word sense disambiguation. The proposed system begins with knowledge acquisition from machine-readable dictionaries. Central to the approach is the adaptive step that enriches the initial knowledge base with knowledge gleaned from the partial disambiguated text. Once the knowledge base is adjusted to suit the text at hand, it is applied to the text again to finalize the disambiguation decision. Definitions and example sentences from the Longman Dictionary of Contemporary English are employed as training materials for word sense disambiguation, while passages from the Brown corpus and Wall Street Journal (WSJ) articles are used for testing. An experiment showed that adaptation did significantly improve the success rate. For thirteen highly ambiguous words, the proposed method disambiguated with an average precision rate of 70.5% for the Brown corpus and 77.3% for the WSJ articles.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116544662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2000-02-01 | DOI: 10.30019/IJCLCLP.200002.0005
Mei-Chun Liu, Chu-Ren Huang, Charles Lee, Ching-Yi Lee
Since verbal semantics began to receive attention in linguistics research, many interesting findings have been presented regarding semantic structure and meaning contrasts in the Chinese lexicon [cf. Tsai, Huang & Chen, 1996; Tsai et al., 1997; Liu, 1999]. Adopting a corpus-based approach, this paper aims to further study and fine-tune Mandarin verbal semantics by exploring the lexical information specific to verbs of throwing, with four pivotal near-synonymous members: TOU (投), ZHI (擲), DIU (丟), and RENG (扔). To account for their semantic differences, two kinds of 'endpoint' are distinguished: the Path-endpoint (i.e., the Goal role) vs. the Event-endpoint (i.e., the resultative state). These two variables are crucial for cross-categorizing the four verbs. Although all four verbs describe a directed motion with a Path in their event structure, they differ in their lexical specifications of participant roles and aspectual composition. TOU and ZHI have a specified Path-endpoint, while DIU and RENG do not. Moreover, TOU and ZHI contrast in the spatial character of the Path-endpoint they take: TOU selects a spatially bounded Path-endpoint, whereas ZHI is unspecified in this regard, as manifested by the fact that TOU collocates most frequently with a CONTAINER-introducing locative. DIU and RENG, in turn, can be differentiated in terms of event composition: only DIU, not RENG, allows an aspectual focus on the endpoint of the event contour (the Event-endpoint), since it manifests a resultative use. The observed distinctions are then incorporated into a representational paradigm, the Module-Attribute Representation of Verbal Semantics (MARVS), proposed in Huang & Ahrens [1999]. Finally, conclusions are drawn as to the most effective approach to the lexical semantic study of Mandarin, as well as theoretical implications in general.
{"title":"When Endpoint Meets Endpoint: A Corpus-based Lexical Semantic Study of Mandarin Verbs of Throwing","authors":"Mei-Chun Liu, Chu-Ren Huang, Charles Lee, Ching-Yi Lee","doi":"10.30019/IJCLCLP.200002.0005","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200002.0005","url":null,"abstract":"Since verbal semantics began to receive much attention in linguistic research, many interesting findings have been presented regarding the semantic structure or meaning contrasts in the lexicon of Chinese [cf. Tsai, Huang & Chen, 1996; Tsai et al., 1997; Liu, 1999, etc.]. Adopting a corpus-based approach, this paper aims to further study and fine-tune Mandarin verbal semantics by exploring the lexical information specific to verbs of throwing, with four pivotal near-synonymous members: TOU(投), ZHI(擲), DIU(丟), RENG(扔). To account for their semantic differences, two kinds of 'endpoints' are distinguished: the Path-endpoint (i.e., the Goal role) vs. the Event-endpoint (i.e., the resultative state). These two variables are crucial for cross-categorizing the four verbs. Although the verbs all describe a directed motion with a Path in their event structure, they differ in their lexical specifications on participant roles and aspectual composition. TOU and ZHI have a specified Path-endpoint while DIU and RENG do not specify a Path-endpoint. Moreover, TOU and ZHI can be further contrasted in terms of the spatial character of the Path-endpoint they take: TOU selects a spatially bounded Path-endpoint while that of ZHI is unspecified in this regard, as manifested by the fact that TOU collocates most frequently with a CONTAINER-introducing locative. On the other hand, DIU and RENG can be further differentiated in terms of event composition: only DIU, not RENG, allows an aspectual focus on the endpoint of the event contour (the Event-endpoint) since it manifests a resultative use. 
The observed distinctions are then incorporated into a representational paradigm called the Module-Attribute Representation of Verbal Semantics (MARVS), proposed in Huang & Ahrens [1999]. Finally, conclusions are drawn as to the most effective approach to lexical semantic study of Mandarin as well as theoretical implications in general.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"131 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132250471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}