An Agent-based Approach to Chinese Named Entity Recognition
Shiren Ye, Tat-Seng Chua, Jimin Liu
Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002). DOI: 10.3115/1072228.1072308

Chinese NE (Named Entity) recognition is a difficult problem because of the uncertainty in word segmentation and the flexibility of the language structure. This paper proposes the use of a rationality model in a multi-agent framework to tackle this problem. We employ a greedy strategy and use the NE rationality model to evaluate and detect all possible NEs in the text. We then treat the selection of the best possible NEs as a multi-agent negotiation problem. The resulting system is robust and handles different types of NE effectively. Our test on the MET-2 test corpus indicates that the system achieves F1 values above 92% on all NE types.
Morphological Analysis of the Spontaneous Speech Corpus
Kiyotaka Uchimoto, Chikashi Nobata, Atsushi Yamada, S. Sekine, H. Isahara
Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002). DOI: 10.3115/1071884.1071903

This paper describes a project for tagging a spontaneous speech corpus with morphological information such as word segmentation and parts of speech. We use a morphological analysis system based on a maximum entropy model, which is independent of the domain of the corpora. We report the tagging accuracy achieved with this model and discuss problems in tagging the spontaneous speech corpus. We also show that a dictionary developed for a corpus in one domain is helpful for improving accuracy when analyzing a corpus in another domain.
A Robust Cross-Style Bilingual Sentences Alignment Model
T. Kueng, Keh-Yih Su
Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002). DOI: 10.3115/1072228.1072237

Most current sentence alignment approaches adopt sentence length and cognates as the alignment features, and they are mostly trained and tested on documents of the same style. Since the length distribution, the alignment-type distribution (used by length-based approaches) and the cognate frequency vary significantly across texts of different styles, length-based approaches fail to achieve comparable performance when tested on corpora of other styles. Our experiments show that the F-measure can drop from 98.2% to 85.6% when a length-based approach is trained on a technical manual and then tested on a general magazine. Since a large percentage of content words in the source text are translated into their corresponding translation duals in the target text to preserve the meaning, transfer lexicons are usually regarded as more reliable cues for aligning sentences when the alignment task is performed by humans. To enhance robustness, a statistical model based on both transfer lexicons and sentence lengths is proposed in this paper. After integrating the transfer lexicons into the model, a 60% F-measure error reduction (from 14.4% to 5.8%) is observed.
Natural Language and Inference in a Computer Game
Malte Gabsdil, Alexander Koller, Kristina Striegnitz
Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002). DOI: 10.3115/1072228.1072341

We present an engine for text adventures - computer games with which the player interacts using natural language. The system employs current methods from computational linguistics and an efficient inference system for description logic to make the interaction more natural. The inference system is especially useful in the linguistic modules dealing with reference resolution and generation, and we show how we use it to rank different readings in the case of referential and syntactic ambiguities. It turns out that the player's utterances are naturally restricted in the game scenario, which simplifies the language processing task.
Structure Alignment Using Bilingual Chunking
Wei Wang, M. Zhou, Jin-Xia Huang, C. Huang
Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002). DOI: 10.3115/1072228.1072238

A new statistical method called "bilingual chunking" for structure alignment is proposed. Unlike existing approaches, which align hierarchical structures such as sub-trees, our method performs alignment on chunks. The alignment is produced by a simultaneous bilingual chunking algorithm. Using the constraints of chunk correspondence between the source language (SL) and the target language (TL), our algorithm can dramatically reduce the search space, support a time-synchronous DP algorithm, and lead to highly consistent chunking. Furthermore, by unifying POS tagging and chunking in the search process, our algorithm effectively alleviates the influence of POS tagging errors on the chunking result. Experimental results on English-Chinese structure alignment show that our model achieves 90% precision for chunking and 87% precision for chunk alignment.
Meta-evaluation of Summaries in a Cross-lingual Environment using Content-based Metrics
Horacio Saggion, Dragomir R. Radev, Simone Teufel, Wai Lam
Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002). DOI: 10.3115/1072228.1072301

We describe a framework for the evaluation of summaries in English and Chinese using similarity measures. The framework can be used to evaluate extractive, non-extractive, single-document and multi-document summarization. We focus on the resources developed, which are made available to the research community.
Looking for Candidate Translational Equivalents in Specialized, Comparable Corpora
Yun-Chuang Chiao, Pierre Zweigenbaum
Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002). DOI: 10.3115/1071884.1071904

Previous attempts at identifying translational equivalents in comparable corpora have dealt with very large 'general language' corpora and words. We address this task in a specialized domain, medicine, starting from smaller non-parallel, comparable corpora and an initial bilingual medical lexicon. We compare the distributional contexts of source and target words, testing several weighting factors and similarity measures. On a test set of frequently occurring words, for the best combination (the Jaccard similarity measure with or without tf.idf weighting), the correct translation is ranked first for 20% of our test words, and is found in the top 10 candidates for 50% of them. An additional reverse-translation filtering step improves the precision of the top candidate translation up to 74%, with a 33% recall.
Learning Verb Argument Structure from Minimally Annotated Corpora
Anoop Sarkar, Woottiporn Tripasai
Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002). DOI: 10.3115/1072228.1072268

In this paper we investigate the task of automatically identifying the correct argument structure for a set of verbs. The argument structure of a verb allows us to predict the relationship between the syntactic arguments of a verb and their role in the verb's underlying lexical semantics. Following the method described in (Merlo and Stevenson, 2001), we exploit the distributions of selected features from the local context of a verb. These features were extracted from a 23M-word WSJ corpus based on part-of-speech tags and phrasal chunks alone. We constructed several decision tree classifiers trained on this data; the best-performing classifier achieved an error rate of 33.4%. This work also shows that a subcategorization frame (SF) learning algorithm previously applied to Czech (Sarkar and Zeman, 2000) can be used to extract SFs in English, and the extracted SFs are evaluated by classifying verbs into verb alternation classes.
Syntactic Features for High Precision Word Sense Disambiguation
David Martínez, Eneko Agirre, Lluís Màrquez i Villodre
Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002). DOI: 10.3115/1072228.1072340

This paper explores the contribution of a broad range of syntactic features to WSD: grammatical relations coded as the presence of adjuncts/arguments in isolation or as subcategorization frames, and instantiated grammatical relations between words. We have tested the performance of syntactic features using two different ML algorithms (Decision Lists and AdaBoost) on the Senseval-2 data. Adding syntactic features to a basic set of traditional features improves performance, especially for AdaBoost. In addition, several methods for building arbitrarily high-accuracy WSD systems are also tried, showing that syntactic features allow for 86% precision at 26% coverage, or 95% precision at 8% coverage.