
Latest publications in Int. J. Comput. Linguistics Chin. Lang. Process.

What Can Near Synonyms Tell Us
Pub Date : 2000-02-01 DOI: 10.30019/IJCLCLP.200002.0003
Lian-Cheng Chief, Chu-Ren Huang, Keh-Jiann Chen, Mei-Chih Tsai, Li-Li Chang
This study examines the near-synonym pair fangbian and bianli, both meaning 'to be convenient,' and extracts the contrasts that dictate their semantic and associated syntactic behaviors. Corpus data reveal important distributional differences between these synonyms that are not readily apparent to native-speaker intuition. In particular, we argue that this synonym pair can be accounted for with a lexical conceptual profile. This study demonstrates how corpus data can serve as a useful tool for probing the interaction between syntax and semantics.
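To make the kind of distributional contrast described above concrete, here is a minimal Python sketch that compares the syntactic-frame profiles of two verbs. The frame labels and counts are invented for illustration and are not taken from the paper.

```python
from collections import Counter

# Hypothetical syntactic-frame counts for each verb, of the kind one might tally
# from a tagged corpus; the labels and numbers are invented, not the paper's data.
fangbian = Counter({"predicative": 180, "transitive": 12, "nominalized": 30})
bianli = Counter({"predicative": 25, "transitive": 4, "nominalized": 160})

def frame_profile(counts):
    """Return each frame's share of a verb's total occurrences."""
    total = sum(counts.values())
    return {frame: n / total for frame, n in counts.items()}

for name, counts in (("fangbian", fangbian), ("bianli", bianli)):
    profile = frame_profile(counts)
    summary = ", ".join(f"{frame}: {share:.0%}" for frame, share in sorted(profile.items()))
    print(f"{name:10s} {summary}")
```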
Citations: 33
The Module-Attribute Representation of Verbal Semantics: From Semantic to Argument Structure
Pub Date : 2000-02-01 DOI: 10.30019/IJCLCLP.200002.0002
Chu-Ren Huang, K. Ahrens, Li-Li Chang, Keh-Jiann Chen, Mei-Chun Liu, Mei-Chih Tsai
In this paper, we set forth a theory of lexical knowledge. We propose two types of modules, event structure modules and role modules, as well as two sets of attributes, event-internal attributes and role-internal attributes, which are linked to the event structure module and the role module, respectively. These module-attribute semantic representations have associated grammatical consequences. Our data are drawn from a comprehensive corpus-based study of Mandarin Chinese verbal semantics, and four case studies are presented.
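As a rough illustration only, the module-attribute organization described above could be encoded as a pair of data structures, one per module type, each carrying its own attribute set. The field names and the sample entry below are assumptions made for this sketch, not the authors' actual representation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical encoding of the two module types and their attribute sets; the field
# names and the sample entry are assumptions for this sketch, not the authors' notation.
@dataclass
class EventModule:
    event_type: str                                  # e.g. "process", "state", "transition"
    internal_attributes: Dict[str, str] = field(default_factory=dict)

@dataclass
class RoleModule:
    roles: List[str] = field(default_factory=list)   # e.g. ["agent", "theme"]
    internal_attributes: Dict[str, str] = field(default_factory=dict)

@dataclass
class VerbEntry:
    lemma: str
    event: EventModule
    role: RoleModule

# A toy lexical entry in this style:
pao = VerbEntry(
    lemma="pao 'to run'",
    event=EventModule("process", {"bounded": "no"}),
    role=RoleModule(["agent"], {"agent": "volitional"}),
)
print(pao)
```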
Citations: 45
A Model for Word Sense Disambiguation
Pub Date : 1999-08-01 DOI: 10.30019/IJCLCLP.199908.0001
Juan-Zi Li, C. Huang
Word sense disambiguation is one of the most difficult problems in natural language processing. This paper puts forward a model that maps the structural semantic space of a thesaurus into a multi-dimensional, real-valued vector space and gives a word sense disambiguation method based on this mapping. The model, which uses an unsupervised learning method to acquire the disambiguation knowledge, not only saves extensive manual work but also realizes sense tagging of a large number of content words. First, the Chinese thesaurus Cilin and a very large-scale corpus are used to construct the structure of the semantic space. Then, a dynamic disambiguation model is developed to disambiguate an ambiguous word according to the vectors of the monosemous words in each of its possible categories. To resolve the problem of data sparseness, a method is proposed to make the model more robust. Testing results show that the model performs relatively well and can also be used for other languages.
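A minimal sketch of the disambiguation step as the abstract describes it: each candidate category is represented by the centroid of its monosemous words' vectors, and the category whose centroid lies closest to the context vector is chosen. The words, categories, and vectors below are invented; the paper's actual space is built from Cilin and a large corpus.

```python
import numpy as np

# Toy stand-ins for the real-valued semantic space built from the thesaurus and corpus;
# the words, categories, and numbers are invented for illustration.
vectors = {
    "loan":    np.array([0.9, 0.1]),   # monosemous words in the "finance" category
    "deposit": np.array([0.8, 0.2]),
    "shore":   np.array([0.1, 0.9]),   # monosemous words in the "terrain" category
    "river":   np.array([0.2, 0.8]),
}
categories = {"finance": ["loan", "deposit"], "terrain": ["shore", "river"]}

def centroid(words):
    return np.mean([vectors[w] for w in words], axis=0)

def disambiguate(context_vector):
    """Pick the category whose monosemous-word centroid lies closest to the context vector."""
    return min(categories,
               key=lambda c: np.linalg.norm(centroid(categories[c]) - context_vector))

print(disambiguate(np.array([0.15, 0.85])))   # -> "terrain"
```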
Citations: 6
Statistical Analysis of Mandarin Acoustic Units and Automatic Extraction of Phonetically Rich Sentences Based Upon a very Large Chinese Text Corpus
Pub Date : 1998-08-01 DOI: 10.30019/IJCLCLP.199808.0005
H. Wang
Automatic speech recognition can provide humans with the most convenient way to communicate with computers. Because the Chinese language is not alphabetic and inputting Chinese characters into computers is very difficult, Mandarin speech recognition is highly desirable. Recently, high-performance speech recognition systems have begun to emerge from research institutes. However, an adequate speech database for training acoustic models and evaluating performance is critical for successful deployment of such systems in realistic operating environments. Thus, designing a set of phonetically rich sentences for efficiently training and evaluating a speech recognition system has become very important. This paper first presents a statistical analysis of various Mandarin acoustic units based upon a very large Chinese text corpus collected from daily newspapers and then presents an algorithm for automatically extracting phonetically rich sentences from the text corpus to be used in training and evaluating a Mandarin speech recognition system.
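The extraction task lends itself to a greedy selection loop: repeatedly add the sentence that covers the most acoustic units not yet covered. The sketch below illustrates that general idea with an invented syllable inventory; it is not claimed to be the paper's algorithm.

```python
# Invented syllable inventories for three candidate sentences; the real corpus is
# far larger and the paper's selection criteria are richer than plain coverage.
sentences = {
    "s1": {"ba1", "ma1", "shi4", "de5"},
    "s2": {"ba1", "zhong1", "guo2"},
    "s3": {"shi4", "jie4", "ni3", "hao3"},
}

def greedy_select(sentences, target_size):
    """Repeatedly pick the sentence that adds the most not-yet-covered units."""
    covered, chosen = set(), []
    while len(chosen) < target_size:
        best = max(sentences, key=lambda s: len(sentences[s] - covered))
        if not sentences[best] - covered:
            break  # nothing new left to cover
        chosen.append(best)
        covered |= sentences[best]
    return chosen, covered

chosen, covered = greedy_select(sentences, target_size=2)
print(chosen, sorted(covered))
```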
Citations: 9
White Page Construction from Web Pages for Finding People on the Internet
Pub Date : 1998-02-01 DOI: 10.30019/IJCLCLP.199802.0005
Hsin-Hsi Chen, Guo-Wei Bian
This paper proposes a method to automatically extract proper names and their associated information from web pages for Internet/Intranet users. The information extracted from World Wide Web documents includes proper nouns, e-mail addresses and home page URLs. Natural language processing techniques are employed to identify and classify proper nouns, which are usually unknown words. The information (i.e., home page URLs or e-mail addresses) for those proper nouns appearing in the anchor parts can be easily extracted using the associated anchor tags. For proper nouns in the non-anchor part of a web page, different kinds of clues, such as the spelling method, the adjacency principle and HTML tags, are used to relate proper nouns to their corresponding e-mail addresses and/or URLs. Based on the semantics of content and HTML tags, the extracted information is more accurate than the results obtained using traditional search engines. The results can be used to construct white pages for Internet/Intranet users or to build databases for finding people and organizations on the Internet. Such search services are very useful for human communication and the dissemination of information.
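A toy sketch of the anchor-based case described above: names appearing in anchor text are paired with the anchor's mailto: or home-page href. The HTML snippet and the pairing rule are simplified assumptions for illustration, not the paper's system.

```python
from html.parser import HTMLParser

# Names that appear inside anchor text are paired with the anchor's href; the HTML
# snippet and the pairing rule are simplified assumptions, not the paper's system.
class AnchorExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.href = None
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")

    def handle_data(self, data):
        if self.href and data.strip():
            self.pairs.append((data.strip(), self.href))

    def handle_endtag(self, tag):
        if tag == "a":
            self.href = None

html = ('<p>Contact <a href="mailto:hh_chen@example.edu">Hsin-Hsi Chen</a> or visit '
        '<a href="http://nlg.csie.example.edu/">NLG Lab</a>.</p>')
parser = AnchorExtractor()
parser.feed(html)
for name, href in parser.pairs:
    kind = "e-mail" if href.startswith("mailto:") else "homepage"
    print(f"{name}: {kind} -> {href}")
```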
Citations: 9
Building a Bracketed Corpus Using Φ2 Statistics
Pub Date : 1997-08-01 DOI: 10.30019/IJCLCLP.199708.0001
Yue-Shi Lee, Hsin-Hsi Chen
Research based on treebanks is ongoing for many natural language applications. However, the work involved in building a large-scale treebank is laborious and time-consuming. Thus, speeding up the process of building a treebank has become an important task. This paper proposes two versions of probabilistic chunkers to aid the development of a bracketed corpus. The basic version partitions part-of-speech sequences into chunk sequences, which form a partially bracketed corpus. Applying the chunking action recursively, the recursive version generates a fully bracketed corpus. Rather than using a treebank as a training corpus, a corpus, which is tagged with part-of-speech information only, is used. The experimental results show that the probabilistic chunker has a correct rate of more than 94% in producing a partially bracketed corpus and also gives very encouraging results in generating a fully bracketed corpus. These two versions of chunkers are simple but effective and can also be applied to many natural language applications.
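To show how a Φ2-style association score can drive chunking, the sketch below computes φ² for adjacent part-of-speech tags from bigram counts and inserts a chunk boundary wherever the association falls below a threshold. The tagged corpus, the threshold, and the boundary rule are invented for illustration and are not the paper's actual chunker.

```python
from collections import Counter

# Invented tagged corpus; in practice the counts come from a large POS-tagged corpus.
tagged_corpus = [
    ["Det", "N", "V", "Det", "Adj", "N"],
    ["Det", "Adj", "N", "P", "Det", "N"],
    ["N", "Adv", "V", "Det", "N"],
]

bigrams = Counter()
for sent in tagged_corpus:
    bigrams.update(zip(sent, sent[1:]))
total = sum(bigrams.values())

def phi2(x, y):
    """Phi-squared association between tag x and a following tag y."""
    a = bigrams[(x, y)]
    b = sum(n for (l, r), n in bigrams.items() if l == x) - a
    c = sum(n for (l, r), n in bigrams.items() if r == y) - a
    d = total - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return (a * d - b * c) ** 2 / denom if denom else 0.0

def chunk(sent, threshold=0.1):
    chunks, current = [], [sent[0]]
    for left, right in zip(sent, sent[1:]):
        if phi2(left, right) >= threshold:
            current.append(right)        # strong association: stay in the same chunk
        else:
            chunks.append(current)       # weak association: start a new chunk
            current = [right]
    chunks.append(current)
    return chunks

print(chunk(["Det", "Adj", "N", "V", "Det", "N"]))   # -> [['Det', 'Adj', 'N'], ['V', 'Det', 'N']]
```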
Citations: 0
Towards a Representation of Verbal Semantics – An Approach Based on Near-Synonyms
Pub Date : 1997-08-01 DOI: 10.30019/IJCLCLP.199802.0004
Mei-Chih Tsai, Chu-Ren Huang, Keh-Jiann Chen, K. Ahrens
In this paper we propose using the distributional differences in the syntactic patterns of near-synonyms to deduce the relevant components of verb meaning. Our method involves determining the distributional differences in syntactic patterns, deducing the semantic features from the syntactic phenomena, and testing the semantic features in new syntactic frames. We determine the distributional differences in syntactic patterns through the following five steps: First, we search for all instances of the verb in the corpus. Second, we classify each of these instances into its type of syntactic function. Third, we classify each of these instances into its argument structure type. Fourth, we determine the aspectual type that is associated with each verb. Lastly, we determine each verb's sentential type. Once the distributional differences have been determined, then the relevant semantic features are postulated. Our goal is to tease out the lexical semantic features as the explanation, and as the motivation of the syntactic contrasts.
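Steps one and two of this procedure amount to tabulating each verb's syntactic-pattern counts and asking whether two near-synonyms differ in distribution. Below is a small sketch of that comparison using a plain chi-square computation; the pattern labels and counts are invented, not the paper's data.

```python
# Hypothetical syntactic-pattern counts for two near-synonymous verbs.
patterns = ["subjectless", "transitive", "nominalized"]
counts = {
    "verb_A": [52, 10, 38],
    "verb_B": [12, 45, 43],
}

def chi_square(table):
    """Chi-square statistic for a verbs-by-patterns contingency table."""
    rows = list(table.values())
    col_totals = [sum(col) for col in zip(*rows)]
    row_totals = [sum(row) for row in rows]
    grand = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(rows, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand
            stat += (observed - expected) ** 2 / expected
    return stat

df = (len(counts) - 1) * (len(patterns) - 1)
print(f"chi-square = {chi_square(counts):.2f} (df = {df})")
```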
Citations: 36
An Unsupervised Iterative Method for Chinese New Lexicon Extraction
Pub Date : 1997-08-01 DOI: 10.30019/IJCLCLP.199708.0005
Jing-Shin Chang, Keh-Yih Su
An unsupervised iterative approach for extracting a new lexicon (or unknown words) from a Chinese text corpus is proposed in this paper. Instead of using a non-iterative segmentation-merging-filtering-and-disambiguation approach, the proposed method iteratively integrates the contextual constraints (among word candidates) and a joint character association metric to progressively improve the segmentation results of the input corpus (and thus the new word list.) An augmented dictionary, which includes potential unknown words (in addition to known words), is used to segment the input corpus, unlike traditional approaches which use only known words for segmentation. In the segmentation process, the augmented dictionary is used to impose contextual constraints over known words and potential unknown words within input sentences; an unsupervised Viterbi Training process is then applied to ensure that the selected potential unknown words (and known words) maximize the likelihood of the input corpus. On the other hand, the joint character association metric (which reflects the global character association characteristics across the corpus) is derived by integrating several commonly used word association metrics, such as mutual information and entropy, with a joint Gaussian mixture density function; such integration allows the filter to use multiple features simultaneously to evaluate character association, unlike traditional filters which apply multiple features independently. The proposed method then allows the contextual constraints and the joint character association metric to enhance each other; this is achieved by iteratively applying the joint association metric to truncate unlikely unknown words in the augmented dictionary and using the segmentation result to improve the estimation of the joint association metric. The refined augmented dictionary and improved estimation are then used in the next iteration to acquire better segmentation and carry out more reliable filtering. Experiments show that both the precision and recall rates are improved almost monotonically, in contrast to non-iterative segmentation-merging-filtering-and-disambiguation approaches, which often sacrifice precision for recall or vice versa. With a corpus of 311,591 sentences, the performance is 76% (bigram), 54% (trigram), and 70% (quadragram) in F-measure, which is significantly better than using the non-iterative approach with F-measures of 74% (bigram), 46% (trigram), and 58% (quadragram).
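A toy illustration of the two ingredients the abstract combines: a character-association score (plain pointwise mutual information here, rather than the paper's joint metric) proposes unknown-word candidates, and a simple dynamic-programming segmenter applies the augmented dictionary to the corpus. The corpus, weights, and thresholds are invented, and the real system iterates these steps rather than running them once.

```python
import math
from collections import Counter

# Invented toy corpus and seed dictionary; the paper's corpus has 311,591 sentences.
corpus = "电脑很快电脑很好用电脑便宜"
known_words = {"很快", "很好", "好用", "便宜"}

unigrams = Counter(corpus)
bigrams = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
n = len(corpus)

def pmi(pair):
    """Pointwise mutual information of a character pair."""
    p_xy = bigrams[pair] / (n - 1)
    p_x, p_y = unigrams[pair[0]] / n, unigrams[pair[1]] / n
    return math.log2(p_xy / (p_x * p_y))

# Candidate unknown words: frequent character pairs with high association.
candidates = {b for b in bigrams
              if bigrams[b] >= 2 and pmi(b) > 1.0 and b not in known_words}

# Augmented dictionary: known words plus the new candidates, with toy weights.
dictionary = {w: 2.0 for w in known_words}
dictionary.update({w: 1.0 for w in candidates})

def segment(text):
    """Dynamic-programming segmentation that prefers dictionary words over single characters."""
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - 2), i):
            word = text[j:i]
            score = dictionary.get(word, 0.1 if len(word) == 1 else -math.inf)
            if best[j][0] + score > best[i][0]:
                best[i] = (best[j][0] + score, j)
    words, i = [], len(text)
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return list(reversed(words))

print(sorted(candidates))
print(segment(corpus))
```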
Citations: 61
Computational Tools and Resources for Linguistic Studies
Pub Date : 1997-02-01 DOI: 10.30019/IJCLCLP.199702.0001
Y. Hsu, Jing-Shin Chang, Keh-Yih Su
This paper presents several useful computational tools and available resources to facilitate linguistic studies. For each computational tool, we demonstrate why it is useful and how it can be used for research. In addition, linguistic examples are given for illustration. First, a very useful search engine, Key Word in Context (KWIC), is introduced. This tool can automatically extract linguistically significant patterns from large corpora and help linguists discover syntagmatic generalizations. Second, Dynamic Clustering and Hierarchical Clustering are introduced for identifying natural clusters of words or phrases in distribution. Third, statistical measures that can be used to measure the degree of cohesion and correlation among linguistic units are presented. These tools can help linguists identify the boundaries of lexical units. Fourth, alignment tools for aligning parallel texts at the word, sentence and structure levels are presented for linguists who do comparative studies of different languages. Fifth, we introduce Sequential Forward Selection (SFS) and Classification and Regression Tree (CART) for automatic rule ordering. Finally, some available electronic Chinese resources are described for reference by those who are interested.
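As an example of the first tool, a key-word-in-context concordance can be written in a few lines: every occurrence of a search word is printed with a fixed window of left and right context. The sample text below is invented, and this sketch is only meant to illustrate the general idea, not the tool described in the paper.

```python
# Every occurrence of a search word is printed with a fixed window of left and right
# context; the sample text is invented.
def kwic(tokens, keyword, window=3):
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>25s}  [{token}]  {right}")
    return lines

text = ("the corpus shows that the pattern occurs whenever the verb takes "
        "a clausal complement and the pattern is blocked otherwise").split()
for line in kwic(text, "pattern"):
    print(line)
```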
Citations: 0
A Synchronous Chinese Language Corpus from Different Speech Communities: Construction and Applications
Pub Date : 1997-02-01 DOI: 10.30019/IJCLCLP.199702.0004
B. K. T'sou, Hing-lung Lin, Godfrey Liu, Terence Y. W. Chan, Jerome Hu, Ching-hai Chew, John K. P. Tse
Similar to other languages such as English, Spanish and Arabic, Chinese is used by a large number of speakers in distinct speech communities which, despite sharing the unity of language, vary in interesting ways; a systematic study of such linguistic variation is invaluable for appreciating the diversity and richness of the underlying cultures. This paper describes Project LIVAC (Linguistic Variation in Chinese Communities), which focuses on the development of a Chinese corpus based on data taken concurrently at regular intervals from multiple Chinese speech communities. The resulting database and computerized concordance, drawn from an approximately 20-million-word corpus with uniform time reference points extending across two years, enable linguists and social scientists to undertake meaningful qualitative and quantitative comparative analysis of the development of linguistic and cultural variation. To facilitate these studies, a framework for integrating the corpus with specific corpus analysis applications is proposed. Based on this framework, a prototype retrieval system, which supports longitudinal studies of word and concept distribution as well as lexical and other linguistic variation, is designed and implemented.
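A small sketch of the kind of cross-community comparison such a synchronous corpus supports: relative frequencies of the same word in two subcorpora collected over the same period. The community names, words, and counts below are invented for illustration and are not LIVAC data.

```python
from collections import Counter

# Invented word counts for two subcorpora sampled over the same period; a real study
# would draw these from the synchronous corpus itself.
subcorpora = {
    "Hong Kong": Counter({"巴士": 120, "公交车": 3, "电脑": 80}),
    "Beijing":   Counter({"巴士": 8, "公交车": 95, "电脑": 110}),
}

def relative_frequency(word):
    """Share of each subcorpus's tokens accounted for by the given word."""
    return {community: counts[word] / sum(counts.values())
            for community, counts in subcorpora.items()}

for word in ("巴士", "公交车"):
    shares = ", ".join(f"{community}: {share:.1%}"
                       for community, share in relative_frequency(word).items())
    print(f"{word}: {shares}")
```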
Citations: 13