
ACL Workshop on Natural Language Processing in the Biomedical Domain: Latest Articles

MPLUS: a probabilistic medical language understanding system
Pub Date : 2002-07-11 DOI: 10.3115/1118149.1118154
Lee M. Christensen, P. Haug, M. Fiszman
This paper describes the basic philosophy and implementation of MPLUS (M+), a robust medical text analysis tool that uses a semantic model based on Bayesian Networks (BNs). BNs provide a concise and useful formalism for representing semantic patterns in medical text, and for recognizing and reasoning over those patterns. BNs are noise-tolerant, and facilitate the training of M+.
Citations: 108
Medstract: creating large-scale information servers from biomedical texts
Pub Date : 2002-07-11 DOI: 10.3115/1118149.1118161
J. Pustejovsky, J. Castaño, Jason Zhang, R. Saurí, W. Luo
The automatic extraction of information from Medline articles and abstracts (commonly referred to now as the biobibliome) promises to play an increasingly critical role in aiding research while speeding up the discovery process. We have been developing robust natural language tools for the automated extraction of structured information from biomedical texts as part of a project we call MEDSTRACT. Here we will describe an architecture for developing databases for domain specific information servers for research and support in the biomedical community. These are currently comprised of the following: a Bio-Relation Server, and the Bio-Acronym server, Acromed, which will include also aliases. Each information server is derived automatically from an integration of diverse components which employ robust natural language processing of Medline text and IE techniques. The front-end consists of conventional search and navigation capabilities, as well as visualization tools that help to navigate the databases and explore the results of a search. It is hoped that this set of applications will allow for quick, structured access to relevant information on individual genes by biologists over the web.
Citations: 51
Contrast and variability in gene names
Pub Date : 2002-07-11 DOI: 10.3115/1118149.1118152
K. B. Cohen, A. Dolbey, G. Acquaah-Mensah, L. Hunter
We studied contrast and variability in a corpus of gene names to identify potential heuristics for use in performing entity identification in the molecular biology domain. Based on our findings, we developed heuristics for mapping weakly matching gene names to their official gene names. We then tested these heuristics against a large body of Medline abstracts, and found that using these heuristics can increase recall, with varying levels of precision. Our findings also underscored the importance of good information retrieval and of the ability to disambiguate between genes, proteins, RNA, and a variety of other referents for performing entity identification with high precision.
Citations: 55
A transformational-based learner for dependency grammars in discharge summaries
Pub Date : 2002-07-11 DOI: 10.3115/1118149.1118155
D. A. Campbell, Stephen B. Johnson
NLP systems will be more portable among medical domains if acquisition of semantic lexicons can be facilitated. We are pursuing lexical acquisition through the syntactic relationships of words in medical corpora. Therefore we require a syntactic parser which is flexible, portable, captures head-modifier pairs and does not require a large training set. We have designed a dependency grammar parser that learns through a transformational-based algorithm. We propose a novel design for templates and transformations which capitalize on the dependency structure directly and produces human-readable rules. Our parser achieved a 77% accurate parse training on only 830 sentences. Further work will evaluate the usefulness of this parse for lexical acquisition.
Citations: 12
Enhanced natural language access to anatomically-indexed data
Pub Date : 2002-07-11 DOI: 10.3115/1118149.1118156
Gail Sinclair, B. Webber, D. Davidson
We describe our use of an existing resource, the Mouse Anatomical Nomenclature, to improve a symbolic interface to anatomically-indexed gene expression data. The goal is to reduce user effort in specifying anatomical structures of interest and increase precision and recall.
Citations: 2
Accenting unknown words in a specialized language
Pub Date : 2002-07-11 DOI: 10.3115/1118149.1118153
Pierre Zweigenbaum, N. Grabar
We propose two internal methods for accenting unknown words, which both learn on a reference set of accented words the contexts of occurrence of the various accented forms of a given letter. One method is adapted from POS tagging, the other is based on finite state transducers. We show experimental results for letter e on the French version of the Medical Subject Headings thesaurus. With the best training set, the tagging method obtains a precision-recall breakeven point of 84.2±4.4% and the transducer method 83.8±4.5% (with a baseline at 64%) for the unknown words that contain this letter. A consensus combination of both increases precision to 92.0±3.7% with a recall of 75%. We perform an error analysis and discuss further steps that might help improve over the current performance.
Citations: 8
Biomedical text retrieval in languages with a complex morphology
Pub Date : 2002-07-11 DOI: 10.3115/1118149.1118158
S. Schulz, Martin Honeck, U. Hahn
Document retrieval in languages with a rich and complex morphology - particularly in terms of derivation and (single-word) composition - suffers from serious performance degradation with the stemming-only query-term-to-text-word matching paradigm. We propose an alternative approach in which morphologically complex word forms are segmented into relevant subwords (such as stems, named entities, acronyms), and subwords constitute the basic unit for indexing and retrieval. We evaluate our approach on a large biomedical document collection.
Citations: 22
Utilizing text mining results: The Pasta Web System
Pub Date : 2002-07-11 DOI: 10.3115/1118149.1118160
G. Demetriou, R. Gaizauskas
Information Extraction (IE), defined as the activity to extract structured knowledge from unstructured text sources, offers new opportunities for the exploitation of biological information contained in the vast amounts of scientific literature. But while IE technology has received increasing attention in the area of molecular biology, there have not been many examples of IE systems successfully deployed in end-user applications. We describe the development of PASTAWeb, a WWW-based interface to the extraction output of PASTA, an IE system that extracts protein structure information from MEDLINE abstracts. Key characteristics of PASTAWeb are the seamless integration of the PASTA extraction results (templates) with WWW-based technology, the dynamic generation of WWW content from 'static' data and the fusion of information extracted from multiple documents.
Citations: 17
Tagging gene and protein names in full text articles
Pub Date : 2002-07-11 DOI: 10.3115/1118149.1118151
L. Tanabe, W. Wilbur
Current information extraction efforts in the biomedical domain tend to focus on finding entities and facts in structured databases or MEDLINE® abstracts. We apply a gene and protein name tagger trained on Medline abstracts (ABGene) to a randomly selected set of full text journal articles in the biomedical domain. We show the effect of adaptations made in response to the greater heterogeneity of full text.
Citations: 65
Tuning support vector machines for biomedical named entity recognition
Pub Date : 2002-07-11 DOI: 10.3115/1118149.1118150
Jun'ichi Kazama, Takaki Makino, Yoshihiro Ohta, Junichi Tsujii
We explore the use of Support Vector Machines (SVMs) for biomedical named entity recognition. To make the SVM training with the available largest corpus - the GENIA corpus - tractable, we propose to split the non-entity class into sub-classes, using part-of-speech information. In addition, we explore new features such as word cache and the states of an HMM trained by unsupervised learning. Experiments on the GENIA corpus show that our class splitting technique not only enables the training with the GENIA corpus but also improves the accuracy. The proposed new features also contribute to improve the accuracy. We compare our SVM-based recognition system with a system using Maximum Entropy tagging method.
Citations: 283