Identifying bioentity recognition errors of rule-based text-mining systems

Francisco M. Couto, Tiago Grego, Hugo P. Bastos, Catia Pesquita, Rafael P. Torres Jiménez, Pablo Sánchez, Leandro Pascual, C. Blaschke
Published in: 2008 Third International Conference on Digital Information Management
Publication date: 2008-11-01
DOI: 10.1109/ICDIM.2008.4746791 (https://doi.org/10.1109/ICDIM.2008.4746791)
Citations: 4

Abstract

An important research topic in Bioinformatics involves the exploration of vast amounts of biological and biomedical scientific literature (BioLiterature). Over the last few decades, text-mining systems have exploited this BioLiterature to reduce the time researchers spend analyzing it. However, state-of-the-art approaches are still far from reaching performance levels acceptable to curators, and remain below the performance obtained in other domains, such as personal name recognition or news text. To achieve high levels of performance, it is essential that text-mining tools effectively recognize the bioentities present in BioLiterature. This paper presents FIBRE (Filtering Bioentity Recognition Errors), a system for automatically filtering misannotations generated by rule-based systems that automatically recognize bioentities in BioLiterature. FIBRE uses different sets of automatically generated annotations to identify the main features that characterize an annotation as being of a certain type. These features are then used to filter misannotations using a confidence threshold. FIBRE was assessed on a set of more than 17,000 documents previously annotated by Text Detective, a state-of-the-art rule-based name bioentity recognition system. Curators evaluated the gene annotations given by Text Detective that FIBRE classified as non-gene annotations, and we found that FIBRE was able to filter more than 600 misannotations with a precision above 92% while requiring minimal human effort, which demonstrates the effectiveness of FIBRE in a realistic scenario.
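
The abstract does not say how FIBRE derives its features or computes confidence, so the following is only an illustrative sketch of the general idea of filtering annotations with a confidence threshold. It assumes that features are context words and that confidence is the average share of a feature's occurrences seen with a given type. All function names, the toy data, and the threshold of 0.8 are hypothetical and are not taken from the paper.

```python
from collections import Counter


def feature_type_probs(annotation_sets):
    """Estimate P(type | feature) from sets of automatically generated annotations.

    annotation_sets maps a type label to a list of annotations; each
    annotation is a list of context-word features (an assumption here).
    """
    per_type = {
        label: Counter(f for ann in anns for f in set(ann))
        for label, anns in annotation_sets.items()
    }
    totals = Counter()
    for counts in per_type.values():
        totals.update(counts)
    return {
        label: {f: c / totals[f] for f, c in counts.items()}
        for label, counts in per_type.items()
    }


def confidence(features, probs_for_type):
    """Average P(type | feature) over distinct features; unseen features count as 0."""
    feats = set(features)
    if not feats:
        return 0.0
    return sum(probs_for_type.get(f, 0.0) for f in feats) / len(feats)


def filter_misannotations(candidates, probs_for_other_type, threshold):
    """Return the candidates whose confidence of being the other type meets the threshold."""
    return [c for c in candidates if confidence(c, probs_for_other_type) >= threshold]


# Toy annotation sets (hypothetical data, for illustration only).
annotation_sets = {
    "gene": [["expression", "mutation"], ["expression", "promoter"]],
    "non_gene": [["patient", "dose"], ["dose", "treatment", "expression"]],
}
probs = feature_type_probs(annotation_sets)

# Candidates that a recognizer labeled as genes; flag those that look non-gene.
candidates = [
    ["dose", "treatment"],
    ["expression", "dose"],
    ["expression", "promoter"],
]
flagged = filter_misannotations(candidates, probs["non_gene"], threshold=0.8)
print(flagged)
```

In this sketch, raising the threshold flags fewer annotations but makes each flag more likely to be a real misannotation, which is the kind of trade-off a precision-oriented filter has to make.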