Identifying bioentity recognition errors of rule-based text-mining systems
Francisco M. Couto, Tiago Grego, Hugo P. Bastos, Catia Pesquita, Rafael P. Torres Jiménez, Pablo Sánchez, Leandro Pascual, C. Blaschke
2008 Third International Conference on Digital Information Management, 2008-11-01
DOI: 10.1109/ICDIM.2008.4746791 (https://doi.org/10.1109/ICDIM.2008.4746791)
Citations: 4
Abstract
An important research topic in Bioinformatics involves the exploration of vast amounts of biological and biomedical scientific literature (BioLiterature). Over the last few decades, text-mining systems have exploited this BioLiterature to reduce the time researchers spend analyzing it. However, state-of-the-art approaches are still far from reaching performance levels acceptable to curators, and they remain below the performance obtained in other domains, such as personal name recognition or news text. To achieve high levels of performance, it is essential that text-mining tools effectively recognize the bioentities present in BioLiterature. This paper presents FIBRE (Filtering Bioentity Recognition Errors), a system for automatically filtering misannotations generated by rule-based systems that automatically recognize bioentities in BioLiterature. FIBRE uses different sets of automatically generated annotations to identify the main features that characterize an annotation as being of a certain type. These features are then used to filter misannotations using a confidence threshold. FIBRE was assessed on a set of more than 17,000 documents previously annotated by Text Detective, a state-of-the-art rule-based named bioentity recognition system. Curators evaluated the gene annotations given by Text Detective that FIBRE classified as non-gene annotations, and we found that FIBRE was able to filter more than 600 misannotations with a precision above 92% while requiring minimal human effort, which demonstrates the effectiveness of FIBRE in a realistic scenario.
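The abstract does not specify which features FIBRE extracts or how it computes confidence, so the following is only a minimal sketch of the general idea of filtering annotations against a confidence threshold. It assumes, purely for illustration, that each annotation carries a list of context words as features and that confidence is the average share of same-type annotations among those carrying each feature. The function names, the toy data, and the threshold value are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def feature_type_scores(annotations):
    """For each feature, compute the share of annotations carrying it that have each type.

    `annotations` is a list of (entity_type, features) pairs (hypothetical format).
    """
    counts = defaultdict(Counter)
    for entity_type, features in annotations:
        for feature in set(features):
            counts[feature][entity_type] += 1
    return {
        feature: {t: n / sum(by_type.values()) for t, n in by_type.items()}
        for feature, by_type in counts.items()
    }


def confidence(features, entity_type, scores):
    """Average the per-feature share of `entity_type`; None if no feature is known."""
    known = [scores[f].get(entity_type, 0.0) for f in set(features) if f in scores]
    return sum(known) / len(known) if known else None


def filter_misannotations(candidates, entity_type, scores, threshold):
    """Return (text, confidence) for candidates whose confidence is below the threshold."""
    flagged = []
    for text, features in candidates:
        c = confidence(features, entity_type, scores)
        if c is not None and c < threshold:
            flagged.append((text, c))
    return flagged


# Toy, made-up annotations standing in for automatically generated ones.
training = [
    ("gene", ["expression", "mutation"]),
    ("gene", ["expression", "promoter"]),
    ("chemical", ["dose", "solution"]),
    ("chemical", ["dose", "expression"]),
]
scores = feature_type_scores(training)

# Candidate annotations that the upstream recognizer labeled as genes.
candidates = [
    ("BRCA1", ["mutation", "expression"]),
    ("NaCl", ["dose", "solution"]),
]
flagged = filter_misannotations(candidates, "gene", scores, threshold=0.5)
print(flagged)
```

In this toy setting only the candidate whose context words never co-occur with gene annotations falls below the threshold. In the scenario the abstract describes, the flagged items would then go to curators for review, and the threshold trades the number of filtered annotations against precision.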