AZuRE, a scalable system for automated term disambiguation of gene and protein names.

Proceedings. IEEE Computational Systems Bioinformatics Conference, pages 415-24. Pub Date: 2004-01-01. DOI: 10.1109/csb.2004.1332454

Raf M Podowski, John G Cleary, Nicholas T Goncharoff, Gregory Amoutzias, William S Hayes


Abstract

Researchers, hindered by a lack of standard gene- and protein-naming conventions, endure long, sometimes fruitless literature searches. A system is described that automatically assigns gene names to their LocusLink IDs (LLIDs) in previously unseen MEDLINE abstracts. The system is based on supervised learning and builds a model for each LLID. The training sets for all LLIDs are extracted automatically from MEDLINE references in the LocusLink and SwissProt databases. Performance was validated for all 20,546 human genes with LLIDs. Of these, 7,344 produced good-quality models (F-measure > 0.7, nearly 60% of which were > 0.9) and 13,202 did not, mainly because too few document references were known for them. A hand validation of MEDLINE documents for a set of 66 genes agreed well with the system's internal accuracy assessment. It is concluded that high-quality gene disambiguation can be achieved using scalable automated techniques.
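
The quality thresholds quoted above (F-measure > 0.7 and > 0.9) can be illustrated with a minimal sketch. It assumes the balanced F-measure (F1, the harmonic mean of precision and recall); the abstract does not state which F-measure variant the authors used. The function names, band labels, LLID keys, and counts below are hypothetical and are not taken from the paper.

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure (F1) from true-positive, false-positive
    and false-negative document counts. Assumed variant, see lead-in."""
    if tp == 0:
        return 0.0  # no correct assignments: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def quality_band(score: float) -> str:
    """Map a score onto the two thresholds quoted in the abstract.
    The band names themselves are illustrative labels."""
    if score > 0.9:
        return "high"
    if score > 0.7:
        return "good"
    return "insufficient"


# Hypothetical per-gene evaluation counts: (tp, fp, fn).
counts = {
    "LLID_A": (45, 3, 2),
    "LLID_B": (8, 1, 3),
    "LLID_C": (2, 4, 9),
}

bands = {llid: quality_band(f_measure(*c)) for llid, c in counts.items()}

for llid, band in bands.items():
    print(f"{llid}: F = {f_measure(*counts[llid]):.3f} ({band})")
```

Under this reading, a per-LLID model counts as good quality only when its score exceeds 0.7, which is the cut that separates the 7,344 accepted models from the 13,202 rejected ones.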
