The TREC 2004 genomics track categorization task: classifying full text biomedical documents.

Aaron M Cohen, William R Hersh
{"title":"The TREC 2004 genomics track categorization task: classifying full text biomedical documents.","authors":"Aaron M Cohen,&nbsp;William R Hersh","doi":"10.1186/1747-5333-1-4","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The TREC 2004 Genomics Track focused on applying information retrieval and text mining techniques to improve the use of genomic information in biomedicine. The Genomics Track consisted of two main tasks, ad hoc retrieval and document categorization. In this paper, we describe the categorization task, which focused on the classification of full-text documents, simulating the task of curators of the Mouse Genome Informatics (MGI) system and consisting of three subtasks. One subtask of the categorization task required the triage of articles likely to have experimental evidence warranting the assignment of GO terms, while the other two subtasks were concerned with the assignment of the three top-level GO categories to each paper containing evidence for these categories.</p><p><strong>Results: </strong>The track had 33 participating groups. The mean and maximum utility measure for the triage subtask was 0.3303, with a top score of 0.6512. No system was able to substantially improve results over simply using the MeSH term Mice. Analysis of significant feature overlap between the training and test sets was found to be less than expected. Sample coverage of GO terms assigned to papers in the collection was very sparse. Determining papers containing GO term evidence will likely need to be treated as separate tasks for each concept represented in GO, and therefore require much denser sampling than was available in the data sets. The annotation subtask had a mean F-measure of 0.3824, with a top score of 0.5611. The mean F-measure for the annotation plus evidence codes subtask was 0.3676, with a top score of 0.4224. Gene name recognition was found to be of benefit for this task.</p><p><strong>Conclusion: </strong>Automated classification of documents for GO annotation is a challenging task, as was the automated extraction of GO code hierarchies and evidence codes. However, automating these tasks would provide substantial benefit to biomedical curation, and therefore work in this area must continue. Additional experience will allow comparison and further analysis about which algorithmic features are most useful in biomedical document classification, and better understanding of the task characteristics that make automated classification feasible and useful for biomedical document curation. The TREC Genomics Track will be continuing in 2005 focusing on a wider range of triage tasks and improving results from 2004.</p>","PeriodicalId":87404,"journal":{"name":"Journal of biomedical discovery and collaboration","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2006-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/1747-5333-1-4","citationCount":"51","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of biomedical discovery and collaboration","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1186/1747-5333-1-4","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 51

Abstract

Background: The TREC 2004 Genomics Track focused on applying information retrieval and text mining techniques to improve the use of genomic information in biomedicine. The Genomics Track consisted of two main tasks, ad hoc retrieval and document categorization. In this paper, we describe the categorization task, which focused on the classification of full-text documents, simulating the task of curators of the Mouse Genome Informatics (MGI) system and consisting of three subtasks. One subtask of the categorization task required the triage of articles likely to have experimental evidence warranting the assignment of GO terms, while the other two subtasks were concerned with the assignment of the three top-level GO categories to each paper containing evidence for these categories.

Results: The track had 33 participating groups. The mean and maximum utility measure for the triage subtask was 0.3303, with a top score of 0.6512. No system was able to substantially improve results over simply using the MeSH term Mice. Analysis of significant feature overlap between the training and test sets was found to be less than expected. Sample coverage of GO terms assigned to papers in the collection was very sparse. Determining papers containing GO term evidence will likely need to be treated as separate tasks for each concept represented in GO, and therefore require much denser sampling than was available in the data sets. The annotation subtask had a mean F-measure of 0.3824, with a top score of 0.5611. The mean F-measure for the annotation plus evidence codes subtask was 0.3676, with a top score of 0.4224. Gene name recognition was found to be of benefit for this task.

Conclusion: Automated classification of documents for GO annotation is a challenging task, as was the automated extraction of GO code hierarchies and evidence codes. However, automating these tasks would provide substantial benefit to biomedical curation, and therefore work in this area must continue. Additional experience will allow comparison and further analysis about which algorithmic features are most useful in biomedical document classification, and better understanding of the task characteristics that make automated classification feasible and useful for biomedical document curation. The TREC Genomics Track will be continuing in 2005 focusing on a wider range of triage tasks and improving results from 2004.

Abstract Image

Abstract Image

Abstract Image

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
TREC 2004基因组跟踪分类任务:分类全文生物医学文献。
背景:TREC 2004基因组学专题会议的重点是应用信息检索和文本挖掘技术来提高基因组信息在生物医学中的应用。基因组学轨道包括两个主要任务,特别检索和文档分类。在本文中,我们描述了以全文文档分类为重点的分类任务,该任务模拟了小鼠基因组信息学(MGI)系统管理员的任务,由三个子任务组成。分类任务的一个子任务要求对可能有实验证据支持GO术语分配的文章进行分类,而其他两个子任务涉及将三个顶级GO类别分配给包含这些类别证据的每篇论文。结果:该赛道共有33个参与组。分诊子任务的均值和最大效用测度为0.3303,最高得分为0.6512。没有一种系统能够比简单地使用MeSH术语小鼠显著改善结果。对训练集和测试集之间显著特征重叠的分析发现比预期的要少。在收集中,分配给论文的GO术语的样本覆盖率非常少。确定包含GO术语证据的论文可能需要被视为GO中表示的每个概念的单独任务,因此需要比数据集中可用的更密集的采样。注释子任务的平均f值为0.3824,最高f值为0.5611。注释加证据码子任务的平均f值为0.3676,最高得分为0.4224。基因名称识别被发现对这项任务有好处。结论:GO注释文档的自动分类是一项具有挑战性的任务,GO代码层次和证据代码的自动提取也是一项具有挑战性的任务。然而,自动化这些任务将为生物医学管理提供实质性的好处,因此这一领域的工作必须继续下去。额外的经验将允许比较和进一步分析哪些算法特征在生物医学文档分类中最有用,并更好地理解使自动分类对生物医学文档管理可行和有用的任务特征。TREC基因组学跟踪将在2005年继续,重点放在更广泛的分类任务上,并从2004年开始改进结果。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Two Similarity Metrics for Medical Subject Headings (MeSH): An Aid to Biomedical Text Mining and Author Name Disambiguation. The language of discovery. Bias associated with mining electronic health records. Literature-based Resurrection of Neglected Medical Discoveries. A cognitive task analysis of a visual analytic workflow: Exploring molecular interaction networks in systems biology.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1