Learning Biological Sequence Types Using the Literature
Mohamed Reda Bouadjenek, Karin M. Verspoor, J. Zobel
Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017-11-06
DOI: https://doi.org/10.1145/3132847.3133051
Citations: 3
Abstract
In this paper, we explore automatic classification of biological sequence types for records in biological sequence databases. The sequence type attribute provides important information about the nature of the sequence represented in a record, and is often used in search to filter out irrelevant sequences. However, the sequence type attribute is generally a non-mandatory free-text field, and is therefore prone to errors such as typos, mis-assignment, and non-assignment. In GenBank, this problem affects roughly 18% of records, a proportion large enough to concern the biocuration community. To address it, we propose using the literature associated with sequence records as an external source of knowledge for the classification task. We define a set of literature-based features and train a machine learning algorithm to classify a record into one of six primary sequence types. The main intuition is that sequences appear to be discussed differently in scientific articles depending on their type. Our experiments on the PubMed Central collection show that the literature is indeed an effective resource for sequence type classification: our method reached an accuracy of 92.7% and substantially outperformed two baseline approaches used for comparison.
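The abstract does not say which literature-based features or which learning algorithm were used, so the following is only a minimal sketch of the general idea: learn from text that discusses a sequence and predict its type. It assumes bag-of-words features and a multinomial naive Bayes classifier as stand-ins. The class name, the three example labels, and all training sentences are invented for illustration and do not come from the paper or from GenBank.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase the text and split on any non-alphanumeric character.
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one smoothing (a stand-in model,
    not the algorithm used in the paper)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical article snippets that mention a sequence record, each paired
# with an invented sequence type label.
train_texts = [
    "the genomic DNA was amplified by PCR and the chromosome region sequenced",
    "genomic DNA fragments from the chromosome were cloned",
    "the mRNA transcript was reverse transcribed and expression of the transcript measured",
    "expression levels of the mRNA transcript were quantified",
    "16S rRNA ribosomal gene sequences were used for phylogenetic analysis",
    "ribosomal rRNA sequences supported the phylogenetic tree",
]
train_labels = ["DNA", "DNA", "mRNA", "mRNA", "rRNA", "rRNA"]

clf = NaiveBayesTextClassifier().fit(train_texts, train_labels)
print(clf.predict("ribosomal sequences for phylogenetic analysis"))
```

The toy data mirrors the stated intuition that sequences of different types are discussed with different vocabulary. A realistic setup would need the actual features, labels, and evaluation protocol described in the full paper.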