Protein annotators' assistant: A novel application of information retrieval techniques

M. Wise
Journal of the American Society for Information Science and Technology, pp. 1131-1136
Published: 2000-10-01
DOI: https://doi.org/10.1002/1097-4571(2000)9999:9999%3C::AID-ASI1020%3E3.0.CO;2-F
Citations: 3

Abstract

The Protein Annotators' Assistant (PAA) (http://www.ebi.ac.uk/paa/) is a software system that helps protein annotators assign functions to newly sequenced proteins. PAA works backward from two inputs: SwissProt, a database that describes known proteins, and a prior sequence similarity search that returns a list of known proteins similar to a query. From these it suggests keywords and phrases that may describe functions performed by the query protein. In a preprocessing step, a database is built from the protein names that appear in SwissProt, and against each protein are listed the keywords and phrases extracted from the corresponding text records. Common words, whether from general English usage or from the biological domain, are removed as the phrases are assembled. This process is assisted by a simple stemming algorithm, which extends the list of stop-words (i.e., reject words), together with a list of accept-words. At runtime, the search algorithm, invoked by a user via a Web interface, takes a list of protein names and clusters the named proteins around keywords/phrases shared by members of the list. The assumption is that if these proteins have a particular keyword/phrase in common, and they are related to a query protein, then the keyword/phrase may also describe the query. Overall, PAA employs a number of IR techniques in a novel setting and is thus related to text categorization in which multiple categories may be suggested, except that here none of the categories is specified in advance.
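
The two stages described in the abstract (phrase extraction with stop-words, stemming and accept-words, followed by clustering of search hits around shared phrases) can be illustrated with a small Python sketch. This is not PAA's implementation: the word lists, the suffix-stripping rule, the rule that accept-words override stemmed stop-word matches, the protein names, the record texts, and all function names are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical word lists; the abstract does not give PAA's actual lists.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "is", "to", "protein", "bind"}
ACCEPT_WORDS = {"binding"}  # assumed to override a stemmed stop-word match
SUFFIXES = ("ing", "es", "ed", "s")


def stem(word):
    """Strip one common suffix: a toy stand-in for a simple stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def is_rejected(word):
    """Reject a word if it, or its stem, is a stop-word (unless accepted)."""
    if word in ACCEPT_WORDS:
        return False
    return word in STOP_WORDS or stem(word) in STOP_WORDS


def extract_phrases(text):
    """Split a text record into phrases, breaking at every rejected word."""
    phrases, current = set(), []
    for raw in text.lower().split():
        word = raw.strip(".,;:()")
        if not word or is_rejected(word):
            if current:
                phrases.add(" ".join(current))
                current = []
        else:
            current.append(word)
    if current:
        phrases.add(" ".join(current))
    return phrases


def cluster_by_phrase(hit_names, database, min_members=2):
    """Group the hit proteins under each phrase that several of them share."""
    clusters = defaultdict(set)
    for name in hit_names:
        for phrase in database.get(name, ()):
            clusters[phrase].add(name)
    shared = {p: m for p, m in clusters.items() if len(m) >= min_members}
    # Largest clusters first, ties broken alphabetically by phrase.
    return sorted(shared.items(), key=lambda item: (-len(item[1]), item[0]))


# Invented records standing in for SwissProt text entries.
records = {
    "PROT_A": "ATP binding and the phosphorylation of proteins",
    "PROT_B": "ATP binding in membranes",
    "PROT_C": "phosphorylation of a membrane",
}

# Preprocessing step: protein name -> extracted phrases.
database = {name: extract_phrases(text) for name, text in records.items()}

# Runtime step: the hit list would come from a sequence similarity search.
suggestions = cluster_by_phrase(["PROT_A", "PROT_B", "PROT_C"], database)
for phrase, members in suggestions:
    print(phrase, sorted(members))
```

In this toy run the phrases shared by at least two hits are the candidate descriptions for the query. Phrases are stored unstemmed here, so "membrane" and "membranes" do not merge; whether PAA normalizes phrases in that way is not stated in the abstract.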