
Workshop on Biomedical Natural Language Processing: Latest Publications

A Biomedical Question Answering System in BioASQ 2017
Pub Date : 2017-08-01 DOI: 10.18653/v1/W17-2337
Mourad Sarrouti, Said Ouatik El Alaoui
Question answering, the identification of short accurate answers to users' questions, is a longstanding challenge widely studied over the last decades in the open domain. However, it still requires further efforts in the biomedical domain. In this paper, we describe our participation in phase B of task 5b in the 2017 BioASQ challenge using our biomedical question answering system. Our system, dealing with four types of questions (i.e., yes/no, factoid, list, and summary), is based on (1) a dictionary-based approach for generating the exact answers of yes/no questions, (2) the UMLS Metathesaurus and a term frequency metric for extracting the exact answers of factoid and list questions, and (3) the BM25 model and UMLS concepts for retrieving the ideal answers (i.e., paragraph-sized summaries). Preliminary results show that our system achieves good and competitive results in both exact and ideal answer extraction tasks as compared with the participating systems.
Citations: 16
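The abstract above uses BM25 to retrieve paragraph-sized ideal answers from candidate snippets. A minimal stdlib-only sketch of BM25 scoring (the `k1`/`b` values, whitespace tokenizer, and toy snippets are illustrative assumptions, not the authors' implementation):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with BM25 (illustrative parameters)."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    # document frequency per term
    df = Counter()
    for d in tokenized:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "bm25 ranks candidate snippets for the ideal answer",
    "umls concepts help filter candidate snippets",
    "unrelated sentence about weather",
]
print(bm25_scores("ideal answer snippets", docs))
```

The snippet sharing all three query terms scores highest; a snippet with no query terms scores zero.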
Detecting Personal Medication Intake in Twitter: An Annotated Corpus and Baseline Classification System
Pub Date : 2017-08-01 DOI: 10.18653/v1/W17-2316
A. Klein, A. Sarker, Masoud Rouhizadeh, K. O’Connor, Graciela Gonzalez
Social media sites (e.g., Twitter) have been used for surveillance of drug safety at the population level, but studies that focus on the effects of medications on specific sets of individuals have had to rely on other sources of data. Mining social media data for this information would require the ability to distinguish indications of personal medication intake in this media. Towards that end, this paper presents an annotated corpus that can be used to train machine learning systems to determine whether a tweet that mentions a medication indicates that the individual posting has taken that medication at a specific time. To demonstrate the utility of the corpus as a training set, we present baseline results of supervised classification.
Citations: 42
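As a rough illustration of the kind of baseline supervised classifier such a corpus supports, here is a bag-of-words multinomial Naive Bayes sketch. The labels, tokenizer, and toy tweets are hypothetical; the paper's actual baseline features are not reproduced here.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """Multinomial Naive Bayes with add-one smoothing over bag-of-words features."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((word_counts[label][w] + 1) / denom)  # smoothed likelihood
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Toy training tweets (hypothetical examples, not from the released corpus).
train = [
    ("i took my ibuprofen this morning", "intake"),
    ("just took two aspirin for this headache", "intake"),
    ("aspirin is bad for your stomach they say", "no_intake"),
    ("doctors recommend ibuprofen for fever", "no_intake"),
]
model = train_nb(train)
print(predict_nb(model, "took my aspirin today"))
```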
BioCreative VI Precision Medicine Track: creating a training corpus for mining protein-protein interactions affected by mutations
Pub Date : 2017-08-01 DOI: 10.18653/v1/W17-2321
R. Dogan, A. Chatr-aryamontri, Sun Kim, Chih-Hsuan Wei, Yifan Peng, Donald C. Comeau, Zhiyong Lu
The Precision Medicine Track in BioCreative VI aims to bring together the BioNLP community for a novel challenge focused on mining the biomedical literature in search of mutations and protein-protein interactions (PPI). In order to support this track with an effective training dataset with limited curator time, the track organizers carefully reviewed PubMed articles from two different sources: curated public PPI databases, and the results of state-of-the-art public text mining tools. We detail here the data collection, manual review and annotation process, and describe this training corpus's characteristics. We also describe a corpus performance baseline. This analysis will provide useful information to developers and researchers for comparing and developing innovative text mining approaches for the BioCreative VI challenge and other Precision Medicine related applications.
Citations: 22
Detecting Dementia through Retrospective Analysis of Routine Blog Posts by Bloggers with Dementia
Pub Date : 2017-08-01 DOI: 10.18653/v1/W17-2329
Vaden Masrani, Gabriel Murray, Thalia Shoshana Field, G. Carenini
We investigate if writers with dementia can be automatically distinguished from those without by analyzing linguistic markers in written text, in the form of blog posts. We have built a corpus of several thousand blog posts, some by people with dementia and others by people with loved ones with dementia. We use this dataset to train and test several machine learning methods, and achieve prediction performance at a level far above the baseline.
Citations: 25
Creation and evaluation of a dictionary-based tagger for virus species and proteins
Pub Date : 2017-08-01 DOI: 10.18653/v1/W17-2311
H. Cook, R. Berzins, Cristina Leal Rodriguez, J. M. Cejuela, L. Jensen
Text mining automatically extracts information from the literature with the goal of making it available for further analysis, for example by incorporating it into biomedical databases. A key first step towards this goal is to identify and normalize the named entities, such as proteins and species, which are mentioned in text. Despite the large detrimental impact that viruses have on human and agricultural health, very little previous text-mining work has focused on identifying virus species and proteins in the literature. Here, we present an improved dictionary-based system for viral species and the first dictionary for viral proteins, which we benchmark on a new corpus of 300 manually annotated abstracts. We achieve 81.0% precision and 72.7% recall at the task of recognizing and normalizing viral species and 76.2% precision and 34.9% recall on viral proteins. These results are achieved despite the many challenges involved with the names of viral species and, especially, proteins. This work provides a foundation that can be used to extract more complicated relations about viruses from the literature.
Citations: 2
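A dictionary-based tagger of this kind typically scans text with greedy longest-match lookup against the name dictionary. A sketch under that assumption (the two-entry dictionary, taxonomy identifiers, and punctuation stripping are illustrative, not the authors' resource):

```python
def tag_with_dictionary(text, dictionary):
    """Greedy longest-match tagging: at each position, prefer the longest entry.

    `dictionary` maps lowercase names to normalized identifiers; the entries
    used below are illustrative, not the authors' viral-species dictionary.
    """
    tokens = text.split()
    lowered = [t.lower().strip(".,;()") for t in tokens]
    max_len = max(len(name.split()) for name in dictionary)
    hits, i = [], 0
    while i < len(tokens):
        match = None
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(lowered[i:i + span])
            if candidate in dictionary:
                match = (candidate, dictionary[candidate], i, i + span)
                break
        if match:
            hits.append(match)
            i = match[3]  # jump past the matched span
        else:
            i += 1
    return hits

virus_dict = {
    "influenza a virus": "NCBITaxon:11320",
    "zika virus": "NCBITaxon:64320",
}
text = "Infection with Zika virus was compared to Influenza A virus replication."
print(tag_with_dictionary(text, virus_dict))
```

The longest-match loop is what keeps "Influenza A virus" from being tagged as just a stray "virus" mention.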
Tagging Funding Agencies and Grants in Scientific Articles using Sequential Learning Models
Pub Date : 2017-08-01 DOI: 10.18653/v1/W17-2327
S. Kayal, Z. Afzal, G. Tsatsaronis, S. Katrenko, Pascal Coupet, Marius A. Doornenbal, M. Gregory
In this paper we present a solution for tagging funding bodies and grants in scientific articles using a combination of trained sequential learning models, namely conditional random fields (CRF), hidden markov models (HMM) and maximum entropy models (MaxEnt), on a benchmark set created in-house. We apply the trained models to address the BioASQ challenge 5c, which is a newly introduced task that aims to solve the problem of funding information extraction from scientific articles. Results in the dry-run data set of BioASQ task 5c show that the suggested approach can achieve a micro-recall of more than 85% in tagging both funding bodies and grants.
Citations: 6
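Of the three sequential models combined in the paper, the HMM is the simplest to sketch: Viterbi decoding recovers the most likely tag sequence for a sentence. The states and all probabilities below are toy values, not the trained models from the paper:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Standard Viterbi decoding: most likely state sequence for `obs`."""
    # V[t][s] = (best probability of reaching state s at time t, back-pointer)
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 1e-6), None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s].get(obs[t], 1e-6), p)
                for p in states
            )
            V[t][s] = (prob, prev)
    # backtrack from the best final state
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

# Toy funding-tagging model (all probabilities are illustrative assumptions).
states = ["FUNDER", "O"]
start_p = {"FUNDER": 0.3, "O": 0.7}
trans_p = {"FUNDER": {"FUNDER": 0.6, "O": 0.4}, "O": {"FUNDER": 0.2, "O": 0.8}}
emit_p = {
    "FUNDER": {"nih": 0.5, "nsf": 0.4},
    "O": {"funded": 0.3, "by": 0.3, "grant": 0.2},
}
print(viterbi(["funded", "by", "nih"], states, start_p, trans_p, emit_p))
```

Unseen words get a small floor probability (1e-6) rather than zero, a common smoothing shortcut.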
Extracting Drug-Drug Interactions with Attention CNNs
Pub Date : 2017-08-01 DOI: 10.18653/v1/W17-2302
Masaki Asada, Makoto Miwa, Yutaka Sasaki
We propose a novel attention mechanism for a Convolutional Neural Network (CNN)-based Drug-Drug Interaction (DDI) extraction model. CNNs have been shown to have great potential on DDI extraction tasks; however, attention mechanisms, which emphasize important words in the sentence of a target-entity pair, have not been investigated with CNNs, despite the fact that attention mechanisms are shown to be effective for general-domain relation classification. We evaluated our model on Task 9.2 of the DDIExtraction-2013 shared task. As a result, our attention mechanism improved the performance of our base CNN-based DDI model, and the model achieved an F-score of 69.12%, which is competitive with the state-of-the-art models.
Citations: 32
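The core of any such attention mechanism is a softmax over per-word relevance scores, used to weight word vectors into a single context vector. A stdlib sketch of that generic step (toy 2-d vectors and scores; not the paper's exact formulation):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def attend(word_vectors, relevance_scores):
    """Weight word vectors by softmax(relevance) and sum into one context vector."""
    weights = softmax(relevance_scores)
    dim = len(word_vectors[0])
    context = [0.0] * dim
    for w, vec in zip(weights, word_vectors):
        for j in range(dim):
            context[j] += w * vec[j]
    return weights, context

# Toy 2-d "embeddings" for a 3-word sentence; the middle word scores highest,
# so it dominates the context vector.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend(vectors, [0.1, 2.0, 0.1])
print(weights, context)
```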
Macquarie University at BioASQ 5b – Query-based Summarisation Techniques for Selecting the Ideal Answers
Pub Date : 2017-06-07 DOI: 10.18653/v1/W17-2308
Diego Mollá Aliod
Macquarie University’s contribution to the BioASQ challenge (Task 5b Phase B) focused on the use of query-based extractive summarisation techniques for the generation of the ideal answers. Four runs were submitted, with approaches ranging from a trivial system that selected the first n snippets, to the use of deep learning approaches under a regression framework. Our experiments and the ROUGE results of the five test batches of BioASQ indicate surprisingly good results for the trivial approach. Overall, most of our runs on the first three test batches achieved the best ROUGE-SU4 results in the challenge.
Citations: 11
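The contrast drawn in the abstract, between the trivial first-n baseline and query-based snippet selection, can be sketched as follows. The cosine-over-term-counts ranker is a generic stand-in, not the paper's regression models, and the snippets are toy examples:

```python
import math
from collections import Counter

def first_n_summary(snippets, n=3):
    """The trivial baseline: take the first n snippets as the ideal answer."""
    return " ".join(snippets[:n])

def rank_by_query(snippets, query):
    """Rank snippets by cosine similarity of term-count vectors to the query."""
    def vec(text):
        return Counter(text.lower().split())

    def cosine(a, b):
        num = sum(a[t] * b[t] for t in set(a) & set(b))
        den = (math.sqrt(sum(v * v for v in a.values()))
               * math.sqrt(sum(v * v for v in b.values())))
        return num / den if den else 0.0

    q = vec(query)
    return sorted(snippets, key=lambda s: cosine(vec(s), q), reverse=True)

snippets = [
    "Gene X is expressed in liver tissue.",
    "The ideal answer summarises the relevant snippets.",
    "Protein Y binds gene X promoters.",
]
print(first_n_summary(snippets, 1))
print(rank_by_query(snippets, "ideal answer snippets")[0])
```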
Deep learning for extracting protein-protein interactions from biomedical literature
Pub Date : 2017-06-05 DOI: 10.18653/v1/W17-2304
Yifan Peng, Zhiyong Lu
State-of-the-art methods for protein-protein interaction (PPI) extraction are primarily feature-based or kernel-based by leveraging lexical and syntactic information. But how to incorporate such knowledge in the recent deep learning methods remains an open question. In this paper, we propose a multichannel dependency-based convolutional neural network model (McDepCNN). It applies one channel to the embedding vector of each word in the sentence, and another channel to the embedding vector of the head of the corresponding word. Therefore, the model can use richer information obtained from different channels. Experiments on two public benchmarking datasets, AIMed and BioInfer, demonstrate that McDepCNN provides up to 6% F1-score improvement over rich feature-based methods and single-kernel methods. In addition, McDepCNN achieves 24.4% relative improvement in F1-score over the state-of-the-art methods on cross-corpus evaluation and 12% improvement in F1-score over kernel-based methods on “difficult” instances. These results suggest that McDepCNN generalizes more easily over different corpora, and is capable of capturing long distance features in the sentences.
Citations: 93
Enhancing Automatic ICD-9-CM Code Assignment for Medical Texts with PubMed
Pub Date : 2017-05-22 DOI: 10.18653/v1/W17-2333
Danchen Zhang, Daqing He, Sanqiang Zhao, Lei Li
Assigning a standard ICD-9-CM code to disease symptoms in medical texts is an important task in the medical domain. Automating this process could greatly reduce the costs. However, the effectiveness of an automatic ICD-9-CM code classifier faces a serious problem caused by unbalanced training data. Frequent diseases often have more training data, which helps their classifiers perform better than those for infrequent diseases. However, a disease's frequency does not necessarily reflect its importance. To resolve this training-data shortage, we propose to strategically draw data from PubMed to enrich the training data when needed. We validate our method on the CMC dataset, and the evaluation results indicate that our method can significantly improve the code assignment classifiers' performance at the macro-averaging level.
Citations: 21
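The proposed enrichment can be approximated as topping up under-represented ICD-9-CM codes from an external pool until each reaches a minimum count. A sketch under that assumption (the example texts, codes, and threshold are toy values; the paper's PubMed query strategy is not reproduced):

```python
from collections import Counter

def augment_rare_classes(train, external_pool, min_count):
    """Top up under-represented labels with examples drawn from an external pool.

    `train` and `external_pool` are lists of (text, label) pairs; the pool stands
    in for PubMed-derived examples (hypothetical data, not the paper's procedure).
    """
    counts = Counter(label for _, label in train)
    augmented = list(train)
    for text, label in external_pool:
        if counts[label] < min_count:  # only rare classes draw from the pool
            augmented.append((text, label))
            counts[label] += 1
    return augmented

train = [
    ("cough and fever", "786.2"),
    ("cough", "786.2"),
    ("rare symptom", "277.0"),
]
pool = [
    ("pubmed abstract on the rare disorder", "277.0"),
    ("another common cough case", "786.2"),
]
result = augment_rare_classes(train, pool, min_count=2)
print(len(result))
```

Only the rare code 277.0 draws from the pool; the already-frequent 786.2 is left alone, which is the point of frequency-aware enrichment.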