H. Chun, Chisato Yamasaki, Naomi Saichi, Masayuki Tanaka, T. Hishiki, T. Imanishi, T. Gojobori, Jin-Dong Kim, Junichi Tsujii, T. Takagi
This paper presents a novel prediction approach for protein sub-cellular localization. We have incorporated text and sequence-based approaches.
提出了一种新的蛋白质亚细胞定位预测方法。我们结合了文本和基于序列的方法。
{"title":"Prediction of Protein Sub-cellular Localization using Information from Texts and Sequences.","authors":"H. Chun, Chisato Yamasaki, Naomi Saichi, Masayuki Tanaka, T. Hishiki, T. Imanishi, T. Gojobori, Jin-Dong Kim, Junichi Tsujii, T. Takagi","doi":"10.3115/1572306.1572324","DOIUrl":"https://doi.org/10.3115/1572306.1572324","url":null,"abstract":"This paper presents a novel prediction approach for protein sub-cellular localization. We have incorporated text and sequence-based approaches.","PeriodicalId":200974,"journal":{"name":"Workshop on Biomedical Natural Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130805553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mark Stevenson, Yikun Guo, R. Gaizauskas, David Martínez
Like text in other domains, biomedical documents contain a range of terms with more than one possible meaning. These ambiguities form a significant obstacle to the automatic processing of biomedical texts. Previous approaches to resolving this problem have made use of a variety of knowledge sources including linguistic information (from the context in which the ambiguous term is used) and domain-specific resources (such as UMLS). In this paper we compare a range of knowledge sources which have been previously used and introduce a novel one: MeSH terms. The best performance is obtained using linguistic features in combination with MeSH terms. Results from our system outperform published results for previously reported systems on a standard test set (the NLM-WSD corpus).
{"title":"Knowledge Sources for Word Sense Disambiguation of Biomedical Text","authors":"Mark Stevenson, Yikun Guo, R. Gaizauskas, David Martínez","doi":"10.3115/1572306.1572321","DOIUrl":"https://doi.org/10.3115/1572306.1572321","url":null,"abstract":"Like text in other domains, biomedical documents contain a range of terms with more than one possible meaning. These ambiguities form a significant obstacle to the automatic processing of biomedical texts. Previous approaches to resolving this problem have made use of a variety of knowledge sources including linguistic information (from the context in which the ambiguous term is used) and domain-specific resources (such as UMLS). In this paper we compare a range of knowledge sources which have been previously used and introduce a novel one: MeSH terms. The best performance is obtained using linguistic features in combination with MeSH terms. Results from our system outperform published results for previously reported systems on a standard test set (the NLM-WSD corpus).","PeriodicalId":200974,"journal":{"name":"Workshop on Biomedical Natural Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131375842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hong Yu, Nadya Frid, S. McRoy, R. Prasad, Alan Lee, A. Joshi
The goal of the Penn Discourse Treebank (PDTB) project is to develop a large-scale corpus, annotated with coherence relations marked by discourse connectives. Currently, the primary application of the PDTB annotation has been to news articles. In this study, we tested whether the PDTB guidelines can be adapted to a different genre. We annotated discourse connectives and their arguments in one 4,937-token full-text biomedical article. Two linguist annotators showed an agreement of 85% after simple conventions were added. For the remaining 15% cases, we found that biomedical domain-specific knowledge is needed to capture the linguistic cues that can be used to resolve inter-annotator disagreement. We found that the two annotators were able to reach an agreement after discussion. Thus our experiments suggest that the PDTB annotation can be adapted to new domains by minimally adjusting the guidelines and by adding some further domain-specific linguistic cues.
{"title":"A Pilot Annotation to Investigate Discourse Connectivity in Biomedical Text","authors":"Hong Yu, Nadya Frid, S. McRoy, R. Prasad, Alan Lee, A. Joshi","doi":"10.3115/1572306.1572325","DOIUrl":"https://doi.org/10.3115/1572306.1572325","url":null,"abstract":"The goal of the Penn Discourse Treebank (PDTB) project is to develop a large-scale corpus, annotated with coherence relations marked by discourse connectives. Currently, the primary application of the PDTB annotation has been to news articles. In this study, we tested whether the PDTB guidelines can be adapted to a different genre. We annotated discourse connectives and their arguments in one 4,937-token full-text biomedical article. Two linguist annotators showed an agreement of 85% after simple conventions were added. For the remaining 15% cases, we found that biomedical domain-specific knowledge is needed to capture the linguistic cues that can be used to resolve inter-annotator disagreement. We found that the two annotators were able to reach an agreement after discussion. Thus our experiments suggest that the PDTB annotation can be adapted to new domains by minimally adjusting the guidelines and by adding some further domain-specific linguistic cues.","PeriodicalId":200974,"journal":{"name":"Workshop on Biomedical Natural Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130050004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a comparative study between two machine learning methods, Conditional Random Fields and Support Vector Machines for clinical named entity recognition. We explore their applicability to clinical domain. Evaluation against a set of gold standard named entities shows that CRFs outperform SVMs. The best F-score with CRFs is 0.86 and for the SVMs is 0.64 as compared to a baseline of 0.60.
{"title":"Conditional Random Fields and Support Vector Machines for Disorder Named Entity Recognition in Clinical Texts","authors":"Dingcheng Li, G. Savova, K. Schuler","doi":"10.3115/1572306.1572326","DOIUrl":"https://doi.org/10.3115/1572306.1572326","url":null,"abstract":"We present a comparative study between two machine learning methods, Conditional Random Fields and Support Vector Machines for clinical named entity recognition. We explore their applicability to clinical domain. Evaluation against a set of gold standard named entities shows that CRFs outperform SVMs. The best F-score with CRFs is 0.86 and for the SVMs is 0.64 as compared to a baseline of 0.60.","PeriodicalId":200974,"journal":{"name":"Workshop on Biomedical Natural Language Processing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115545257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Neves, M. Chagoyen, J. Carazo, A. Pascual-Montano
This work proposes a case-based classifier to tackle the gene/protein mention problem in biomedical literature. The so called gene mention problem consists of the recognition of gene and protein entities in scientific texts. A classification process aiming at deciding if a term is a gene mention or not is carried out for each word in the text. It is based on the selection of the best or most similar case in a base of known and unknown cases. The approach was evaluated on several datasets for different organisms and results show the suitability of this approach for the gene mention problem.
{"title":"CBR-Tagger: a case-based reasoning approach to the gene/protein mention problem","authors":"M. Neves, M. Chagoyen, J. Carazo, A. Pascual-Montano","doi":"10.3115/1572306.1572333","DOIUrl":"https://doi.org/10.3115/1572306.1572333","url":null,"abstract":"This work proposes a case-based classifier to tackle the gene/protein mention problem in biomedical literature. The so called gene mention problem consists of the recognition of gene and protein entities in scientific texts. A classification process aiming at deciding if a term is a gene mention or not is carried out for each word in the text. It is based on the selection of the best or most similar case in a base of known and unknown cases. The approach was evaluated on several datasets for different organisms and results show the suitability of this approach for the gene mention problem.","PeriodicalId":200974,"journal":{"name":"Workshop on Biomedical Natural Language Processing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121479359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biomedical information extraction tasks are often more complex and contain uncertainty at each step during problem solving processes. We present an adaptive information extraction framework and demonstrate how to explore uncertainty using feedback integration.
{"title":"Adaptive Information Extraction for Complex Biomedical Tasks","authors":"D. Feng, Gully A. Burns, E. Hovy","doi":"10.3115/1572306.1572339","DOIUrl":"https://doi.org/10.3115/1572306.1572339","url":null,"abstract":"Biomedical information extraction tasks are often more complex and contain uncertainty at each step during problem solving processes. We present an adaptive information extraction framework and demonstrate how to explore uncertainty using feedback integration.","PeriodicalId":200974,"journal":{"name":"Workshop on Biomedical Natural Language Processing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115207302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Clinical E-Science Framework (CLEF) project has built a system to extract clinically significant information from the textual component of medical records, for clinical research, evidence-based healthcare and genotype-meets-phenotype informatics. One part of this system is the identification of relationships between clinically important entities in the text. Typical approaches to relationship extraction in this domain have used full parses, domain-specific grammars, and large knowledge bases encoding domain knowledge. In other areas of biomedical NLP, statistical machine learning approaches are now routinely applied to relationship extraction. We report on the novel application of these statistical techniques to clinical relationships. We describe a supervised machine learning system, trained with a corpus of oncology narratives hand-annotated with clinically important relationships. Various shallow features are extracted from these texts, and used to train statistical classifiers. We compare the suitability of these features for clinical relationship extraction, how extraction varies between inter- and intra-sentential relationships, and examine the amount of training data needed to learn various relationships.
{"title":"Extracting Clinical Relationships from Patient Narratives","authors":"A. Roberts, R. Gaizauskas, Mark Hepple","doi":"10.3115/1572306.1572309","DOIUrl":"https://doi.org/10.3115/1572306.1572309","url":null,"abstract":"The Clinical E-Science Framework (CLEF) project has built a system to extract clinically significant information from the textual component of medical records, for clinical research, evidence-based healthcare and genotype-meets-phenotype informatics. One part of this system is the identification of relationships between clinically important entities in the text. Typical approaches to relationship extraction in this domain have used full parses, domain-specific grammars, and large knowledge bases encoding domain knowledge. In other areas of biomedical NLP, statistical machine learning approaches are now routinely applied to relationship extraction. We report on the novel application of these statistical techniques to clinical relationships. \u0000 \u0000We describe a supervised machine learning system, trained with a corpus of oncology narratives hand-annotated with clinically important relationships. Various shallow features are extracted from these texts, and used to train statistical classifiers. We compare the suitability of these features for clinical relationship extraction, how extraction varies between inter- and intra-sentential relationships, and examine the amount of training data needed to learn various relationships.","PeriodicalId":200974,"journal":{"name":"Workshop on Biomedical Natural Language Processing","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132998206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yue Wang, Kazuhiro Yoshida, Jin-Dong Kim, Rune Saetre, Junichi Tsujii
While there are several corpora which claim to have annotations for protein references, the heterogeneity between the annotations is recognized as an obstacle to develop expensive resources in a synergistic way. Here we present a series of experimental results which show the differences of protein mention annotations made to two corpora, GENIA and AImed.
{"title":"Raising the Compatibility of Heterogeneous Annotations: A Case Study on","authors":"Yue Wang, Kazuhiro Yoshida, Jin-Dong Kim, Rune Saetre, Junichi Tsujii","doi":"10.3115/1572306.1572338","DOIUrl":"https://doi.org/10.3115/1572306.1572338","url":null,"abstract":"While there are several corpora which claim to have annotations for protein references, the heterogeneity between the annotations is recognized as an obstacle to develop expensive resources in a synergistic way. Here we present a series of experimental results which show the differences of protein mention annotations made to two corpora, GENIA and AImed.","PeriodicalId":200974,"journal":{"name":"Workshop on Biomedical Natural Language Processing","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123271773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 1900-01-01DOI: 10.18653/v1/2023.bionlp-1.17
Vinayak Arannil, Tomal Deb, Atanu Roy
Early identification of Adverse Drug Events (ADE) is critical for taking prompt actions while introducing new drugs into the market. These ADEs information are available through various unstructured data sources like clinical study reports, patient health records, social media posts, etc. Extracting ADEs and the related suspect drugs using machine learning is a challenging task due to the complex linguistic relations between drug ADE pairs in textual data and unavailability of large corpus of labelled datasets. This paper introduces ADEQA, a question- answer(QA) based approach using quasi supervised labelled data and sequence-to-sequence transformers to extract ADEs, drug suspects and the relationships between them. Unlike traditional QA models, natural language generation (NLG) based models don’t require extensive token level labelling and thereby reduces the adoption barrier significantly. On a public ADE corpus, we were able to achieve state-of-the-art results with an F1 score of 94% on establishing the relationships between ADEs and the respective suspects.
{"title":"ADEQA: A Question Answer based approach for joint ADE-Suspect Extraction using Sequence-To-Sequence Transformers","authors":"Vinayak Arannil, Tomal Deb, Atanu Roy","doi":"10.18653/v1/2023.bionlp-1.17","DOIUrl":"https://doi.org/10.18653/v1/2023.bionlp-1.17","url":null,"abstract":"Early identification of Adverse Drug Events (ADE) is critical for taking prompt actions while introducing new drugs into the market. These ADEs information are available through various unstructured data sources like clinical study reports, patient health records, social media posts, etc. Extracting ADEs and the related suspect drugs using machine learning is a challenging task due to the complex linguistic relations between drug ADE pairs in textual data and unavailability of large corpus of labelled datasets. This paper introduces ADEQA, a question- answer(QA) based approach using quasi supervised labelled data and sequence-to-sequence transformers to extract ADEs, drug suspects and the relationships between them. Unlike traditional QA models, natural language generation (NLG) based models don’t require extensive token level labelling and thereby reduces the adoption barrier significantly. On a public ADE corpus, we were able to achieve state-of-the-art results with an F1 score of 94% on establishing the relationships between ADEs and the respective suspects.","PeriodicalId":200974,"journal":{"name":"Workshop on Biomedical Natural Language Processing","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123415083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wael Salloum, Greg P. Finley, Erik Edwards, Mark Miller, David Suendermann-Oeft
Dictated medical reports very often feature a preamble containing metainformation about the report such as patient and physician names, location and name of the clinic, date of procedure, and so on. In the medical transcription process, the preamble is usually omitted from the final report, as it contains information already available in the electronic medical record. We present a method which is able to automatically identify preambles in medical dictations. The method makes use of stateof-the-art NLP techniques including word embeddings and Bi-LSTMs and achieves preamble detection performance superior to humans.
{"title":"Automated Preamble Detection in Dictated Medical Reports","authors":"Wael Salloum, Greg P. Finley, Erik Edwards, Mark Miller, David Suendermann-Oeft","doi":"10.18653/v1/W17-2336","DOIUrl":"https://doi.org/10.18653/v1/W17-2336","url":null,"abstract":"Dictated medical reports very often feature a preamble containing metainformation about the report such as patient and physician names, location and name of the clinic, date of procedure, and so on. In the medical transcription process, the preamble is usually omitted from the final report, as it contains information already available in the electronic medical record. We present a method which is able to automatically identify preambles in medical dictations. The method makes use of stateof-the-art NLP techniques including word embeddings and Bi-LSTMs and achieves preamble detection performance superior to humans.","PeriodicalId":200974,"journal":{"name":"Workshop on Biomedical Natural Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115542861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}