
Workshop on Biomedical Natural Language Processing: Latest Publications

Using Bottleneck Adapters to Identify Cancer in Clinical Notes under Low-Resource Constraints
Pub Date : 2022-10-17 DOI: 10.48550/arXiv.2210.09440
Omid Rohanian, Hannah Jauncey, Mohammadmahdi Nouriborji, Bronner P. Gonccalves, C. Kartsonaki, Isaric Clinical Characterisation Group, L. Merson, D. Clifton
Processing information locked within clinical health records is a challenging task that remains an active area of research in biomedical NLP. In this work, we evaluate a broad set of machine learning techniques, ranging from simple RNNs to specialised transformers such as BioBERT, on a dataset containing clinical notes along with a set of annotations indicating whether a sample is cancer-related or not. Furthermore, we specifically employ efficient fine-tuning methods from NLP, namely bottleneck adapters and prompt tuning, to adapt the models to our specialised task. Our evaluations suggest that fine-tuning a frozen, general-language BERT model with bottleneck adapters outperforms all other strategies, including full fine-tuning of the specialised BioBERT model. Based on our findings, we suggest that using bottleneck adapters in low-resource situations, with limited access to labelled data or processing capacity, could be a viable strategy in biomedical text mining.
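The core of the adapter approach can be sketched in a few lines: a small trainable down-projection and up-projection wrapped in a residual connection, inserted into an otherwise frozen encoder. The dimensions and zero-initialisation below are illustrative assumptions, not the paper's reported configuration.

```python
import numpy as np

def bottleneck_adapter(h, W_down, W_up):
    """Adapter layer: down-project, non-linearity, up-project, residual add.
    Only W_down and W_up are trained; the surrounding BERT weights stay frozen."""
    z = np.maximum(h @ W_down, 0.0)  # down-projection + ReLU
    return h + z @ W_up              # up-projection + residual connection

rng = np.random.default_rng(0)
d_model, d_bottleneck = 768, 64               # illustrative sizes
h = rng.standard_normal((4, d_model))         # hidden states for 4 tokens
W_down = 0.02 * rng.standard_normal((d_model, d_bottleneck))
W_up = np.zeros((d_bottleneck, d_model))      # zero-init: adapter starts as identity
out = bottleneck_adapter(h, W_down, W_up)
```

Because the up-projection starts at zero, training begins from the frozen model's behaviour and departs from it only gradually, which is one reason adapters work well with little labelled data.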
Citations: 2
ICDBigBird: A Contextual Embedding Model for ICD Code Classification
Pub Date : 2022-04-21 DOI: 10.48550/arXiv.2204.10408
George Michalopoulos, Michal Malyska, Nicola Sahar, Alexander Wong, Helen H. Chen
The International Classification of Diseases (ICD) system is the international standard for classifying diseases and procedures during a healthcare encounter and is widely used for healthcare reporting and management purposes. Assigning correct codes for clinical procedures is important for clinical, operational and financial decision-making in healthcare. Contextual word embedding models have achieved state-of-the-art results in multiple NLP tasks. However, these models have yet to achieve state-of-the-art results in the ICD classification task, since one of their main disadvantages is that they can only process documents containing a small number of tokens, which is rarely the case with real patient notes. In this paper, we introduce ICDBigBird, a BigBird-based model that integrates a Graph Convolutional Network (GCN), which exploits the relations between ICD codes to create 'enriched' representations of their embeddings, with a BigBird contextual model that can process larger documents. Our experiments on a real-world clinical dataset demonstrate the effectiveness of our BigBird-based model on the ICD classification task as it outperforms the previous state-of-the-art models.
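The graph-convolution step that "enriches" code embeddings can be sketched with the standard normalised propagation rule. The toy adjacency matrix and dimensions below are assumptions for illustration, not the paper's actual ICD graph or architecture.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W),
    mixing each ICD code's embedding with its neighbours'."""
    A_hat = A + np.eye(A.shape[0])                       # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W, 0.0)

# toy graph over 4 ICD codes; edges stand in for code-code relations
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.standard_normal((4, 8))    # initial code embeddings
W = 0.1 * rng.standard_normal((8, 8))
H_enriched = gcn_layer(A, H, W)    # 'enriched' code representations
```

The enriched code representations can then be scored against the BigBird document encoding to produce per-code classification logits.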
Citations: 10
SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study
Pub Date : 2022-04-14 DOI: 10.48550/arXiv.2204.06699
Samuel Cahyawijaya, Tiezheng Yu, Zihan Liu, Tiffany Mak, Xiaopu Zhou, N. Ip, Pascale Fung
Self-supervised pre-training methods have brought remarkable breakthroughs in the understanding of text, image, and speech. Recent developments in genomics have also adopted these pre-training methods for genome understanding. However, they focus only on understanding haploid sequences, which hinders their applicability to understanding genetic variations, also known as single nucleotide polymorphisms (SNPs), which are crucial for genome-wide association studies. In this paper, we introduce SNP2Vec, a scalable self-supervised pre-training approach for understanding SNPs. We apply SNP2Vec to perform long-sequence genomics modeling, and we evaluate the effectiveness of our approach on predicting Alzheimer's disease risk in a Chinese cohort. Our approach significantly outperforms existing polygenic risk score methods and all other baselines, including the model that is trained entirely with haploid sequences.
Citations: 3
Doctor XAvIer: Explainable Diagnosis on Physician-Patient Dialogues and XAI Evaluation
Pub Date : 2022-04-11 DOI: 10.48550/arXiv.2204.10178
Hillary Ngai, Frank Rudzicz
We introduce Doctor XAvIer — a BERT-based diagnostic system that extracts relevant clinical data from transcribed patient-doctor dialogues and explains predictions using feature attribution methods. We present a novel performance plot and evaluation metric for feature attribution methods: the Feature Attribution Dropping (FAD) curve and its Normalized Area Under the Curve (N-AUC). FAD curve analysis shows that integrated gradients outperforms Shapley values in explaining diagnosis classification. Doctor XAvIer outperforms the baseline with a 0.97 F1-score in named entity recognition and symptom pertinence classification and a 0.91 F1-score in diagnosis classification.
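The FAD curve idea is concrete enough to sketch: drop features in descending attribution order, re-score after each drop, and summarise the curve by a normalised area. The normalisation used below (dividing by the area of a flat curve at the initial score) is our assumption; the listing gives only the metric's name.

```python
import numpy as np

def fad_curve(attributions, score_fn):
    """Drop features in descending attribution order and re-score after each
    drop; a faithful attribution method should make the curve fall quickly."""
    order = np.argsort(attributions)[::-1]          # most important feature first
    keep = np.ones(len(attributions), dtype=bool)
    curve = [score_fn(keep)]
    for idx in order:
        keep[idx] = False
        curve.append(score_fn(keep))
    return np.array(curve)

def normalized_auc(curve):
    """Trapezoidal area under the FAD curve, scaled by the area of a flat
    curve at the initial score (one plausible normalisation to [0, 1])."""
    area = ((curve[:-1] + curve[1:]) / 2.0).sum()
    return area / (curve[0] * (len(curve) - 1))

# toy 'model' whose score is simply the summed weight of the kept features
weights = np.array([0.5, 0.3, 0.15, 0.05])
curve = fad_curve(weights, lambda keep: float(weights[keep].sum()))
n_auc = normalized_auc(curve)
```

A smaller N-AUC indicates that the attribution method ranked genuinely important features first, since removing them degrades the score early.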
Citations: 3
Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts
Pub Date : 2022-04-10 DOI: 10.48550/arXiv.2204.04775
Saadullah Amin, Noon Pokaratsiri Goldstein, M. Wixted, Alejandro Garc'ia-Rudolph, Catalina Mart'inez-Costa, G. Neumann
Despite the advances in digital healthcare systems offering curated structured knowledge, much of the critical information still lies in large volumes of unlabeled and unstructured clinical texts. These texts, which often contain protected health information (PHI), are exposed to information extraction tools for downstream applications, risking patient identification. Existing work in de-identification relies on large-scale annotated corpora in English, which are often not suitable for real-world multilingual settings. Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings. In this work, we empirically show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to solve a low-resource and real-world challenge of code-mixed (Spanish-Catalan) clinical notes de-identification in the stroke domain. We annotate a gold evaluation dataset to assess few-shot setting performance, where we only use a few hundred labeled examples for training. Our model improves the zero-shot F1-score from 73.7% to 91.2% on the gold evaluation set when adapting Multilingual BERT (mBERT) (CITATION) from the MEDDOCAN (CITATION) corpus with our few-shot cross-lingual target corpus. When generalized to an out-of-sample test set, the best model achieves a human-evaluation F1-score of 97.2%.
Citations: 4
BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model
Pub Date : 2022-04-08 DOI: 10.48550/arXiv.2204.03905
Hongyi Yuan, Zheng Yuan, Ruyi Gan, Jiaxing Zhang, Yutao Xie, Sheng Yu
Pretrained language models have served as important backbones for natural language processing. Recently, in-domain pretraining has been shown to benefit various domain-specific downstream tasks. In the biomedical domain, natural language generation (NLG) tasks are of critical importance, yet remain understudied. Approaching natural language understanding (NLU) tasks as NLG achieves satisfactory performance in the general domain through constrained language generation or language prompting. We emphasize that the lack of in-domain generative language models and of systematic generative downstream benchmarks in the biomedical domain hinders the development of the research community. In this work, we introduce the generative language model BioBART, which adapts BART to the biomedical domain. We collate various biomedical language generation tasks, including dialogue, summarization, entity linking, and named entity recognition. BioBART, pretrained on PubMed abstracts, has enhanced performance compared to BART and sets strong baselines on several tasks. Furthermore, we conduct ablation studies on the pretraining tasks for BioBART and find that sentence permutation has negative effects on downstream tasks.
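The sentence-permutation objective the ablation studies examine comes from BART's denoising pretraining: sentence order is shuffled and the model is trained to restore it. A minimal sketch of the noising side (the sentence splitter and example text are illustrative, not the paper's pipeline):

```python
import random

def sentence_permutation(doc, seed=0):
    """BART-style noising: shuffle the order of a document's sentences.
    A denoising seq2seq model is then trained to reconstruct the original."""
    sents = [s.strip() for s in doc.split(".") if s.strip()]  # naive splitter
    rng = random.Random(seed)
    rng.shuffle(sents)
    return ". ".join(sents) + "."

doc = ("BioBART adapts BART to the biomedical domain. "
       "It is pretrained on PubMed abstracts. "
       "It targets generation tasks.")
noised = sentence_permutation(doc)
```

The noising keeps every sentence intact but destroys discourse order, which is exactly the signal the ablation suggests is counterproductive for biomedical downstream tasks.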
Citations: 55
A sequence-to-sequence approach for document-level relation extraction
Pub Date : 2022-04-03 DOI: 10.48550/arXiv.2204.01098
John Giorgi, Gary D Bader, Bo Wang
Motivated by the fact that many relations cross the sentence boundary, there has been increasing interest in document-level relation extraction (DocRE). DocRE requires integrating information within and across sentences, capturing complex interactions between mentions of entities. Most existing methods are pipeline-based, requiring entities as input. However, jointly learning to extract entities and relations can improve performance and be more efficient due to shared parameters and training steps. In this paper, we develop a sequence-to-sequence approach, seq2rel, that can learn the subtasks of DocRE (entity extraction, coreference resolution and relation extraction) end-to-end, replacing a pipeline of task-specific components. Using a simple strategy we call entity hinting, we compare our approach to existing pipeline-based methods on several popular biomedical datasets, in some cases exceeding their performance. We also report the first end-to-end results on these datasets for future comparison. Finally, we demonstrate that, under our model, an end-to-end approach outperforms a pipeline-based approach. Our code, data and trained models are available at https://github.com/johngiorgi/seq2rel. An online demo is available at https://share.streamlit.io/johngiorgi/seq2rel/main/demo.py.
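The seq2seq framing rests on two simple transformations: serialising relations into a flat target string the decoder learns to emit, and, for entity hinting, prepending the document's entity mentions to the source text. The special tokens and schema below are illustrative assumptions rather than the paper's exact format (the linked repository documents the real one).

```python
def linearize(relations):
    """Serialise (head, relation, tail) triples into one target string
    that a seq2seq decoder can be trained to emit."""
    return " ".join(f"{h} @HEAD@ {t} @TAIL@ @{r}@" for h, r, t in relations)

def entity_hint(text, entities):
    """Prepend entity mentions to the source text, so the decoder focuses
    on coreference and relation typing rather than entity detection."""
    return "; ".join(entities) + " @SEP@ " + text

target = linearize([("anakinra", "CID", "sepsis")])
source = entity_hint("Anakinra treatment was complicated by sepsis ...",
                     ["anakinra", "sepsis"])
```

Training then reduces to ordinary teacher-forced generation from `source` to `target`, with no pipeline stages to chain or separate entity model to maintain.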
Citations: 20
Automatic Biomedical Term Clustering by Learning Fine-grained Term Representations
Pub Date : 2022-04-01 DOI: 10.48550/arXiv.2204.00391
Sihang Zeng, Zheng Yuan, Sheng Yu
Term clustering is important in biomedical knowledge graph construction. Using similarities between term embeddings is helpful for term clustering. State-of-the-art term embeddings leverage pretrained language models to encode terms, and use synonyms and relation knowledge from knowledge graphs to guide contrastive learning. These embeddings provide close embeddings for terms belonging to the same concept. However, our probing experiments show that these embeddings are not sensitive to minor textual differences, which leads to failures in biomedical term clustering. To alleviate this problem, we adjust the sampling strategy in pretraining term embeddings by providing dynamic hard positive and negative samples during contrastive learning to learn fine-grained representations, which results in better biomedical term clustering. We name our proposed method CODER++, and it has been applied to clustering biomedical concepts in the newly released Biomedical Knowledge Graph named BIOS.
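The "dynamic hard negatives" idea can be sketched as a mining step between training rounds: for each term, the terms most similar under the current embeddings that belong to a *different* concept become the next round's negatives. The toy vectors and k below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def mine_hard_negatives(embeddings, concept_ids, k=1):
    """For each term, return the k most cosine-similar terms belonging to a
    different concept: hard negatives for the next contrastive round."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    hard = []
    for i, cid in enumerate(concept_ids):
        candidates = np.array([j for j, c in enumerate(concept_ids) if c != cid])
        ranked = candidates[np.argsort(sim[i, candidates])[::-1]]
        hard.append(ranked[:k].tolist())
    return hard

emb = np.array([[1.0, 0.00],   # term 0, concept A
                [0.99, 0.10],  # term 1, concept B (textually close to term 0)
                [0.0, 1.00]])  # term 2, concept B
negatives = mine_hard_negatives(emb, ["A", "B", "B"])
```

Because mining is redone as the embeddings move, the negatives stay hard throughout training, which is what pushes the model to separate near-identical surface forms from different concepts.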
Citations: 3
Position-based Prompting for Health Outcome Generation
Pub Date : 2022-03-30 DOI: 10.48550/arXiv.2204.03489
Micheal Abaho, D. Bollegala, P. Williamson, S. Dodd
Probing factual knowledge in Pre-trained Language Models (PLMs) using prompts has indirectly implied that language models (LMs) can be treated as knowledge bases. To this end, this phenomenon has proven effective, especially when these LMs are fine-tuned not just towards the data but also towards the style or linguistic pattern of the prompts themselves. We observe that satisfying a particular linguistic pattern in prompts is an unsustainable, time-consuming constraint in the probing task, especially because prompts are often manually designed and the range of possible prompt template patterns can vary depending on the prompting task. To alleviate this constraint, we propose using a position-attention mechanism to capture the positional information of each word in a prompt relative to the mask to be filled, hence avoiding the need to re-construct prompts when the prompts' linguistic pattern changes. Using our approach, we demonstrate the ability to elicit answers (in a case study on health outcome generation) not only for common prompt templates like Cloze and Prefix but also for rare ones, such as Postfix and Mixed patterns, whose masks are respectively at the start and in multiple random places of the prompt. Moreover, using various biomedical PLMs, our approach consistently outperforms a baseline in which the default PLM representation is used to predict masked tokens.
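One way to read the position-attention mechanism: index each prompt token by its signed offset from the mask slot, so the same learned weights apply whether the mask sits mid-prompt (Cloze) or at the end (Postfix). The sketch below is our interpretation of that idea, with hypothetical example prompts, not the authors' implementation.

```python
def positions_relative_to_mask(tokens, mask_token="[MASK]"):
    """Signed offset of each prompt token from the mask to be filled;
    attention over these offsets is invariant to the template's layout."""
    m = tokens.index(mask_token)
    return [i - m for i in range(len(tokens))]

# the same query phrased as a Cloze prompt and as a Postfix-style prompt
cloze   = ["aspirin", "reduces", "[MASK]", "in", "patients"]
postfix = ["outcome", "of", "aspirin", ":", "[MASK]"]
rel_cloze = positions_relative_to_mask(cloze)
rel_postfix = positions_relative_to_mask(postfix)
```

Under this indexing, "the token two places before the mask" is the same feature in both templates, which is what removes the need to redesign prompts per pattern.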
Citations: 4
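The position-attention idea in the abstract above — scoring each prompt token by its position relative to the mask, so the same model handles Cloze, Prefix, Postfix, and Mixed templates without rewriting prompts — can be sketched minimally as follows. The fixed distance-decay weighting and all function names here are illustrative assumptions, not the paper's learned mechanism:

```python
import numpy as np

def mask_relative_positions(tokens, mask_token="[MASK]"):
    """Signed distance of every token from the mask to be filled."""
    m = tokens.index(mask_token)
    return np.array([i - m for i in range(len(tokens))])

def position_attention(tokens, embeddings, mask_token="[MASK]", scale=1.0):
    """Pool token embeddings with softmax weights that decay with distance
    from the mask, so nearby context dominates regardless of whether the
    mask sits at the start, middle, or end of the prompt."""
    rel = mask_relative_positions(tokens, mask_token)
    scores = -np.abs(rel) / scale          # closer to the mask -> higher score
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ embeddings, weights

# A Postfix-style prompt: the mask is at the very start.
tokens = ["[MASK]", "was", "reduced", "by", "the", "intervention"]
emb = np.random.default_rng(0).normal(size=(len(tokens), 8))
pooled, w = position_attention(tokens, emb)
```

The same call works unchanged on Cloze or Mixed prompts, which is the point: only the relative positions change, not the prompt template.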
Slot Filling for Biomedical Information Extraction
Pub Date : 2021-09-17 DOI: 10.18653/v1/2022.bionlp-1.7
Yannis Papanikolaou, Francine Bennett
Information Extraction (IE) from text refers to the task of extracting structured knowledge from unstructured text. The task typically consists of a series of sub-tasks such as Named Entity Recognition and Relation Extraction. Sourcing entity- and relation-type-specific training data is a major bottleneck in domains with limited resources such as biomedicine. In this work we present a slot filling approach to biomedical IE, effectively removing the need for entity- and relation-specific training data and allowing us to handle zero-shot settings. We follow the recently proposed paradigm of coupling a Transformer-based bi-encoder, Dense Passage Retrieval, with a Transformer-based reading comprehension model to extract relations from biomedical text. We assemble a biomedical slot filling dataset for both retrieval and reading comprehension and conduct a series of experiments demonstrating that our approach outperforms a number of simpler baselines. We also evaluate our approach end-to-end in standard as well as zero-shot settings. Our work provides a fresh perspective on how to solve biomedical IE tasks in the absence of relevant training data. Our code, models and datasets are available at https://github.com/tba.
{"title":"Slot Filling for Biomedical Information Extraction","authors":"Yannis Papanikolaou, Francine Bennett","doi":"10.18653/v1/2022.bionlp-1.7","DOIUrl":"https://doi.org/10.18653/v1/2022.bionlp-1.7","url":null,"abstract":"Information Extraction (IE) from text refers to the task of extracting structured knowledge from unstructured text. The task typically consists of a series of sub-tasks such as Named Entity Recognition and Relation Extraction. Sourcing entity and relation type specific training data is a major bottleneck in domains with limited resources such as biomedicine. In this work we present a slot filling approach to the task of biomedical IE, effectively replacing the need for entity and relation-specific training data, allowing us to deal with zero-shot settings. We follow the recently proposed paradigm of coupling a Transformer-based bi-encoder, Dense Passage Retrieval, with a Transformer-based reading comprehension model to extract relations from biomedical text. We assemble a biomedical slot filling dataset for both retrieval and reading comprehension and conduct a series of experiments demonstrating that our approach outperforms a number of simpler baselines. We also evaluate our approach end-to-end for standard as well as zero-shot settings. Our work provides a fresh perspective on how to solve biomedical IE tasks, in the absence of relevant training data. Our code, models and datasets are available at https://github.com/tba.","PeriodicalId":200974,"journal":{"name":"Workshop on Biomedical Natural Language Processing","volume":"61 10","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132694517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
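The retrieve-then-read pipeline the abstract above describes can be sketched as below. The hashed bag-of-words `embed` is a toy stand-in for the Dense Passage Retrieval bi-encoder, and the reading-comprehension step is reduced to returning the top passage; all names and the example passages are illustrative, not from the paper:

```python
import zlib
import numpy as np

def embed(text, dim=2048):
    """Toy stand-in for a Transformer bi-encoder: hash words into a
    normalized bag-of-words vector (deterministic via crc32)."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def retrieve(query, passages, k=1):
    """Dense-retrieval step: rank passages by similarity to the query."""
    q = embed(query)
    sims = [float(q @ embed(p)) for p in passages]
    order = np.argsort(sims)[::-1][:k]
    return [passages[i] for i in order]

def fill_slot(subject, relation, passages):
    """Zero-shot slot filling: query with "subject [SEP] relation" and
    retrieve supporting text; a reading-comprehension model would then
    extract the object span from the top passage."""
    query = f"{subject} [SEP] {relation}"
    return retrieve(query, passages, k=1)[0]

passages = [
    "Imatinib inhibits the BCR-ABL tyrosine kinase.",
    "Aspirin reduces the risk of myocardial infarction.",
]
best = fill_slot("Imatinib", "inhibits", passages)
```

Because no entity- or relation-specific training data enters the pipeline, a new relation type only changes the query string — the zero-shot property the paper exploits.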
Journal
Workshop on Biomedical Natural Language Processing