
Proceedings of the Conference on Empirical Methods in Natural Language Processing: Latest Publications

LIDDIA: Language-based Intelligent Drug Discovery Agent.
Reza Averly, Frazier N Baker, Ian A Watson, Xia Ning

Drug discovery is a long, expensive, and complex process, relying heavily on human medicinal chemists, who can spend years searching the vast space of potential therapies. Recent advances in artificial intelligence for chemistry have sought to expedite individual drug discovery tasks; however, there remains a critical need for an intelligent agent that can navigate the drug discovery process. Towards this end, we introduce LIDDiA, an autonomous agent capable of intelligently navigating the drug discovery process in silico. By leveraging the reasoning capabilities of large language models, LIDDiA serves as a low-cost and highly-adaptable tool for autonomous drug discovery. We comprehensively examine LIDDiA, demonstrating that (1) it can generate molecules meeting key pharmaceutical criteria on over 70% of 30 clinically relevant targets, (2) it intelligently balances exploration and exploitation in the chemical space, and (3) it identifies one promising novel candidate on AR/NR3C4, a critical target for both prostate and breast cancers. Code and dataset are available at https://github.com/ninglab/LIDDiA.
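The abstract does not spell out the agent's internals, but the exploration-exploitation idea can be pictured with a toy generate-and-evaluate loop. This is only a minimal Python sketch under assumed interfaces: `propose` and `score` are placeholders for the LLM-driven molecule design and the pharmaceutical-criteria evaluation, neither of which is specified here.

```python
import random

def design_loop(propose, score, rounds=10, explore_prob=0.3, seed=0):
    """Toy exploration/exploitation loop over candidate molecules.

    propose(parent) -> candidate : placeholder for LLM-driven design
                                   (parent=None means a fresh, exploratory design)
    score(candidate) -> float    : placeholder for pharmaceutical-criteria scoring
    """
    rng = random.Random(seed)
    best, best_score, history = None, float("-inf"), []
    for _ in range(rounds):
        parent = None if best is None or rng.random() < explore_prob else best
        candidate = propose(parent)   # explore (fresh design) or exploit (refine best)
        s = score(candidate)
        history.append((candidate, s))
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score, history
```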

{"title":"LIDDIA: Language-based Intelligent Drug Discovery Agent.","authors":"Reza Averly, Frazier N Baker, Ian A Watson, Xia Ning","doi":"10.18653/v1/2025.emnlp-main.603","DOIUrl":"10.18653/v1/2025.emnlp-main.603","url":null,"abstract":"<p><p>Drug discovery is a long, expensive, and complex process, relying heavily on human medicinal chemists, who can spend years searching the vast space of potential therapies. Recent advances in artificial intelligence for chemistry have sought to expedite individual drug discovery tasks; however, there remains a critical need for an intelligent agent that can navigate the drug discovery process. Towards this end, we introduce LIDDiA, an autonomous agent capable of intelligently navigating the drug discovery process <i>in silico</i>. By leveraging the reasoning capabilities of large language models, LIDDiA serves as a low-cost and highly-adaptable tool for autonomous drug discovery. We comprehensively examine LIDDiA, demonstrating that (1) it can generate molecules meeting key pharmaceutical criteria on over 70% of 30 clinically relevant targets, (2) it intelligently balances exploration and exploitation in the chemical space, and (3) it identifies one promising novel candidate on AR/NR3C4, a critical target for both prostate and breast cancers. Code and dataset are available at https://github.com/ninglab/LIDDiA.</p>","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2025 ","pages":"12015-12039"},"PeriodicalIF":0.0,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12765491/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145907494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Chatbot To Help Patients Understand Their Health.
Won Seok Jang, Hieu Tran, Manav Mistry, SaiKiran Gandluri, Yifan Zhang, Sharmin Sultana, Sunjae Kwon, Yuan Zhang, Zonghai Yao, Hong Yu

Patients must possess the knowledge necessary to actively participate in their care. We present NoteAid-Chatbot, a conversational AI that promotes patient understanding via a novel 'learning as conversation' framework, built on a multi-agent large language model (LLM) and reinforcement learning (RL) setup without human-labeled data. NoteAid-Chatbot was built on a lightweight 3B-parameter LLaMA 3.2 model trained in two stages: initial supervised fine-tuning on conversational data synthetically generated using medical conversation strategies, followed by RL with rewards derived from patient understanding assessments in simulated hospital discharge scenarios. Our evaluation, which includes comprehensive human-aligned assessments and case studies, demonstrates that NoteAid-Chatbot exhibits key emergent behaviors critical for patient education, such as clarity, relevance, and structured dialogue, even though it received no explicit supervision for these attributes. Our results show that even simple Proximal Policy Optimization (PPO)-based reward modeling can successfully train lightweight, domain-specific chatbots to handle multi-turn interactions, incorporate diverse educational strategies, and meet nuanced communication objectives. Our Turing test demonstrates that NoteAid-Chatbot surpasses non-expert humans. Although our current focus is on healthcare, the framework we present illustrates the feasibility and promise of applying low-cost, PPO-based RL to realistic, open-ended conversational domains, broadening the applicability of RL-based alignment methods.
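As a rough illustration of the reward signal described above (rewards derived from patient-understanding assessments), the sketch below scores a teaching dialogue by quizzing a simulated patient and returning the fraction of correct answers; such a scalar could then be fed to any PPO trainer. The `simulate_patient` and `grade` callables are assumptions standing in for LLM-based judges, not the paper's exact components.

```python
def understanding_reward(dialogue, quiz_questions, simulate_patient, grade):
    """Reward = fraction of comprehension questions a simulated patient answers
    correctly after reading the dialogue.

    simulate_patient(dialogue, question) -> str   # placeholder patient model
    grade(question, answer) -> bool               # placeholder answer judge
    """
    if not quiz_questions:
        return 0.0
    answers = [simulate_patient(dialogue, q) for q in quiz_questions]
    correct = sum(bool(grade(q, a)) for q, a in zip(quiz_questions, answers))
    return correct / len(quiz_questions)
```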

{"title":"Chatbot To Help Patients Understand Their Health.","authors":"Won Seok Jang, Hieu Tran, Manav Mistry, SaiKiran Gandluri, Yifan Zhang, Sharmin Sultana, Sunjae Kwon, Yuan Zhang, Zonghai Yao, Hong Yu","doi":"10.18653/v1/2025.findings-emnlp.351","DOIUrl":"10.18653/v1/2025.findings-emnlp.351","url":null,"abstract":"<p><p>Patients must possess the knowledge necessary to actively participate in their care. We present NoteAid-Chatbot, a conversational AI that promotes patient understanding via a novel 'learning as conversation' framework, built on a multi-agent large language model (LLM) and reinforcement learning (RL) setup without human-labeled data. NoteAid-Chatbot was built on a lightweight 3B-parameter LLaMA 3.2 model trained in two stages: initial supervised fine-tuning on conversational data synthetically generated using medical conversation strategies, followed by RL with rewards derived from patient understanding assessments in simulated hospital discharge scenarios. Our evaluation, which includes comprehensive human-aligned assessments and case studies, demonstrates that NoteAid-Chatbot exhibits key emergent behaviors critical for patient education-such as clarity, relevance, and structured dialogue-even though it received no explicit supervision for these attributes. Our results show that even simple Proximal Policy Optimization (PPO)-based reward modeling can successfully train lightweight, domain-specific chatbots to handle multi-turn interactions, incorporate diverse educational strategies, and meet nuanced communication objectives. Our Turing test demonstrates that NoteAid-Chatbot surpasses non-expert human. Although our current focus is on healthcare, the framework we present illustrates the feasibility and promise of applying low-cost, PPO-based RL to realistic, open-ended conversational domains-broadening the applicability of RL-based alignment methods.</p>","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"EMNLP 2025 ","pages":"6598-6627"},"PeriodicalIF":0.0,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12716312/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145806672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Simple Yet Effective: An Information-Theoretic Approach to Multi-LLM Uncertainty Quantification.
Maya Kruse, Majid Afshar, Saksham Khatwani, Anoop Mayampurath, Guanhua Chen, Yanjun Gao

Large language models (LLMs) often behave inconsistently across inputs, indicating uncertainty and motivating the need for its quantification in high-stakes settings. Prior work on calibration and uncertainty quantification often focuses on individual models, overlooking the potential of model diversity. We hypothesize that LLMs make complementary predictions due to differences in training and the Zipfian nature of language, and that aggregating their outputs leads to more reliable uncertainty estimates. To leverage this, we propose MUSE (Multi-LLM Uncertainty via Subset Ensembles), a simple information-theoretic method that uses Jensen-Shannon Divergence to identify and aggregate well-calibrated subsets of LLMs. Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naïve ensemble baselines. In addition, we explore using MUSE as guided signals with chain-of-thought distillation to fine-tune LLMs for calibration. MUSE is available at: https://github.com/LARK-NLP-Lab/MUSE.
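The core aggregation step can be sketched with plain NumPy: compute pairwise Jensen-Shannon divergence between the models' output distributions, keep the models that agree most, and average their positive-class probabilities. The median-JSD cut-off below is an illustrative stand-in, not the paper's actual subset-selection rule.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def subset_ensemble(model_probs):
    """model_probs: {model_name: [p_negative, p_positive]} for one input.
    Keeps models with low mean JSD to the others and averages their p_positive."""
    names = list(model_probs)
    mean_jsd = {a: np.mean([js_divergence(model_probs[a], model_probs[b])
                            for b in names if b != a]) for a in names}
    cutoff = np.median(list(mean_jsd.values()))
    subset = [n for n in names if mean_jsd[n] <= cutoff]
    return subset, float(np.mean([model_probs[n][1] for n in subset]))

print(subset_ensemble({"llm_a": [0.2, 0.8], "llm_b": [0.25, 0.75], "llm_c": [0.7, 0.3]}))
```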

{"title":"Simple Yet Effective: An Information-Theoretic Approach to Multi-LLM Uncertainty Quantification.","authors":"Maya Kruse, Majid Afshar, Saksham Khatwani, Anoop Mayampurath, Guanhua Chen, Yanjun Gao","doi":"10.18653/v1/2025.emnlp-main.1551","DOIUrl":"10.18653/v1/2025.emnlp-main.1551","url":null,"abstract":"<p><p>Large language models (LLMs) often behave inconsistently across inputs, indicating uncertainty and motivating the need for its quantification in high-stakes settings. Prior work on calibration and uncertainty quantification often focuses on individual models, overlooking the potential of model diversity. We hypothesize that LLMs make complementary predictions due to differences in training and the Zipfian nature of language, and that aggregating their outputs leads to more reliable uncertainty estimates. To leverage this, we propose MUSE (Multi-LLM Uncertainty via Subset Ensembles), a simple information-theoretic method that uses Jensen-Shannon Divergence to identify and aggregate well-calibrated subsets of LLMs. Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naïve ensemble baselines. In addition, we explore using MUSE as guided signals with chain-of-thought distillation to fine-tune LLMs for calibration. MUSE is available at:https://github.com/LARK-NLP-Lab/MUSE.</p>","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2025 ","pages":"30481-30492"},"PeriodicalIF":0.0,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12702469/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145764919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Aspect-Oriented Summarization for Psychiatric Short-Term Readmission Prediction.
WonJin Yoon, Boyu Ren, Spencer Thomas, Chanhwi Kim, Guergana Savova, Mei-Hua Hall, Timothy Miller

Recent progress in large language models (LLMs) has enabled the automated processing of lengthy documents even without supervised training on a task-specific dataset. Yet, their zero-shot performance in complex tasks as opposed to straightforward information extraction tasks remains suboptimal. One feasible approach for tasks with lengthy, complex input is to first summarize the document and then apply supervised fine-tuning to the summary. However, the summarization process inevitably results in some loss of information. In this study we present a method for processing the summaries of long documents aimed to capture different important aspects of the original document. We hypothesize that LLM summaries generated with different aspect-oriented prompts contain different information signals, and we propose methods to measure these differences. We introduce approaches to effectively integrate signals from these different summaries for supervised training of transformer models. We validate our hypotheses on a high-impact task - 30-day readmission prediction from a psychiatric discharge - using real-world data from four hospitals, and show that our proposed method increases the prediction performance for the complex task of predicting patient outcome.
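One way to picture the "different aspect-oriented prompts, one feature vector" idea is the sketch below, which concatenates embeddings of several aspect summaries into a single input for a downstream classifier. The aspect names and the `summarize`/`encode` callables are assumptions for illustration; the paper's prompts and integration methods may differ.

```python
ASPECT_PROMPTS = {
    "symptoms":  "Summarize this discharge note, focusing on presenting symptoms.",
    "treatment": "Summarize this discharge note, focusing on treatments provided.",
    "risk":      "Summarize this discharge note, focusing on readmission risk factors.",
}

def aspect_feature_vector(note, summarize, encode):
    """Concatenate embeddings of aspect-oriented summaries into one vector.

    summarize(prompt, note) -> str    # placeholder LLM summarizer
    encode(text) -> list[float]       # placeholder text encoder
    The resulting vector can be fed to any supervised readmission classifier.
    """
    features = []
    for prompt in ASPECT_PROMPTS.values():
        features.extend(encode(summarize(prompt, note)))
    return features
```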

{"title":"Aspect-Oriented Summarization for Psychiatric Short-Term Readmission Prediction.","authors":"WonJin Yoon, Boyu Ren, Spencer Thomas, Chanhwi Kim, Guergana Savova, Mei-Hua Hall, Timothy Miller","doi":"10.18653/v1/2025.emnlp-main.1423","DOIUrl":"10.18653/v1/2025.emnlp-main.1423","url":null,"abstract":"<p><p>Recent progress in large language models (LLMs) has enabled the automated processing of lengthy documents even without supervised training on a task-specific dataset. Yet, their zero-shot performance in complex tasks as opposed to straightforward information extraction tasks remains suboptimal. One feasible approach for tasks with lengthy, complex input is to first summarize the document and then apply supervised fine-tuning to the summary. However, the summarization process inevitably results in some loss of information. In this study we present a method for processing the summaries of long documents aimed to capture different important aspects of the original document. We hypothesize that LLM summaries generated with different aspect-oriented prompts contain different <i>information signals</i>, and we propose methods to measure these differences. We introduce approaches to effectively integrate signals from these different summaries for supervised training of transformer models. We validate our hypotheses on a high-impact task - 30-day readmission prediction from a psychiatric discharge - using real-world data from four hospitals, and show that our proposed method increases the prediction performance for the complex task of predicting patient outcome.</p>","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2025 ","pages":"28037-28054"},"PeriodicalIF":0.0,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12834244/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
LEAF: Learning and Evaluation Augmented by Fact-Checking to Improve Factualness in Large Language Models.
Hieu Tran, Junda Wang, Yujan Ting, Hong Yu, Weijing Huang, Terrence Chen

Large language models (LLMs) often struggle with factual accuracy in knowledge-intensive domains like healthcare. We introduce LEAF (Learning and Evaluation Augmented by Fact-Checking), a framework for improving LLM factuality in medical question answering. LEAF comprises three components: (1) RAFE, a robust fact-checking system using open-source LLMs and domain-specific retrieval to evaluate response accuracy; (2) Fact-Check-then-RAG, which leverages fact-checking results to guide retrieval without parameter updates; and (3) Learning from Fact Check, enabling self-training through supervised fine-tuning or preference-based learning using fact-checking as pseudo-labels. Experimental results show that RAFE outperforms Factcheck-GPT in detecting inaccuracies, Fact-Check-then-RAG effectively corrects errors, and Learning from Fact Check improves performance without labeled data. In a real-world healthcare deployment with proprietary medical documents, LEAF achieved an 83% improvement in factuality scores, demonstrating practical applicability for adapting general-purpose LLMs to organization-specific knowledge. Our framework provides a scalable solution for industrial applications requiring high factual accuracy.
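A bare-bones version of the Fact-Check-then-RAG control flow might look like the sketch below: check a draft answer claim by claim, retrieve evidence only for the claims that fail, and regenerate. All four callables are assumed interfaces standing in for the components named in the abstract (claim splitting, RAFE-style checking, retrieval, regeneration), not the authors' implementation.

```python
def fact_check_then_rag(question, draft_answer,
                        split_claims, check_claim, retrieve, regenerate):
    """Re-answer only when fact-checking flags unsupported claims.

    split_claims(text) -> list[str]            # placeholder claim extractor
    check_claim(claim) -> bool                 # placeholder fact-checker (RAFE-like)
    retrieve(claim) -> list[str]               # placeholder domain retriever
    regenerate(question, draft, docs) -> str   # placeholder grounded generator
    """
    failed = [c for c in split_claims(draft_answer) if not check_claim(c)]
    if not failed:
        return draft_answer                    # draft already passes the fact check
    evidence = [doc for claim in failed for doc in retrieve(claim)]
    return regenerate(question, draft_answer, evidence)
```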

{"title":"LEAF: Learning and Evaluation Augmented by Fact-Checking to Improve Factualness in Large Language Models.","authors":"Hieu Tran, Junda Wang, Yujan Ting, Hong Yu, Weijing Huang, Terrence Chen","doi":"10.18653/v1/2025.emnlp-industry.23","DOIUrl":"https://doi.org/10.18653/v1/2025.emnlp-industry.23","url":null,"abstract":"<p><p>Large language models (LLMs) often struggle with factual accuracy in knowledge-intensive domains like healthcare. We introduce LEAF (Learning and Evaluation Augmented by Fact-Checking), a framework for improving LLM factuality in medical question answering. LEAF comprises three components: (1) <b>RAFE</b>, a robust fact-checking system using open-source LLMs and domain-specific retrieval to evaluate response accuracy; (2) <b>Fact-Check-then-RAG</b>, which leverages fact-checking results to guide retrieval without parameter updates; and (3) <b>Learning from Fact Check</b>, enabling self-training through supervised fine-tuning or preference-based learning using fact-checking as pseudo-labels. Experimental results show that RAFE outperforms Factcheck-GPT in detecting inaccuracies, Fact-Check-then-RAG effectively corrects errors, and Learning from Fact Check improves performance without labeled data. In a real-world healthcare deployment with proprietary medical documents, LEAF achieved an 83% improvement in factuality scores, demonstrating practical applicability for adapting general-purpose LLMs to organization-specific knowledge. Our framework provides a scalable solution for industrial applications requiring high factual accuracy.</p>","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2025 Industry Track","pages":"338-363"},"PeriodicalIF":0.0,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12878983/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146144888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
SYNFAC-EDIT: Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization.
Prakamya Mishra, Zonghai Yao, Parth Vashisht, Feiyun Ouyang, Beining Wang, Vidhi Dhaval Mody, Hong Yu
{"title":"SYNFAC-EDIT: Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization.","authors":"Prakamya Mishra, Zonghai Yao, Parth Vashisht, Feiyun Ouyang, Beining Wang, Vidhi Dhaval Mody, Hong Yu","doi":"10.18653/v1/2024.emnlp-main.1120","DOIUrl":"10.18653/v1/2024.emnlp-main.1120","url":null,"abstract":"","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2024 ","pages":"20061-20083"},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12854549/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146108625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
EHRAgent: Code Empowers Large Language Models for Few-shot Complex Tabular Reasoning on Electronic Health Records.
Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce Ho, Carl Yang, May D Wang

Clinicians often rely on data engineers to retrieve complex patient information from electronic health record (EHR) systems, a process that is both inefficient and time-consuming. We propose EHRAgent, a large language model (LLM) agent empowered with accumulative domain knowledge and robust coding capability. EHRAgent enables autonomous code generation and execution to facilitate clinicians in directly interacting with EHRs using natural language. Specifically, we formulate a multi-tabular reasoning task based on EHRs as a tool-use planning process, efficiently decomposing a complex task into a sequence of manageable actions with external toolsets. We first inject relevant medical information to enable EHRAgent to effectively reason about the given query, identifying and extracting the required records from the appropriate tables. By integrating interactive coding and execution feedback, EHRAgent then effectively learns from error messages and iteratively improves its originally generated code. Experiments on three real-world EHR datasets show that EHRAgent outperforms the strongest baseline by up to 29.6% in success rate, verifying its strong capacity to tackle complex clinical tasks with minimal demonstrations.
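The generate-execute-refine loop described above can be sketched in a few lines; `ask_llm` is an assumed placeholder for any code-generating model call, and a real deployment would sandbox the execution rather than call `exec` directly.

```python
import traceback

def solve_with_feedback(ask_llm, task, max_rounds=3):
    """Generate code for an EHR question, run it, and feed errors back.

    ask_llm(prompt) -> str : placeholder for the code-generating LLM; the
    generated snippet is expected to assign its result to a variable `answer`.
    """
    prompt = f"Write Python that answers this question over the EHR tables:\n{task}"
    for _ in range(max_rounds):
        code = ask_llm(prompt)
        scope = {}
        try:
            exec(code, scope)                      # sandbox this in practice
            return scope.get("answer")
        except Exception:
            error = traceback.format_exc()
            prompt = (f"The previous code failed.\nCode:\n{code}\n"
                      f"Error:\n{error}\nPlease return a corrected version.")
    return None
```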

{"title":"EHRAgent: Code Empowers Large Language Models for Few-shot Complex Tabular Reasoning on Electronic Health Records.","authors":"Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce Ho, Carl Yang, May D Wang","doi":"10.18653/v1/2024.emnlp-main.1245","DOIUrl":"10.18653/v1/2024.emnlp-main.1245","url":null,"abstract":"<p><p>Clinicians often rely on data engineers to retrieve complex patient information from electronic health record (EHR) systems, a process that is both inefficient and time-consuming. We propose EHRAgent, a large language model (LLM) agent empowered with accumulative domain knowledge and robust coding capability. EHRAgent enables autonomous code generation and execution to facilitate clinicians in directly interacting with EHRs using natural language. Specifically, we formulate a multi-tabular reasoning task based on EHRs as a tool-use planning process, efficiently decomposing a complex task into a sequence of manageable actions with external toolsets. We first inject relevant medical information to enable EHRAgent to effectively reason about the given query, identifying and extracting the required records from the appropriate tables. By integrating interactive coding and execution feedback, EHRAgent then effectively learns from error messages and iteratively improves its originally generated code. Experiments on three real-world EHR datasets show that EHRAgent outperforms the strongest baseline by up to 29.6% in success rate, verifying its strong capacity to tackle complex clinical tasks with minimal demonstrations.</p>","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2024 ","pages":"22315-22339"},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11867733/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143525484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
APPLS: Evaluating Evaluation Metrics for Plain Language Summarization.
Yue Guo, Tal August, Gondy Leroy, Trevor Cohen, Lucy Lu Wang

While there has been significant development of models for Plain Language Summarization (PLS), evaluation remains a challenge. PLS lacks a dedicated assessment metric, and the suitability of text generation evaluation metrics is unclear due to the unique transformations involved (e.g., adding background explanations, removing jargon). To address these questions, our study introduces a granular meta-evaluation testbed, APPLS, designed to evaluate metrics for PLS. We identify four PLS criteria from previous work (informativeness, simplification, coherence, and faithfulness) and define a set of perturbations corresponding to these criteria that sensitive metrics should be able to detect. We apply these perturbations to the texts of two PLS datasets to create our testbed. Using APPLS, we assess performance of 14 metrics, including automated scores, lexical features, and LLM prompt-based evaluations. Our analysis reveals that while some current metrics show sensitivity to specific criteria, no single method captures all four criteria simultaneously. We therefore recommend a suite of automated metrics be used to capture PLS quality along all relevant criteria. This work contributes the first meta-evaluation testbed for PLS and a comprehensive evaluation of existing metrics.
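The perturbation-based meta-evaluation can be pictured with a toy example: apply a degradation to reference texts and count how often a metric's score actually drops. The sentence-deletion perturbation and the length-based metric below are deliberately simplistic stand-ins for the paper's perturbations and the 14 evaluated metrics.

```python
import random

def drop_sentence(text, rng=random.Random(0)):
    """Toy informativeness perturbation: delete one sentence."""
    sents = [s for s in text.split(". ") if s]
    if len(sents) > 1:
        sents.pop(rng.randrange(len(sents)))
    return ". ".join(sents)

def sensitivity(metric, texts, perturb=drop_sentence):
    """Fraction of texts where the metric penalizes the perturbed version."""
    return sum(metric(perturb(t)) < metric(t) for t in texts) / len(texts)

length_metric = lambda t: len(t.split())   # toy metric: longer = better
docs = ["Aspirin lowers fever. Take it with food. Ask a doctor if fever persists."]
print(sensitivity(length_metric, docs))    # 1.0: this metric detects the deletion
```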

{"title":"APPLS: Evaluating Evaluation Metrics for Plain Language Summarization.","authors":"Yue Guo, Tal August, Gondy Leroy, Trevor Cohen, Lucy Lu Wang","doi":"10.18653/v1/2024.emnlp-main.519","DOIUrl":"10.18653/v1/2024.emnlp-main.519","url":null,"abstract":"<p><p>While there has been significant development of models for Plain Language Summarization (PLS), evaluation remains a challenge. PLS lacks a dedicated assessment metric, and the suitability of text generation evaluation metrics is unclear due to the unique transformations involved (e.g., adding background explanations, removing jargon). To address these questions, our study introduces a granular meta-evaluation testbed, APPLS, designed to evaluate metrics for PLS. We identify four PLS criteria from previous work-informativeness, simplification, coherence, and faithfulness-and define a set of perturbations corresponding to these criteria that sensitive metrics should be able to detect. We apply these perturbations to the texts of two PLS datasets to create our testbed. Using APPLS, we assess performance of 14 metrics, including automated scores, lexical features, and LLM prompt-based evaluations. Our analysis reveals that while some current metrics show sensitivity to specific criteria, no single method captures all four criteria simultaneously. We therefore recommend a suite of automated metrics be used to capture PLS quality along all relevant criteria. This work contributes the first meta-evaluation testbed for PLS and a comprehensive evaluation of existing metrics.</p>","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2024 ","pages":"9194-9211"},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11938995/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143722841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
ReadMe++: Benchmarking Multilingual Language Models for Multi-Domain Readability Assessment.
Tarek Naous, Michael J Ryan, Anton Lavrouk, Mohit Chandra, Wei Xu

We present a comprehensive evaluation of large language models for multilingual readability assessment. Existing evaluation resources lack domain and language diversity, limiting the ability for cross-domain and cross-lingual analyses. This paper introduces ReadMe++, a multilingual multi-domain dataset with human annotations of 9757 sentences in Arabic, English, French, Hindi, and Russian, collected from 112 different data sources. This benchmark will encourage research on developing robust multilingual readability assessment methods. Using ReadMe++, we benchmark multilingual and monolingual language models in the supervised, unsupervised, and few-shot prompting settings. The domain and language diversity in ReadMe++ enable us to test more effective few-shot prompting, and identify shortcomings in state-of-the-art unsupervised methods. Our experiments also reveal exciting results of superior domain generalization and enhanced cross-lingual transfer capabilities by models trained on ReadMe++. We will make our data publicly available and release a python package tool for multilingual sentence readability prediction using our trained models at: https://github.com/tareknaous/readme.
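As a sketch of the few-shot prompting setting mentioned above, the snippet below asks a model to rate sentence readability and parses a numeric score; the 1-6 scale, the exemplars, and the `llm` callable are illustrative assumptions rather than the benchmark's actual protocol.

```python
FEW_SHOT_PROMPT = """Rate the readability of the sentence from 1 (very easy) to 6 (very hard).
Sentence: The cat slept on the mat. Rating: 1
Sentence: Pharmacokinetic attenuation confounded the longitudinal bioassay. Rating: 6
Sentence: {sentence} Rating:"""

def rate_readability(llm, sentence):
    """Few-shot readability rating; llm(prompt) -> str is a placeholder model call."""
    reply = llm(FEW_SHOT_PROMPT.format(sentence=sentence))
    digits = [ch for ch in reply if ch.isdigit()]
    return int(digits[0]) if digits else None
```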

{"title":"ReadMe++: Benchmarking Multilingual Language Models for Multi-Domain Readability Assessment.","authors":"Tarek Naous, Michael J Ryan, Anton Lavrouk, Mohit Chandra, Wei Xu","doi":"10.18653/v1/2024.emnlp-main.682","DOIUrl":"10.18653/v1/2024.emnlp-main.682","url":null,"abstract":"<p><p>We present a comprehensive evaluation of large language models for multilingual readability assessment. Existing evaluation resources lack domain and language diversity, limiting the ability for cross-domain and cross-lingual analyses. This paper introduces ReadMe++, a multilingual multi-domain dataset with human annotations of 9757 sentences in Arabic, English, French, Hindi, and Russian, collected from 112 different data sources. This benchmark will encourage research on developing robust multilingual readability assessment methods. Using ReadMe++, we benchmark multilingual and monolingual language models in the supervised, unsupervised, and few-shot prompting settings. The domain and language diversity in ReadMe++ enable us to test more effective few-shot prompting, and identify shortcomings in state-of-the-art unsupervised methods. Our experiments also reveal exciting results of superior domain generalization and enhanced cross-lingual transfer capabilities by models trained on ReadMe++. We will make our data publicly available and release a python package tool for multilingual sentence readability prediction using our trained models at: https://github.com/tareknaous/readme.</p>","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2024 ","pages":"12230-12266"},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12225862/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144562286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Improving Minimum Bayes Risk Decoding with Multi-Prompt.
David Heineman, Yao Dou, Wei Xu

While instruction fine-tuned LLMs are effective text generators, sensitivity to prompt construction makes performance unstable and sub-optimal in practice. Relying on a single 'best' prompt cannot capture all differing approaches to a generation problem. Using this observation, we propose multi-prompt decoding, where many candidate generations are decoded from a prompt bank at inference-time. To ensemble candidates, we use Minimum Bayes Risk (MBR) decoding, which selects a final output using a trained value metric. We show multi-prompt improves MBR across a comprehensive set of conditional generation tasks (Figure 1), and show this is a result of estimating a more diverse and higher quality candidate space than that of a single prompt. Further experiments confirm multi-prompt improves generation across tasks, models and metrics.
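The selection step is easy to sketch: pool candidates decoded from several prompts, then pick the one with the highest average utility against the rest. The token-overlap F1 below is a toy stand-in for the trained value metric used in the paper.

```python
from collections import Counter

def overlap_f1(hyp, ref):
    """Token-overlap F1, a toy utility in place of a trained value metric."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    common = sum((h & r).values())
    if common == 0:
        return 0.0
    precision, recall = common / sum(h.values()), common / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def mbr_select(candidates, utility=overlap_f1):
    """Minimum Bayes Risk selection: maximize expected utility over the pool."""
    def expected_utility(cand):
        others = [o for o in candidates if o is not cand]
        return sum(utility(cand, o) for o in others) / max(len(others), 1)
    return max(candidates, key=expected_utility)

# Candidates pooled from different prompts (hard-coded here for illustration).
pool = ["the cat sat on the mat", "a cat sat on a mat", "dogs ran across the park"]
print(mbr_select(pool))
```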

{"title":"Improving Minimum Bayes Risk Decoding with Multi-Prompt.","authors":"David Heineman, Yao Dou, Wei Xu","doi":"10.18653/v1/2024.emnlp-main.1255","DOIUrl":"10.18653/v1/2024.emnlp-main.1255","url":null,"abstract":"<p><p>While instruction fine-tuned LLMs are effective text generators, sensitivity to prompt construction makes performance unstable and sub-optimal in practice. Relying on a single 'best' prompt cannot capture all differing approaches to a generation problem. Using this observation, we propose <i>multi-prompt</i> decoding, where many candidate generations are decoded from a prompt bank at inference-time. To ensemble candidates, we use Minimum Bayes Risk (MBR) decoding, which selects a final output using a trained value metric. We show multi-prompt improves MBR across a comprehensive set of conditional generation tasks (Figure 1), and show this is a result of estimating a more diverse and higher quality candidate space than that of a single prompt. Further experiments confirm multi-prompt improves generation across tasks, models and metrics.</p>","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2024 ","pages":"22525-22545"},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12226151/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144562284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0