
Latest publications: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)

Simple Yet Effective: An Information-Theoretic Approach to Multi-LLM Uncertainty Quantification.
Maya Kruse, Majid Afshar, Saksham Khatwani, Anoop Mayampurath, Guanhua Chen, Yanjun Gao

Large language models (LLMs) often behave inconsistently across inputs, indicating uncertainty and motivating the need for its quantification in high-stakes settings. Prior work on calibration and uncertainty quantification often focuses on individual models, overlooking the potential of model diversity. We hypothesize that LLMs make complementary predictions due to differences in training and the Zipfian nature of language, and that aggregating their outputs leads to more reliable uncertainty estimates. To leverage this, we propose MUSE (Multi-LLM Uncertainty via Subset Ensembles), a simple information-theoretic method that uses Jensen-Shannon Divergence to identify and aggregate well-calibrated subsets of LLMs. Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naïve ensemble baselines. In addition, we explore using MUSE as guided signals with chain-of-thought distillation to fine-tune LLMs for calibration. MUSE is available at: https://github.com/LARK-NLP-Lab/MUSE.
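The abstract names the core quantity (Jensen-Shannon Divergence between model output distributions) but not the exact selection rule. The sketch below computes pairwise JSD over binary prediction distributions and averages the lowest-divergence subset; the threshold and the max-pairwise-JSD selection criterion are illustrative assumptions, not the authors' procedure.

```python
import numpy as np
from itertools import combinations

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def muse_like_ensemble(model_probs, max_jsd=0.1):
    """Average the subset of models whose pairwise JSD stays below a threshold.

    model_probs: dict mapping model name -> [P(y=0), P(y=1)] for one input.
    Falls back to all models if no subset satisfies the threshold.
    """
    names = list(model_probs)
    best_subset, best_score = names, float("inf")
    for k in range(2, len(names) + 1):
        for subset in combinations(names, k):
            score = max(js_divergence(model_probs[a], model_probs[b])
                        for a, b in combinations(subset, 2))
            if score <= max_jsd and score < best_score:
                best_subset, best_score = list(subset), score
    aggregated = np.mean([model_probs[n] for n in best_subset], axis=0)
    return best_subset, aggregated

# Example: three hypothetical models scoring one binary question.
probs = {"model_a": [0.2, 0.8], "model_b": [0.25, 0.75], "model_c": [0.7, 0.3]}
print(muse_like_ensemble(probs))  # selects the two agreeing models and averages them
```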

{"title":"Simple Yet Effective: An Information-Theoretic Approach to Multi-LLM Uncertainty Quantification.","authors":"Maya Kruse, Majid Afshar, Saksham Khatwani, Anoop Mayampurath, Guanhua Chen, Yanjun Gao","doi":"10.18653/v1/2025.emnlp-main.1551","DOIUrl":"10.18653/v1/2025.emnlp-main.1551","url":null,"abstract":"<p><p>Large language models (LLMs) often behave inconsistently across inputs, indicating uncertainty and motivating the need for its quantification in high-stakes settings. Prior work on calibration and uncertainty quantification often focuses on individual models, overlooking the potential of model diversity. We hypothesize that LLMs make complementary predictions due to differences in training and the Zipfian nature of language, and that aggregating their outputs leads to more reliable uncertainty estimates. To leverage this, we propose MUSE (Multi-LLM Uncertainty via Subset Ensembles), a simple information-theoretic method that uses Jensen-Shannon Divergence to identify and aggregate well-calibrated subsets of LLMs. Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naïve ensemble baselines. In addition, we explore using MUSE as guided signals with chain-of-thought distillation to fine-tune LLMs for calibration. MUSE is available at:https://github.com/LARK-NLP-Lab/MUSE.</p>","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2025 ","pages":"30481-30492"},"PeriodicalIF":0.0,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12702469/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145764919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
EHRAgent: Code Empowers Large Language Models for Few-shot Complex Tabular Reasoning on Electronic Health Records.
Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce Ho, Carl Yang, May D Wang

Clinicians often rely on data engineers to retrieve complex patient information from electronic health record (EHR) systems, a process that is both inefficient and time-consuming. We propose EHRAgent, a large language model (LLM) agent empowered with accumulative domain knowledge and robust coding capability. EHRAgent enables autonomous code generation and execution to facilitate clinicians in directly interacting with EHRs using natural language. Specifically, we formulate a multi-tabular reasoning task based on EHRs as a tool-use planning process, efficiently decomposing a complex task into a sequence of manageable actions with external toolsets. We first inject relevant medical information to enable EHRAgent to effectively reason about the given query, identifying and extracting the required records from the appropriate tables. By integrating interactive coding and execution feedback, EHRAgent then effectively learns from error messages and iteratively improves its originally generated code. Experiments on three real-world EHR datasets show that EHRAgent outperforms the strongest baseline by up to 29.6% in success rate, verifying its strong capacity to tackle complex clinical tasks with minimal demonstrations.
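The abstract describes a loop in which generated code is executed and error messages are fed back for revision. The sketch below shows only that control flow; the `generate_code` callable stands in for the LLM call (with injected schema and medical knowledge) and is not EHRAgent's actual prompt, toolset, or code.

```python
import traceback

def ehr_agent_like_loop(question, tables, generate_code, max_iters=5):
    """Generate-execute-refine loop in the spirit of EHRAgent.

    generate_code(question, feedback) is an assumed stand-in for the LLM call
    that writes Python against the EHR tables.
    """
    feedback = None
    for _ in range(max_iters):
        code = generate_code(question, feedback)
        namespace = {"tables": tables}            # toolset exposed to the generated code
        try:
            exec(code, namespace)                 # generated code is expected to set `answer`
            return namespace["answer"]
        except Exception:
            feedback = traceback.format_exc()     # execution error fed back to the model
    return None

# Toy usage with a fixed "generator" that ignores feedback.
tables = {"patients": [{"id": 1, "age": 74}, {"id": 2, "age": 81}]}
fake_generator = lambda q, fb: "answer = sum(p['age'] > 75 for p in tables['patients'])"
print(ehr_agent_like_loop("How many patients are older than 75?", tables, fake_generator))
```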

{"title":"EHRAgent: Code Empowers Large Language Models for Few-shot Complex Tabular Reasoning on Electronic Health Records.","authors":"Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce Ho, Carl Yang, May D Wang","doi":"10.18653/v1/2024.emnlp-main.1245","DOIUrl":"10.18653/v1/2024.emnlp-main.1245","url":null,"abstract":"<p><p>Clinicians often rely on data engineers to retrieve complex patient information from electronic health record (EHR) systems, a process that is both inefficient and time-consuming. We propose EHRAgent, a large language model (LLM) agent empowered with accumulative domain knowledge and robust coding capability. EHRAgent enables autonomous code generation and execution to facilitate clinicians in directly interacting with EHRs using natural language. Specifically, we formulate a multi-tabular reasoning task based on EHRs as a tool-use planning process, efficiently decomposing a complex task into a sequence of manageable actions with external toolsets. We first inject relevant medical information to enable EHRAgent to effectively reason about the given query, identifying and extracting the required records from the appropriate tables. By integrating interactive coding and execution feedback, EHRAgent then effectively learns from error messages and iteratively improves its originally generated code. Experiments on three real-world EHR datasets show that EHRAgent outperforms the strongest baseline by up to 29.6% in success rate, verifying its strong capacity to tackle complex clinical tasks with minimal demonstrations.</p>","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2024 ","pages":"22315-22339"},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11867733/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143525484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
APPLS: Evaluating Evaluation Metrics for Plain Language Summarization.
Yue Guo, Tal August, Gondy Leroy, Trevor Cohen, Lucy Lu Wang

While there has been significant development of models for Plain Language Summarization (PLS), evaluation remains a challenge. PLS lacks a dedicated assessment metric, and the suitability of text generation evaluation metrics is unclear due to the unique transformations involved (e.g., adding background explanations, removing jargon). To address these questions, our study introduces a granular meta-evaluation testbed, APPLS, designed to evaluate metrics for PLS. We identify four PLS criteria from previous work (informativeness, simplification, coherence, and faithfulness) and define a set of perturbations corresponding to these criteria that sensitive metrics should be able to detect. We apply these perturbations to the texts of two PLS datasets to create our testbed. Using APPLS, we assess the performance of 14 metrics, including automated scores, lexical features, and LLM prompt-based evaluations. Our analysis reveals that while some current metrics show sensitivity to specific criteria, no single method captures all four criteria simultaneously. We therefore recommend that a suite of automated metrics be used to capture PLS quality along all relevant criteria. This work contributes the first meta-evaluation testbed for PLS and a comprehensive evaluation of existing metrics.
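The core check in such a testbed is whether a metric's score drops when a criterion-violating perturbation is applied. The tiny sketch below illustrates that check with toy perturbations and a toy metric; these are stand-ins, not the actual APPLS perturbation set or any of the 14 evaluated metrics.

```python
def drop_background_sentence(summary: str) -> str:
    """Toy informativeness perturbation: delete the last sentence."""
    sentences = [s for s in summary.split(". ") if s]
    return ". ".join(sentences[:-1])

def inject_jargon(summary: str) -> str:
    """Toy simplification perturbation: swap a lay term for a technical one."""
    return summary.replace("heart attack", "myocardial infarction")

def metric_is_sensitive(metric, original: str, perturbed: str) -> bool:
    """A sensitive metric should score the perturbed summary lower."""
    return metric(perturbed) < metric(original)

# Toy metric: longer summaries with fewer very long words score higher.
toy_metric = lambda text: len(text.split()) - 2 * sum(len(w) > 12 for w in text.split())

original = "Aspirin lowers the risk of a heart attack. It should be taken with food."
for perturb in (drop_background_sentence, inject_jargon):
    print(perturb.__name__, metric_is_sensitive(toy_metric, original, perturb(original)))
```

In this toy run the metric flags the deleted sentence but misses the injected jargon, which mirrors the paper's observation that no single metric covers all four criteria.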

{"title":"APPLS: Evaluating Evaluation Metrics for Plain Language Summarization.","authors":"Yue Guo, Tal August, Gondy Leroy, Trevor Cohen, Lucy Lu Wang","doi":"10.18653/v1/2024.emnlp-main.519","DOIUrl":"10.18653/v1/2024.emnlp-main.519","url":null,"abstract":"<p><p>While there has been significant development of models for Plain Language Summarization (PLS), evaluation remains a challenge. PLS lacks a dedicated assessment metric, and the suitability of text generation evaluation metrics is unclear due to the unique transformations involved (e.g., adding background explanations, removing jargon). To address these questions, our study introduces a granular meta-evaluation testbed, APPLS, designed to evaluate metrics for PLS. We identify four PLS criteria from previous work-informativeness, simplification, coherence, and faithfulness-and define a set of perturbations corresponding to these criteria that sensitive metrics should be able to detect. We apply these perturbations to the texts of two PLS datasets to create our testbed. Using APPLS, we assess performance of 14 metrics, including automated scores, lexical features, and LLM prompt-based evaluations. Our analysis reveals that while some current metrics show sensitivity to specific criteria, no single method captures all four criteria simultaneously. We therefore recommend a suite of automated metrics be used to capture PLS quality along all relevant criteria. This work contributes the first meta-evaluation testbed for PLS and a comprehensive evaluation of existing metrics.</p>","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2024 ","pages":"9194-9211"},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11938995/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143722841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
ReadMe++: Benchmarking Multilingual Language Models for Multi-Domain Readability Assessment.
Tarek Naous, Michael J Ryan, Anton Lavrouk, Mohit Chandra, Wei Xu

We present a comprehensive evaluation of large language models for multilingual readability assessment. Existing evaluation resources lack domain and language diversity, limiting the ability to perform cross-domain and cross-lingual analyses. This paper introduces ReadMe++, a multilingual multi-domain dataset with human annotations of 9757 sentences in Arabic, English, French, Hindi, and Russian, collected from 112 different data sources. This benchmark will encourage research on developing robust multilingual readability assessment methods. Using ReadMe++, we benchmark multilingual and monolingual language models in the supervised, unsupervised, and few-shot prompting settings. The domain and language diversity in ReadMe++ enables us to test more effective few-shot prompting and to identify shortcomings in state-of-the-art unsupervised methods. Our experiments also reveal superior domain generalization and enhanced cross-lingual transfer capabilities for models trained on ReadMe++. We will make our data publicly available and release a Python package for multilingual sentence readability prediction using our trained models at: https://github.com/tareknaous/readme.

{"title":"ReadMe++: Benchmarking Multilingual Language Models for Multi-Domain Readability Assessment.","authors":"Tarek Naous, Michael J Ryan, Anton Lavrouk, Mohit Chandra, Wei Xu","doi":"10.18653/v1/2024.emnlp-main.682","DOIUrl":"10.18653/v1/2024.emnlp-main.682","url":null,"abstract":"<p><p>We present a comprehensive evaluation of large language models for multilingual readability assessment. Existing evaluation resources lack domain and language diversity, limiting the ability for cross-domain and cross-lingual analyses. This paper introduces ReadMe++, a multilingual multi-domain dataset with human annotations of 9757 sentences in Arabic, English, French, Hindi, and Russian, collected from 112 different data sources. This benchmark will encourage research on developing robust multilingual readability assessment methods. Using ReadMe++, we benchmark multilingual and monolingual language models in the supervised, unsupervised, and few-shot prompting settings. The domain and language diversity in ReadMe++ enable us to test more effective few-shot prompting, and identify shortcomings in state-of-the-art unsupervised methods. Our experiments also reveal exciting results of superior domain generalization and enhanced cross-lingual transfer capabilities by models trained on ReadMe++. We will make our data publicly available and release a python package tool for multilingual sentence readability prediction using our trained models at: https://github.com/tareknaous/readme.</p>","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2024 ","pages":"12230-12266"},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12225862/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144562286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Improving Minimum Bayes Risk Decoding with Multi-Prompt.
David Heineman, Yao Dou, Wei Xu

While instruction fine-tuned LLMs are effective text generators, sensitivity to prompt construction makes performance unstable and sub-optimal in practice. Relying on a single 'best' prompt cannot capture all differing approaches to a generation problem. Using this observation, we propose multi-prompt decoding, where many candidate generations are decoded from a prompt bank at inference-time. To ensemble candidates, we use Minimum Bayes Risk (MBR) decoding, which selects a final output using a trained value metric. We show multi-prompt improves MBR across a comprehensive set of conditional generation tasks (Figure 1), and show this is a result of estimating a more diverse and higher quality candidate space than that of a single prompt. Further experiments confirm multi-prompt improves generation across tasks, models and metrics.
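MBR decoding selects the candidate with the highest average utility against the rest of the candidate pool, which in multi-prompt decoding is drawn from several different prompts. A minimal sketch follows; the token-overlap utility is a simple stand-in for the trained value metric used in the paper, and the candidate strings are placeholders.

```python
def token_f1(hypothesis: str, reference: str) -> float:
    """Toy utility: token-level F1 overlap (stand-in for a trained value metric)."""
    h, r = hypothesis.lower().split(), reference.lower().split()
    if not h or not r:
        return 0.0
    common = sum(min(h.count(t), r.count(t)) for t in set(h))
    if common == 0:
        return 0.0
    precision, recall = common / len(h), common / len(r)
    return 2 * precision * recall / (precision + recall)

def mbr_select(candidates, utility=token_f1):
    """Return the candidate with the highest average utility against all others."""
    def expected_utility(c):
        others = [o for o in candidates if o is not c]
        return sum(utility(c, o) for o in others) / max(len(others), 1)
    return max(candidates, key=expected_utility)

# Candidates decoded from several different prompts for the same input.
candidates = [
    "The cat sat on the mat.",
    "A cat was sitting on the mat.",
    "The dog barked loudly.",
]
print(mbr_select(candidates))  # the outlier candidate gets a low expected utility
```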

{"title":"Improving Minimum Bayes Risk Decoding with Multi-Prompt.","authors":"David Heineman, Yao Dou, Wei Xu","doi":"10.18653/v1/2024.emnlp-main.1255","DOIUrl":"10.18653/v1/2024.emnlp-main.1255","url":null,"abstract":"<p><p>While instruction fine-tuned LLMs are effective text generators, sensitivity to prompt construction makes performance unstable and sub-optimal in practice. Relying on a single 'best' prompt cannot capture all differing approaches to a generation problem. Using this observation, we propose <i>multi-prompt</i> decoding, where many candidate generations are decoded from a prompt bank at inference-time. To ensemble candidates, we use Minimum Bayes Risk (MBR) decoding, which selects a final output using a trained value metric. We show multi-prompt improves MBR across a comprehensive set of conditional generation tasks (Figure 1), and show this is a result of estimating a more diverse and higher quality candidate space than that of a single prompt. Further experiments confirm multi-prompt improves generation across tasks, models and metrics.</p>","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2024 ","pages":"22525-22545"},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12226151/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144562284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
MedReadMe: A Systematic Study for Fine-grained Sentence Readability in Medical Domain.
Chao Jiang, Wei Xu

Medical texts are notoriously challenging to read. Properly measuring their readability is the first step towards making them more accessible. In this paper, we present a systematic study on fine-grained readability measurements in the medical domain at both sentence-level and span-level. We introduce a new dataset MedReadMe, which consists of manually annotated readability ratings and fine-grained complex span annotation for 4,520 sentences, featuring two novel "Google-Easy" and "Google-Hard" categories. It supports our quantitative analysis, which covers 650 linguistic features and automatic complex word and jargon identification. Enabled by our high-quality annotation, we benchmark and improve several state-of-the-art sentence-level readability metrics for the medical domain specifically, which include unsupervised, supervised, and prompting-based methods using recently developed large language models (LLMs). Informed by our fine-grained complex span annotation, we find that adding a single feature, capturing the number of jargon spans, into existing readability formulas can significantly improve their correlation with human judgments. We will publicly release the dataset and code.
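The reported improvement comes from adding a single feature, the number of jargon spans, to an existing readability formula. The sketch below augments a Flesch-Kincaid-style grade with such a count; the jargon list, the weight on the count, and the syllable heuristic are illustrative assumptions rather than the fitted values reported in the paper.

```python
import re

def count_syllables(word: str) -> int:
    """Crude vowel-group heuristic, good enough for a sketch."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(sentence: str) -> float:
    """Flesch-Kincaid grade level for a single sentence."""
    words = re.findall(r"[A-Za-z]+", sentence)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) + 11.8 * (syllables / max(len(words), 1)) - 15.59

def jargon_augmented_grade(sentence: str, jargon_terms, weight=1.5) -> float:
    """Readability formula plus a jargon-span count term (weight is illustrative)."""
    jargon_spans = sum(1 for term in jargon_terms if term in sentence.lower())
    return flesch_kincaid_grade(sentence) + weight * jargon_spans

sentence = "Metformin reduces hepatic gluconeogenesis in patients with type 2 diabetes."
print(jargon_augmented_grade(sentence, {"metformin", "hepatic", "gluconeogenesis"}))
```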

{"title":"MedReadMe: A Systematic Study for Fine-grained Sentence Readability in Medical Domain.","authors":"Chao Jiang, Wei Xu","doi":"10.18653/v1/2024.emnlp-main.958","DOIUrl":"10.18653/v1/2024.emnlp-main.958","url":null,"abstract":"<p><p>Medical texts are notoriously challenging to read. Properly measuring their readability is the first step towards making them more accessible. In this paper, we present a systematic study on fine-grained readability measurements in the medical domain at both sentence-level and span-level. We introduce a new dataset MedReadMe, which consists of manually annotated readability ratings and fine-grained complex span annotation for 4,520 sentences, featuring two novel \"Google-Easy\" and \"Google-Hard\" categories. It supports our quantitative analysis, which covers 650 linguistic features and automatic complex word and jargon identification. Enabled by our high-quality annotation, we benchmark and improve several state-of-the-art sentence-level readability metrics for the medical domain specifically, which include unsupervised, supervised, and prompting-based methods using recently developed large language models (LLMs). Informed by our fine-grained complex span annotation, we find that adding a single feature, capturing the number of jargon spans, into existing readability formulas can significantly improve their correlation with human judgments. We will publicly release the dataset and code.</p>","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2024 ","pages":"17293-17319"},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12225841/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144562285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Adversarial Text Generation using Large Language Models for Dementia Detection.
Youxiang Zhu, Nana Lin, Kiran Sandilya Balivada, Daniel Haehn, Xiaohui Liang

Although large language models (LLMs) excel at various text classification tasks, regular prompting strategies (e.g., few-shot prompting) do not work well for dementia detection via picture description. The challenge is that the language markers for dementia are unclear, and an LLM may struggle to relate its internal knowledge to dementia detection. In this paper, we present an accurate and interpretable classification approach based on Adversarial Text Generation (ATG), a novel decoding strategy that can relate dementia detection to other tasks. We further develop a comprehensive set of instructions corresponding to various tasks and use them to guide ATG, achieving a best accuracy of 85%, an improvement of more than 10% over regular prompting strategies. In addition, we introduce the feature context, a human-understandable text that reveals the underlying features the LLM uses to classify dementia. From feature contexts, we found that dementia detection can be related to tasks such as assessing attention to detail, language, and clarity with respect to specific features of the environment, characters, and other picture content or language-related features. Future work includes incorporating multi-modal LLMs to interpret speech and picture information.

{"title":"Adversarial Text Generation using Large Language Models for Dementia Detection.","authors":"Youxiang Zhu, Nana Lin, Kiran Sandilya Balivada, Daniel Haehn, Xiaohui Liang","doi":"10.18653/v1/2024.emnlp-main.1222","DOIUrl":"10.18653/v1/2024.emnlp-main.1222","url":null,"abstract":"<p><p>Although large language models (LLMs) excel in various text classification tasks, regular prompting strategies (e.g., few-shot prompting) do not work well with dementia detection via picture description. The challenge lies in the language marks for dementia are unclear, and LLM may struggle with relating its internal knowledge to dementia detection. In this paper, we present an accurate and interpretable classification approach by Adversarial Text Generation (ATG), a novel decoding strategy that could relate dementia detection with other tasks. We further develop a comprehensive set of instructions corresponding to various tasks and use them to guide ATG, achieving the best accuracy of 85%, >10% improvement compared to the regular prompting strategies. In addition, we introduce feature context, a human-understandable text that reveals the underlying features of LLM used for classifying dementia. From feature contexts, we found that dementia detection can be related to tasks such as assessing attention to detail, language, and clarity with specific features of the environment, character, and other picture content or language-related features. Future work includes incorporating multi-modal LLMs to interpret speech and picture information.</p>","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2024 ","pages":"21918-21933"},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12439105/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145082598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
MedAdapter: Efficient Test-Time Adaptation of Large Language Models Towards Medical Reasoning.
Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Haotian Sun, Hang Wu, Carl Yang, May D Wang

Despite their improved capabilities in generation and reasoning, adapting large language models (LLMs) to the biomedical domain remains challenging due to their immense size and privacy concerns. In this study, we propose MedAdapter, a unified post-hoc adapter for test-time adaptation of LLMs towards biomedical applications. Instead of fine-tuning the entire LLM, MedAdapter effectively adapts the original model by fine-tuning only a small BERT-sized adapter to rank candidate solutions generated by LLMs. Experiments on four biomedical tasks across eight datasets demonstrate that MedAdapter effectively adapts both white-box and black-box LLMs in biomedical reasoning, achieving average performance improvements of 18.24% and 10.96%, respectively, without requiring extensive computational resources or sharing data with third parties. MedAdapter also yields enhanced performance when combined with train-time adaptation, highlighting a flexible and complementary solution to existing adaptation methods. Faced with the challenges of balancing model performance, computational resources, and data privacy, MedAdapter provides an efficient, privacy-preserving, cost-effective, and transparent solution for adapting LLMs to the biomedical domain.
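The adapter is described as a small BERT-sized model fine-tuned to rank candidate solutions produced by a frozen LLM. A minimal ranking sketch using Hugging Face transformers follows; the checkpoint, the single-logit scoring head, and the (question, candidate) input format are assumptions, and the head shown here is untrained, whereas MedAdapter fine-tunes it on LLM-generated candidates.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed components: a generic BERT checkpoint with a 1-logit scoring head.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
ranker = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=1)
ranker.eval()

def rank_candidates(question: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Score each (question, candidate) pair and sort candidates by score."""
    inputs = tokenizer(
        [question] * len(candidates), candidates,
        padding=True, truncation=True, return_tensors="pt",
    )
    with torch.no_grad():
        scores = ranker(**inputs).logits.squeeze(-1)   # shape: (num_candidates,)
    return sorted(zip(candidates, scores.tolist()), key=lambda x: x[1], reverse=True)

# The frozen LLM would supply these candidate answers; they are placeholders here,
# and the untrained head produces arbitrary scores until fine-tuned.
candidates = ["Start metformin.", "Order an MRI.", "No treatment is indicated."]
print(rank_candidates("What is the first-line therapy for type 2 diabetes?", candidates))
```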

{"title":"MedAdapter: Efficient Test-Time Adaptation of Large Language Models Towards Medical Reasoning.","authors":"Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Haotian Sun, Hang Wu, Carl Yang, May D Wang","doi":"10.18653/v1/2024.emnlp-main.1244","DOIUrl":"10.18653/v1/2024.emnlp-main.1244","url":null,"abstract":"<p><p>Despite their improved capabilities in generation and reasoning, adapting large language models (LLMs) to the biomedical domain remains challenging due to their immense size and privacy concerns. In this study, we propose MedAdapter, a unified post-hoc adapter for test-time adaptation of LLMs towards biomedical applications. Instead of fine-tuning the entire LLM, MedAdapter effectively adapts the original model by fine-tuning only a small BERT-sized adapter to rank candidate solutions generated by LLMs. Experiments on four biomedical tasks across eight datasets demonstrate that MedAdapter effectively adapts both white-box and black-box LLMs in biomedical reasoning, achieving average performance improvements of 18.24% and 10.96%, respectively, without requiring extensive computational resources or sharing data with third parties. MedAdapter also yields enhanced performance when combined with train-time adaptation, highlighting a flexible and complementary solution to existing adaptation methods. Faced with the challenges of balancing model performance, computational resources, and data privacy, MedAdapter provides an efficient, privacy-preserving, cost-effective, and transparent solution for adapting LLMs to the biomedical domain.</p>","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2024 ","pages":"22294-22314"},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11868705/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143544192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
A Comprehensive Evaluation of Biomedical Entity Linking Models.
David Kartchner, Jennifer Deng, Shubham Lohiya, Tejasri Kopparthi, Prasanth Bathala, Daniel Domingo-Fernández, Cassie S Mitchell

Biomedical entity linking (BioEL) is the process of connecting entities referenced in documents to entries in biomedical databases such as the Unified Medical Language System (UMLS) or Medical Subject Headings (MeSH). The study objective was to comprehensively evaluate nine recent state-of-the-art biomedical entity linking models under a unified framework. We compare these models along axes of (1) accuracy, (2) speed, (3) ease of use, (4) generalization, and (5) adaptability to new ontologies and datasets. We additionally quantify the impact of various preprocessing choices such as abbreviation detection. Systematic evaluation reveals several notable gaps in current methods. In particular, current methods struggle to correctly link genes and proteins and often have difficulty effectively incorporating context into linking decisions. To expedite future development and baseline testing, we release our unified evaluation framework and all included models on GitHub at https://github.com/davidkartchner/biomedical-entity-linking.
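On the accuracy axis, evaluating a linker reduces to checking whether the gold database identifier appears among a model's top-ranked candidates. A toy recall@k computation is sketched below; the mention IDs and UMLS-style identifiers are made up for illustration and do not come from the paper's benchmark.

```python
def recall_at_k(predictions, gold, k=5):
    """Fraction of mentions whose gold identifier is in the top-k candidates.

    predictions: dict mention_id -> ranked list of candidate identifiers (e.g., UMLS CUIs)
    gold:        dict mention_id -> gold identifier
    """
    hits = sum(1 for m, g in gold.items() if g in predictions.get(m, [])[:k])
    return hits / len(gold)

# Toy example with made-up mentions and identifiers.
predictions = {
    "m1": ["C0027051", "C0011849"],   # gold at rank 1
    "m2": ["C0011849", "C0004096"],   # gold at rank 2
    "m3": ["C0020538"],               # gold missing
}
gold = {"m1": "C0027051", "m2": "C0004096", "m3": "C0013404"}
print(recall_at_k(predictions, gold, k=1), recall_at_k(predictions, gold, k=5))
```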

{"title":"A Comprehensive Evaluation of Biomedical Entity Linking Models.","authors":"David Kartchner, Jennifer Deng, Shubham Lohiya, Tejasri Kopparthi, Prasanth Bathala, Daniel Domingo-Fernández, Cassie S Mitchell","doi":"10.18653/v1/2023.emnlp-main.893","DOIUrl":"https://doi.org/10.18653/v1/2023.emnlp-main.893","url":null,"abstract":"<p><p>Biomedical entity linking (BioEL) is the process of connecting entities referenced in documents to entries in biomedical databases such as the Unified Medical Language System (UMLS) or Medical Subject Headings (MeSH). The study objective was to comprehensively evaluate nine recent state-of-the-art biomedical entity linking models under a unified framework. We compare these models along axes of (1) accuracy, (2) speed, (3) ease of use, (4) generalization, and (5) adaptability to new ontologies and datasets. We additionally quantify the impact of various preprocessing choices such as abbreviation detection. Systematic evaluation reveals several notable gaps in current methods. In particular, current methods struggle to correctly link genes and proteins and often have difficulty effectively incorporating context into linking decisions. To expedite future development and baseline testing, we release our unified evaluation framework and all included models on GitHub at https://github.com/davidkartchner/biomedical-entity-linking.</p>","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2023 ","pages":"14462-14478"},"PeriodicalIF":0.0,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11097978/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140961102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Hierarchical Pretraining on Multimodal Electronic Health Records.
Xiaochen Wang, Junyu Luo, Jiaqi Wang, Ziyi Yin, Suhan Cui, Yuan Zhong, Yaqing Wang, Fenglong Ma

Pretraining has proven to be a powerful technique in natural language processing (NLP), exhibiting remarkable success in various NLP downstream tasks. However, in the medical domain, existing pretrained models on electronic health records (EHR) fail to capture the hierarchical nature of EHR data, limiting their generalization capability across diverse downstream tasks using a single pretrained model. To tackle this challenge, this paper introduces a novel, general, and unified pretraining framework called MedHMP, specifically designed for hierarchically multimodal EHR data. The effectiveness of the proposed MedHMP is demonstrated through experimental results on eight downstream tasks spanning three levels. Comparisons against eighteen baselines further highlight the efficacy of our approach.

{"title":"Hierarchical Pretraining on Multimodal Electronic Health Records.","authors":"Xiaochen Wang, Junyu Luo, Jiaqi Wang, Ziyi Yin, Suhan Cui, Yuan Zhong, Yaqing Wang, Fenglong Ma","doi":"10.18653/v1/2023.emnlp-main.171","DOIUrl":"https://doi.org/10.18653/v1/2023.emnlp-main.171","url":null,"abstract":"<p><p>Pretraining has proven to be a powerful technique in natural language processing (NLP), exhibiting remarkable success in various NLP downstream tasks. However, in the medical domain, existing pretrained models on electronic health records (EHR) fail to capture the hierarchical nature of EHR data, limiting their generalization capability across diverse downstream tasks using a single pretrained model. To tackle this challenge, this paper introduces a novel, general, and unified pretraining framework called MedHMP, specifically designed for hierarchically multimodal EHR data. The effectiveness of the proposed MedHMP is demonstrated through experimental results on eight downstream tasks spanning three levels. Comparisons against eighteen baselines further highlight the efficacy of our approach.</p>","PeriodicalId":74540,"journal":{"name":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2023 ","pages":"2839-2852"},"PeriodicalIF":0.0,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11005845/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140873868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0