
arXiv - CS - Computation and Language: Latest Publications

Fine-tuning Large Language Models for Entity Matching
Pub Date : 2024-09-12 DOI: arxiv-2409.08185
Aaron Steiner, Ralph Peeters, Christian Bizer
Generative large language models (LLMs) are a promising alternative to pre-trained language models for entity matching due to their high zero-shot performance and their ability to generalize to unseen entities. Existing research on using LLMs for entity matching has focused on prompt engineering and in-context learning. This paper explores the potential of fine-tuning LLMs for entity matching. We analyze fine-tuning along two dimensions: 1) the representation of training examples, where we experiment with adding different types of LLM-generated explanations to the training set, and 2) the selection and generation of training examples using LLMs. In addition to the matching performance on the source dataset, we investigate how fine-tuning affects the model's ability to generalize to other in-domain datasets as well as across topical domains. Our experiments show that fine-tuning significantly improves the performance of the smaller models, while the results for the larger models are mixed. Fine-tuning also improves the generalization to in-domain datasets while hurting cross-domain transfer. We show that adding structured explanations to the training set has a positive impact on the performance of three out of four LLMs, while the proposed example selection and generation methods only improve the performance of Llama 3.1 8B while decreasing the performance of GPT-4o Mini.
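To make the first dimension concrete, here is a minimal sketch of what a fine-tuning example augmented with a structured explanation could look like, in the chat-style JSONL format that common fine-tuning APIs accept. The record fields, explanation schema, and label values are illustrative assumptions, not the authors' actual format.

```python
import json

# Hypothetical product records; attribute names are illustrative only.
record_a = {"title": "Apple iPhone 14 Pro 128GB", "brand": "Apple", "price": "999"}
record_b = {"title": "iPhone 14 Pro (128 GB)", "brand": "Apple", "price": "989"}

# A structured explanation of the kind the paper adds to training examples;
# the exact schema here is a guess, not the paper's.
explanation = {
    "attribute_comparisons": [
        {"attribute": "brand", "verdict": "match"},
        {"attribute": "title", "verdict": "match", "note": "same model and storage"},
        {"attribute": "price", "verdict": "close", "note": "small difference"},
    ],
    "decision": "match",
}

example = {
    "messages": [
        {"role": "system",
         "content": "Decide whether two product records refer to the same entity."},
        {"role": "user",
         "content": f"Record A: {json.dumps(record_a)}\nRecord B: {json.dumps(record_b)}"},
        {"role": "assistant",
         "content": json.dumps({"explanation": explanation, "label": "match"})},
    ]
}

print(json.dumps(example))  # one JSONL line per training example
```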
Citations: 0
Controllable Synthetic Clinical Note Generation with Privacy Guarantees
Pub Date : 2024-09-12 DOI: arxiv-2409.07809
Tal Baumel, Andre Manoel, Daniel Jones, Shize Su, Huseyin Inan, Aaron Bornstein, Robert Sim
In the field of machine learning, domain-specific annotated data is an invaluable resource for training effective models. However, in the medical domain, this data often includes Personal Health Information (PHI), raising significant privacy concerns. The stringent regulations surrounding PHI limit the availability and sharing of medical datasets, which poses a substantial challenge for researchers and practitioners aiming to develop advanced machine learning models. In this paper, we introduce a novel method to "clone" datasets containing PHI. Our approach ensures that the cloned datasets retain the essential characteristics and utility of the original data without compromising patient privacy. By leveraging differential-privacy techniques and a novel fine-tuning task, our method produces datasets that are free from identifiable information while preserving the statistical properties necessary for model training. We conduct utility testing to evaluate the performance of machine learning models trained on the cloned datasets. The results demonstrate that our cloned datasets not only uphold privacy standards but also enhance model performance compared to those trained on traditional anonymized datasets. This work offers a viable solution for the ethical and effective utilization of sensitive medical data in machine learning, facilitating progress in medical research and the development of robust predictive models.
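The abstract does not spell out the training recipe, but the differential-privacy machinery it relies on is typically DP-SGD. Below is a minimal sketch using the Opacus library on a toy classifier standing in for the generator model; the architecture, hyperparameters, and data are placeholders, not the paper's setup.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy stand-in for the note-generation model the paper fine-tunes.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
data = TensorDataset(torch.randn(256, 64), torch.randint(0, 2, (256,)))
loader = DataLoader(data, batch_size=32)

# DP-SGD: per-sample gradient clipping plus calibrated Gaussian noise.
engine = PrivacyEngine()
model, optimizer, loader = engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.1,  # noise scale; placeholder value
    max_grad_norm=1.0,     # per-sample clipping bound; placeholder value
)

criterion = nn.CrossEntropyLoss()
for x, y in loader:
    optimizer.zero_grad()
    criterion(model(x), y).backward()
    optimizer.step()

# Privacy budget spent so far, for a chosen delta.
print(f"epsilon ~= {engine.get_epsilon(delta=1e-5):.2f}")
```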
Citations: 0
Experimenting with Legal AI Solutions: The Case of Question-Answering for Access to Justice
Pub Date : 2024-09-12 DOI: arxiv-2409.07713
Jonathan Li, Rohan Bhambhoria, Samuel Dahan, Xiaodan Zhu
Generative AI models, such as the GPT and Llama series, have significant potential to assist laypeople in answering legal questions. However, little prior work focuses on the data sourcing, inference, and evaluation of these models in the context of laypersons. To this end, we propose a human-centric legal NLP pipeline, covering data sourcing, inference, and evaluation. We introduce and release a dataset, LegalQA, with real and specific legal questions spanning from employment law to criminal law, corresponding answers written by legal experts, and citations for each answer. We develop an automatic evaluation protocol for this dataset, then show that retrieval-augmented generation from only 850 citations in the train set can match or outperform internet-wide retrieval, despite containing 9 orders of magnitude less data. Finally, we propose future directions for open-sourced efforts, which fall behind closed-sourced models.
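To give a feel for the retrieval step, the sketch below retrieves from a tiny citation pool with TF-IDF and assembles a grounded prompt. The snippets, question, and prompt wording are invented for illustration; the paper's actual retriever may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical citation snippets standing in for the ~850-citation train set.
citations = [
    "An employer may not terminate an employee in retaliation for filing a complaint.",
    "Self-defense requires a reasonable belief of imminent harm.",
    "Overtime pay is owed for hours worked beyond 40 per week.",
]
question = "Can I be fired for reporting unsafe working conditions?"

# Score every citation against the question and keep the top two.
vectorizer = TfidfVectorizer().fit(citations)
scores = cosine_similarity(vectorizer.transform([question]),
                           vectorizer.transform(citations))[0]
top = sorted(range(len(citations)), key=lambda i: -scores[i])[:2]

# Assemble a grounded prompt for the generator LLM.
context = "\n".join(f"[{i}] {citations[i]}" for i in top)
prompt = f"Answer using only the cited passages.\n{context}\n\nQuestion: {question}"
print(prompt)
```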
Citations: 0
Top-down Activity Representation Learning for Video Question Answering
Pub Date : 2024-09-12 DOI: arxiv-2409.07748
Yanan Wang, Shuichiro Haruta, Donghuo Zeng, Julio Vizcarra, Mori Kurokawa
Capturing complex hierarchical human activities, from atomic actions (e.g., picking up one present, moving to the sofa, unwrapping the present) to contextual events (e.g., celebrating Christmas), is crucial for achieving high-performance video question answering (VideoQA). Recent works have expanded multimodal models (e.g., CLIP, LLaVA) to process continuous video sequences, enhancing the model's temporal reasoning capabilities. However, these approaches often fail to capture contextual events that can be decomposed into multiple atomic actions non-continuously distributed over relatively long-term sequences. In this paper, to leverage the spatial visual context representation capability of the CLIP model for obtaining non-continuous visual representations in terms of contextual events in videos, we convert long-term video sequences into a spatial image domain and finetune the multimodal model LLaVA for the VideoQA task. Our approach achieves competitive performance on the STAR task, in particular, with a 78.4% accuracy score, exceeding the current state-of-the-art score by 2.8 points on the NExTQA task.
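The core trick, converting a frame sequence into a single spatial image, can be sketched in a few lines with Pillow. The frame count, grid shape, and resolution below are assumptions rather than the paper's settings; the resulting grid image would then be passed to an image-text model together with the question.

```python
from PIL import Image

def tile_frames(frames, cols=4):
    """Arrange sampled video frames into one spatial grid image, so that an
    image-text model like LLaVA can see the whole sequence at once."""
    w, h = frames[0].size
    rows = -(-len(frames) // cols)  # ceiling division
    grid = Image.new("RGB", (cols * w, rows * h))
    for i, frame in enumerate(frames):
        grid.paste(frame, ((i % cols) * w, (i // cols) * h))
    return grid

# Hypothetical usage: 8 frames sampled uniformly from a long video.
frames = [Image.new("RGB", (224, 224), (i * 30, 0, 0)) for i in range(8)]
tile_frames(frames, cols=4).save("video_grid.png")
```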
Citations: 0
Ruri: Japanese General Text Embeddings
Pub Date : 2024-09-12 DOI: arxiv-2409.07737
Hayato Tsukagoshi, Ryohei Sasano
We report the development of Ruri, a series of Japanese general text embedding models. While the development of general-purpose text embedding models in English and multilingual contexts has been active in recent years, model development in Japanese remains insufficient. The primary reasons for this are the lack of datasets and the absence of necessary expertise. In this report, we provide a detailed account of the development process of Ruri. Specifically, we discuss the training of embedding models using synthesized datasets generated by LLMs, the construction of the reranker for dataset filtering and knowledge distillation, and the performance evaluation of the resulting general-purpose text embedding models.
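Ruri's reranker is trained in-house and not reproduced here; as an illustration of reranker-based dataset filtering, the sketch below scores hypothetical LLM-synthesized pairs with an off-the-shelf cross-encoder. The model name and score threshold are placeholder choices, not Ruri's.

```python
from sentence_transformers import CrossEncoder

# Placeholder reranker; Ruri trains its own Japanese model.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Hypothetical LLM-generated (query, passage) pairs.
synthetic_pairs = [
    ("capital of Japan", "Tokyo is the capital and largest city of Japan."),
    ("capital of Japan", "Sushi is a traditional Japanese dish."),
]
scores = reranker.predict(synthetic_pairs)

# Keep only pairs the reranker judges relevant; 0.0 is an arbitrary cutoff
# on this model's raw relevance scores, chosen just for the sketch.
train_pairs = [p for p, s in zip(synthetic_pairs, scores) if s > 0.0]
print(train_pairs)
```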
Citations: 0
LLM-POTUS Score: A Framework of Analyzing Presidential Debates with Large Language Models
Pub Date : 2024-09-12 DOI: arxiv-2409.08147
Zhengliang Liu, Yiwei Li, Oleksandra Zolotarevych, Rongwei Yang, Tianming Liu
Large language models have demonstrated remarkable capabilities in natural language processing, yet their application to political discourse analysis remains underexplored. This paper introduces a novel approach to evaluating presidential debate performances using LLMs, addressing the longstanding challenge of objectively assessing debate outcomes. We propose a framework that analyzes candidates' "Policies, Persona, and Perspective" (3P) and how they resonate with the "Interests, Ideologies, and Identity" (3I) of four key audience groups: voters, businesses, donors, and politicians. Our method employs large language models to generate the LLM-POTUS Score, a quantitative measure of debate performance based on the alignment between 3P and 3I. We apply this framework to analyze transcripts from recent U.S. presidential debates, demonstrating its ability to provide nuanced, multi-dimensional assessments of candidate performances. Our results reveal insights into the effectiveness of different debating strategies and their impact on various audience segments. This study not only offers a new tool for political analysis but also explores the potential and limitations of using LLMs as impartial judges in complex social contexts. In addition, this framework provides individual citizens with an independent tool to evaluate presidential debate performances, which enhances democratic engagement and reduces reliance on potentially biased media interpretations and institutional influence, thereby strengthening the foundation of informed civic participation.
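The paper does not publish its scoring formula; a bare-bones reading of the 3P x 3I alignment idea might look like the sketch below, where `llm_rate_alignment` is a placeholder for an LLM call and the plain average is our simplification.

```python
FACETS_3P = ["policies", "persona", "perspective"]
AUDIENCES = ["voters", "businesses", "donors", "politicians"]

def llm_rate_alignment(candidate_transcript: str, facet: str, audience: str) -> float:
    """Placeholder for an LLM call that rates, on a 1-5 scale, how well the
    candidate's 3P facet resonates with the audience group's interests,
    ideologies, and identity (3I)."""
    raise NotImplementedError

def llm_potus_score(candidate_transcript: str) -> float:
    # Our simplification: average alignment over all 3P x audience pairs.
    ratings = [
        llm_rate_alignment(candidate_transcript, facet, audience)
        for facet in FACETS_3P
        for audience in AUDIENCES
    ]
    return sum(ratings) / len(ratings)
```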
Citations: 0
Beyond IID: Optimizing Instruction Learning from the Perspective of Instruction Interaction and Dependency
Pub Date : 2024-09-11 DOI: arxiv-2409.07045
Hanyu Zhao, Li Du, Yiming Ju, Chengwei Wu, Tengfei Pan
With the availability of various instruction datasets, a pivotal challenge is how to effectively select and integrate these instructions to fine-tune large language models (LLMs). Previous research mainly focuses on selecting individual high-quality instructions. However, these works overlooked the joint interactions and dependencies between different categories of instructions, leading to suboptimal selection strategies. Moreover, the nature of these interaction patterns remains largely unexplored, let alone optimizing the instruction set with regard to them. To fill these gaps, in this paper, we: (1) systematically investigate interaction and dependency patterns between different categories of instructions, and (2) optimize the instruction set with respect to these interaction patterns using a linear-programming-based method, and optimize the learning schema of SFT via curriculum learning guided by an instruction dependency taxonomy. Experimental results across different LLMs demonstrate improved performance over strong baselines on widely adopted benchmarks.
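The exact objective and constraints of the paper's linear program are not given in the abstract; the sketch below only shows the shape of such a selection step with SciPy, with invented category utilities and an invented dependency constraint.

```python
import numpy as np
from scipy.optimize import linprog

categories = ["reasoning", "coding", "chat"]
quality = np.array([0.9, 0.8, 0.6])  # hypothetical per-category utility

# Maximize quality @ x  <=>  minimize -quality @ x.
c = -quality

# Invented dependency constraint: coding mass at most twice the reasoning
# mass it builds on, i.e. x_coding - 2 * x_reasoning <= 0.
A_ub = np.array([[-2.0, 1.0, 0.0]])
b_ub = np.array([0.0])

# Mixture proportions sum to one.
A_eq = np.ones((1, 3))
b_eq = np.array([1.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0.0, 1.0)] * 3, method="highs")
print(dict(zip(categories, res.x.round(3))))
```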
Citations: 0
Cross-Refine: Improving Natural Language Explanation Generation by Learning in Tandem
Pub Date : 2024-09-11 DOI: arxiv-2409.07123
Qianli Wang, Tatiana Anikina, Nils Feldhus, Simon Ostermann, Sebastian Möller, Vera Schmitt
Natural language explanations (NLEs) are vital for elucidating the reasoning behind large language model (LLM) decisions. Many techniques have been developed to generate NLEs using LLMs. However, like humans, LLMs might not always produce optimal NLEs on the first attempt. Inspired by human learning processes, we introduce Cross-Refine, which employs role modeling by deploying two LLMs as generator and critic, respectively. The generator outputs a first NLE and then refines this initial explanation using feedback and suggestions provided by the critic. Cross-Refine does not require any supervised training data or additional training. We validate Cross-Refine across three NLP tasks using three state-of-the-art open-source LLMs through automatic and human evaluation. We select Self-Refine (Madaan et al., 2023) as the baseline, which only utilizes self-feedback to refine the explanations. Our findings from automatic evaluation and a user study indicate that Cross-Refine outperforms Self-Refine. Meanwhile, Cross-Refine can perform effectively with less powerful LLMs, whereas Self-Refine only yields strong results with ChatGPT. Additionally, we conduct an ablation study to assess the importance of feedback and suggestions. Both of them play an important role in refining explanations. We further evaluate Cross-Refine on a bilingual dataset in English and German.
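The loop itself is simple to express; here is a minimal sketch where `chat` is a placeholder for whatever LLM backend serves the two roles, and the prompts are paraphrases rather than the paper's.

```python
def chat(model: str, prompt: str) -> str:
    """Placeholder for an LLM API call returning the model's text reply."""
    raise NotImplementedError

def cross_refine(question: str, answer: str, generator: str, critic: str) -> str:
    # 1) The generator drafts an initial natural language explanation (NLE).
    nle = chat(generator, f"Explain why the answer holds.\nQ: {question}\nA: {answer}")
    # 2) A second LLM critiques the draft with feedback and suggestions.
    feedback = chat(critic, f"Give feedback and suggestions for this explanation:\n{nle}")
    # 3) The generator refines its own draft using the critic's feedback.
    return chat(generator,
                f"Improve the explanation using the feedback.\n"
                f"Explanation: {nle}\nFeedback: {feedback}")
```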
Citations: 0
MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications
Pub Date : 2024-09-11 DOI: arxiv-2409.07314
Praveen K Kanithi, Clément Christophe, Marco AF Pimentel, Tathagata Raha, Nada Saadi, Hamza Javed, Svetlana Maslenkova, Nasir Hayat, Ronnie Rajan, Shadab Khan
The rapid development of Large Language Models (LLMs) for healthcare applications has spurred calls for holistic evaluation beyond frequently-cited benchmarks like USMLE, to better reflect real-world performance. While real-world assessments are valuable indicators of utility, they often lag behind the pace of LLM evolution, likely rendering findings obsolete upon deployment. This temporal disconnect necessitates a comprehensive upfront evaluation that can guide model selection for specific clinical applications. We introduce MEDIC, a framework assessing LLMs across five critical dimensions of clinical competence: medical reasoning, ethics and bias, data and language understanding, in-context learning, and clinical safety. MEDIC features a novel cross-examination framework quantifying LLM performance across areas like coverage and hallucination detection, without requiring reference outputs. We apply MEDIC to evaluate LLMs on medical question-answering, safety, summarization, note generation, and other tasks. Our results show performance disparities across model sizes and between baseline and medically finetuned models, and have implications for model selection in applications requiring specific model strengths, such as low hallucination or lower inference cost. MEDIC's multifaceted evaluation reveals these performance trade-offs, bridging the gap between theoretical capabilities and practical implementation in healthcare settings, ensuring that the most promising models are identified and adapted for diverse healthcare applications.
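How a reference-free cross-examination might be scored is not detailed in the abstract; one plausible reading, sketched below with placeholder LLM functions, probes hallucination and coverage by asking whether questions drawn from one text are answerable from the other.

```python
def cross_examination_scores(source: str, generated: str, make_questions, is_answerable):
    """Hedged sketch, not MEDIC's actual protocol. `make_questions(text)` and
    `is_answerable(question, text)` are placeholders for LLM calls."""
    # Questions grounded in the generated note probe hallucination:
    # if the source cannot answer them, the note invented content.
    note_qs = make_questions(generated)
    hallucinated = sum(not is_answerable(q, source) for q in note_qs)
    # Questions grounded in the source probe coverage:
    # if the note cannot answer them, the note omitted content.
    source_qs = make_questions(source)
    missed = sum(not is_answerable(q, generated) for q in source_qs)
    return {
        "hallucination_rate": hallucinated / len(note_qs),
        "coverage": 1.0 - missed / len(source_qs),
    }
```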
Citations: 0
Propaganda to Hate: A Multimodal Analysis of Arabic Memes with Multi-Agent LLMs
Pub Date : 2024-09-11 DOI: arxiv-2409.07246
Firoj Alam, Md. Rafiul Biswas, Uzair Shah, Wajdi Zaghouani, Georgios Mikros
In the past decade, social media platforms have been used for information dissemination and consumption. While a major portion of the content is posted to promote citizen journalism and public awareness, some content is posted to mislead users. Among different content types such as text, images, and videos, memes (text overlaid on images) are particularly prevalent and can serve as powerful vehicles for propaganda, hate, and humor. In the current literature, there have been efforts to individually detect such content in memes. However, the study of their intersection is very limited. In this study, we explore the intersection between propaganda and hate in memes using a multi-agent LLM-based approach. We extend the propagandistic meme dataset with coarse and fine-grained hate labels. Our finding suggests that there is an association between propaganda and hate in memes. We provide detailed experimental results that can serve as a baseline for future studies. We will make the experimental resources publicly available to the community.
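As an illustration of the multi-agent setup, the sketch below has separate agents judge propaganda and hate before a third reconciles the labels; the prompts and the `call_llm` function are placeholders, and the paper's actual agent design may differ.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call returning the model's text reply."""
    raise NotImplementedError

def label_meme(meme_text: str, image_caption: str) -> str:
    context = f"Meme text: {meme_text}\nImage description: {image_caption}"
    # Agent 1: propaganda judgment.
    propaganda = call_llm(f"Is this meme propagandistic? Answer and justify.\n{context}")
    # Agent 2: coarse and fine-grained hate judgment.
    hate = call_llm(f"Does this meme contain hate speech? If so, what kind?\n{context}")
    # Agent 3: reconcile the two judgments into final labels.
    return call_llm(
        f"Combine these judgments into final propaganda and hate labels.\n"
        f"Propaganda: {propaganda}\nHate: {hate}"
    )
```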
Citations: 0