Fine-tuning Large Language Models for Entity Matching
Aaron Steiner, Ralph Peeters, Christian Bizer
Generative large language models (LLMs) are a promising alternative to pre-trained language models for entity matching due to their high zero-shot performance and their ability to generalize to unseen entities. Existing research on using LLMs for entity matching has focused on prompt engineering and in-context learning. This paper explores the potential of fine-tuning LLMs for entity matching. We analyze fine-tuning along two dimensions: 1) the representation of training examples, where we experiment with adding different types of LLM-generated explanations to the training set, and 2) the selection and generation of training examples using LLMs. In addition to the matching performance on the source dataset, we investigate how fine-tuning affects the model's ability to generalize to other in-domain datasets as well as across topical domains. Our experiments show that fine-tuning significantly improves the performance of the smaller models, while the results for the larger models are mixed. Fine-tuning also improves generalization to in-domain datasets while hurting cross-domain transfer. We show that adding structured explanations to the training set has a positive impact on the performance of three out of four LLMs, whereas the proposed example selection and generation methods improve the performance of Llama 3.1 8B but decrease that of GPT-4o Mini.
{"title":"Fine-tuning Large Language Models for Entity Matching","authors":"Aaron Steiner, Ralph Peeters, Christian Bizer","doi":"arxiv-2409.08185","DOIUrl":"https://doi.org/arxiv-2409.08185","url":null,"abstract":"Generative large language models (LLMs) are a promising alternative to\u0000pre-trained language models for entity matching due to their high zero-shot\u0000performance and their ability to generalize to unseen entities. Existing\u0000research on using LLMs for entity matching has focused on prompt engineering\u0000and in-context learning. This paper explores the potential of fine-tuning LLMs\u0000for entity matching. We analyze fine-tuning along two dimensions: 1) The\u0000representation of training examples, where we experiment with adding different\u0000types of LLM-generated explanations to the training set, and 2) the selection\u0000and generation of training examples using LLMs. In addition to the matching\u0000performance on the source dataset, we investigate how fine-tuning affects the\u0000model's ability to generalize to other in-domain datasets as well as across\u0000topical domains. Our experiments show that fine-tuning significantly improves\u0000the performance of the smaller models while the results for the larger models\u0000are mixed. Fine-tuning also improves the generalization to in-domain datasets\u0000while hurting cross-domain transfer. We show that adding structured\u0000explanations to the training set has a positive impact on the performance of\u0000three out of four LLMs, while the proposed example selection and generation\u0000methods only improve the performance of Llama 3.1 8B while decreasing the\u0000performance of GPT-4o Mini.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Top-down Activity Representation Learning for Video Question Answering
Yanan Wang, Shuichiro Haruta, Donghuo Zeng, Julio Vizcarra, Mori Kurokawa
Capturing complex hierarchical human activities, from atomic actions (e.g., picking up one present, moving to the sofa, unwrapping the present) to contextual events (e.g., celebrating Christmas), is crucial for achieving high-performance video question answering (VideoQA). Recent works have expanded multimodal models (e.g., CLIP, LLaVA) to process continuous video sequences, enhancing their temporal reasoning capabilities. However, these approaches often fail to capture contextual events that can be decomposed into multiple atomic actions non-continuously distributed over relatively long-term sequences. In this paper, to leverage the spatial visual-context representation capability of the CLIP model for obtaining non-continuous visual representations of contextual events in videos, we convert long-term video sequences into a spatial image domain and fine-tune the multimodal model LLaVA for the VideoQA task. Our approach achieves competitive performance on the STAR task and, on the NExTQA task, reaches a 78.4% accuracy score, exceeding the current state-of-the-art score by 2.8 points.
{"title":"Top-down Activity Representation Learning for Video Question Answering","authors":"Yanan Wang, Shuichiro Haruta, Donghuo Zeng, Julio Vizcarra, Mori Kurokawa","doi":"arxiv-2409.07748","DOIUrl":"https://doi.org/arxiv-2409.07748","url":null,"abstract":"Capturing complex hierarchical human activities, from atomic actions (e.g.,\u0000picking up one present, moving to the sofa, unwrapping the present) to\u0000contextual events (e.g., celebrating Christmas) is crucial for achieving\u0000high-performance video question answering (VideoQA). Recent works have expanded\u0000multimodal models (e.g., CLIP, LLaVA) to process continuous video sequences,\u0000enhancing the model's temporal reasoning capabilities. However, these\u0000approaches often fail to capture contextual events that can be decomposed into\u0000multiple atomic actions non-continuously distributed over relatively long-term\u0000sequences. In this paper, to leverage the spatial visual context representation\u0000capability of the CLIP model for obtaining non-continuous visual\u0000representations in terms of contextual events in videos, we convert long-term\u0000video sequences into a spatial image domain and finetune the multimodal model\u0000LLaVA for the VideoQA task. Our approach achieves competitive performance on\u0000the STAR task, in particular, with a 78.4% accuracy score, exceeding the\u0000current state-of-the-art score by 2.8 points on the NExTQA task.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Experimenting with Legal AI Solutions: The Case of Question-Answering for Access to Justice
Jonathan Li, Rohan Bhambhoria, Samuel Dahan, Xiaodan Zhu
Generative AI models, such as the GPT and Llama series, have significant potential to assist laypeople in answering legal questions. However, little prior work focuses on the data sourcing, inference, and evaluation of these models in the context of laypersons. To this end, we propose a human-centric legal NLP pipeline, covering data sourcing, inference, and evaluation. We introduce and release a dataset, LegalQA, with real and specific legal questions spanning from employment law to criminal law, corresponding answers written by legal experts, and citations for each answer. We develop an automatic evaluation protocol for this dataset, then show that retrieval-augmented generation from only 850 citations in the train set can match or outperform internet-wide retrieval, despite the citation set containing 9 orders of magnitude less data. Finally, we propose future directions for open-source efforts, which currently fall behind closed-source models.
{"title":"Experimenting with Legal AI Solutions: The Case of Question-Answering for Access to Justice","authors":"Jonathan Li, Rohan Bhambhoria, Samuel Dahan, Xiaodan Zhu","doi":"arxiv-2409.07713","DOIUrl":"https://doi.org/arxiv-2409.07713","url":null,"abstract":"Generative AI models, such as the GPT and Llama series, have significant\u0000potential to assist laypeople in answering legal questions. However, little\u0000prior work focuses on the data sourcing, inference, and evaluation of these\u0000models in the context of laypersons. To this end, we propose a human-centric\u0000legal NLP pipeline, covering data sourcing, inference, and evaluation. We\u0000introduce and release a dataset, LegalQA, with real and specific legal\u0000questions spanning from employment law to criminal law, corresponding answers\u0000written by legal experts, and citations for each answer. We develop an\u0000automatic evaluation protocol for this dataset, then show that\u0000retrieval-augmented generation from only 850 citations in the train set can\u0000match or outperform internet-wide retrieval, despite containing 9 orders of\u0000magnitude less data. Finally, we propose future directions for open-sourced\u0000efforts, which fall behind closed-sourced models.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Controllable Synthetic Clinical Note Generation with Privacy Guarantees
Tal Baumel, Andre Manoel, Daniel Jones, Shize Su, Huseyin Inan, Aaron Bornstein, Robert Sim
In the field of machine learning, domain-specific annotated data is an invaluable resource for training effective models. However, in the medical domain, this data often includes Personal Health Information (PHI), raising significant privacy concerns. The stringent regulations surrounding PHI limit the availability and sharing of medical datasets, which poses a substantial challenge for researchers and practitioners aiming to develop advanced machine learning models. In this paper, we introduce a novel method to "clone" datasets containing PHI. Our approach ensures that the cloned datasets retain the essential characteristics and utility of the original data without compromising patient privacy. By leveraging differential-privacy techniques and a novel fine-tuning task, our method produces datasets that are free from identifiable information while preserving the statistical properties necessary for model training. We conduct utility testing to evaluate the performance of machine learning models trained on the cloned datasets. The results demonstrate that our cloned datasets not only uphold privacy standards but also enhance model performance compared to those trained on traditional anonymized datasets. This work offers a viable solution for the ethical and effective utilization of sensitive medical data in machine learning, facilitating progress in medical research and the development of robust predictive models.
{"title":"Controllable Synthetic Clinical Note Generation with Privacy Guarantees","authors":"Tal BaumelAri, Andre ManoelAri, Daniel JonesAri, Shize SuAri, Huseyin InanAri, AaronAri, Bornstein, Robert Sim","doi":"arxiv-2409.07809","DOIUrl":"https://doi.org/arxiv-2409.07809","url":null,"abstract":"In the field of machine learning, domain-specific annotated data is an\u0000invaluable resource for training effective models. However, in the medical\u0000domain, this data often includes Personal Health Information (PHI), raising\u0000significant privacy concerns. The stringent regulations surrounding PHI limit\u0000the availability and sharing of medical datasets, which poses a substantial\u0000challenge for researchers and practitioners aiming to develop advanced machine\u0000learning models. In this paper, we introduce a novel method to \"clone\" datasets\u0000containing PHI. Our approach ensures that the cloned datasets retain the\u0000essential characteristics and utility of the original data without compromising\u0000patient privacy. By leveraging differential-privacy techniques and a novel\u0000fine-tuning task, our method produces datasets that are free from identifiable\u0000information while preserving the statistical properties necessary for model\u0000training. We conduct utility testing to evaluate the performance of machine\u0000learning models trained on the cloned datasets. The results demonstrate that\u0000our cloned datasets not only uphold privacy standards but also enhance model\u0000performance compared to those trained on traditional anonymized datasets. This\u0000work offers a viable solution for the ethical and effective utilization of\u0000sensitive medical data in machine learning, facilitating progress in medical\u0000research and the development of robust predictive models.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ruri: Japanese General Text Embeddings
Hayato Tsukagoshi, Ryohei Sasano
We report the development of Ruri, a series of Japanese general text embedding models. While the development of general-purpose text embedding models in English and multilingual contexts has been active in recent years, model development in Japanese remains insufficient. The primary reasons for this are the lack of datasets and the absence of necessary expertise. In this report, we provide a detailed account of the development process of Ruri. Specifically, we discuss the training of embedding models using datasets synthesized by LLMs, the construction of a reranker for dataset filtering and knowledge distillation, and the performance evaluation of the resulting general-purpose text embedding models.
{"title":"Ruri: Japanese General Text Embeddings","authors":"Hayato Tsukagoshi, Ryohei Sasano","doi":"arxiv-2409.07737","DOIUrl":"https://doi.org/arxiv-2409.07737","url":null,"abstract":"We report the development of Ruri, a series of Japanese general text\u0000embedding models. While the development of general-purpose text embedding\u0000models in English and multilingual contexts has been active in recent years,\u0000model development in Japanese remains insufficient. The primary reasons for\u0000this are the lack of datasets and the absence of necessary expertise. In this\u0000report, we provide a detailed account of the development process of Ruri.\u0000Specifically, we discuss the training of embedding models using synthesized\u0000datasets generated by LLMs, the construction of the reranker for dataset\u0000filtering and knowledge distillation, and the performance evaluation of the\u0000resulting general-purpose text embedding models.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LLM-POTUS Score: A Framework of Analyzing Presidential Debates with Large Language Models
Zhengliang Liu, Yiwei Li, Oleksandra Zolotarevych, Rongwei Yang, Tianming Liu
Large language models have demonstrated remarkable capabilities in natural language processing, yet their application to political discourse analysis remains underexplored. This paper introduces a novel approach to evaluating presidential debate performances using LLMs, addressing the longstanding challenge of objectively assessing debate outcomes. We propose a framework that analyzes candidates' "Policies, Persona, and Perspective" (3P) and how they resonate with the "Interests, Ideologies, and Identity" (3I) of four key audience groups: voters, businesses, donors, and politicians. Our method employs large language models to generate the LLM-POTUS Score, a quantitative measure of debate performance based on the alignment between 3P and 3I. We apply this framework to analyze transcripts from recent U.S. presidential debates, demonstrating its ability to provide nuanced, multi-dimensional assessments of candidate performances. Our results reveal insights into the effectiveness of different debating strategies and their impact on various audience segments. This study not only offers a new tool for political analysis but also explores the potential and limitations of using LLMs as impartial judges in complex social contexts. In addition, this framework provides individual citizens with an independent tool to evaluate presidential debate performances, which enhances democratic engagement and reduces reliance on potentially biased media interpretations and institutional influence, thereby strengthening the foundation of informed civic participation.
{"title":"LLM-POTUS Score: A Framework of Analyzing Presidential Debates with Large Language Models","authors":"Zhengliang Liu, Yiwei Li, Oleksandra Zolotarevych, Rongwei Yang, Tianming Liu","doi":"arxiv-2409.08147","DOIUrl":"https://doi.org/arxiv-2409.08147","url":null,"abstract":"Large language models have demonstrated remarkable capabilities in natural\u0000language processing, yet their application to political discourse analysis\u0000remains underexplored. This paper introduces a novel approach to evaluating\u0000presidential debate performances using LLMs, addressing the longstanding\u0000challenge of objectively assessing debate outcomes. We propose a framework that\u0000analyzes candidates' \"Policies, Persona, and Perspective\" (3P) and how they\u0000resonate with the \"Interests, Ideologies, and Identity\" (3I) of four key\u0000audience groups: voters, businesses, donors, and politicians. Our method\u0000employs large language models to generate the LLM-POTUS Score, a quantitative\u0000measure of debate performance based on the alignment between 3P and 3I. We\u0000apply this framework to analyze transcripts from recent U.S. presidential\u0000debates, demonstrating its ability to provide nuanced, multi-dimensional\u0000assessments of candidate performances. Our results reveal insights into the\u0000effectiveness of different debating strategies and their impact on various\u0000audience segments. This study not only offers a new tool for political analysis\u0000but also explores the potential and limitations of using LLMs as impartial\u0000judges in complex social contexts. In addition, this framework provides\u0000individual citizens with an independent tool to evaluate presidential debate\u0000performances, which enhances democratic engagement and reduces reliance on\u0000potentially biased media interpretations and institutional influence, thereby\u0000strengthening the foundation of informed civic participation.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automated Speaking Assessment of Conversation Tests with Novel Graph-based Modeling on Spoken Response Coherence
Jiun-Ting Li, Bi-Cheng Yan, Tien-Hong Lo, Yi-Cheng Wang, Yung-Chang Hsu, Berlin Chen
Automated speaking assessment in conversation tests (ASAC) aims to evaluate the overall speaking proficiency of an L2 (second-language) speaker in a setting where an interlocutor interacts with one or more candidates. Although prior ASAC approaches have shown promising performance on their respective datasets, there is still a dearth of research specifically focused on incorporating the coherence of the logical flow within a conversation into the grading model. To address this critical challenge, we propose a hierarchical graph model that aptly incorporates both broad inter-response interactions (e.g., discourse relations) and nuanced semantic information (e.g., semantic words and speaker intents), which is subsequently fused with contextual information for the final prediction. Extensive experimental results on the NICT-JLE benchmark dataset suggest that our proposed modeling approach can yield considerable improvements in prediction accuracy with respect to various assessment metrics, as compared to some strong baselines. This also sheds light on the importance of investigating coherence-related facets of spoken responses in ASAC.
{"title":"Automated Speaking Assessment of Conversation Tests with Novel Graph-based Modeling on Spoken Response Coherence","authors":"Jiun-Ting Li, Bi-Cheng Yan, Tien-Hong Lo, Yi-Cheng Wang, Yung-Chang Hsu, Berlin Chen","doi":"arxiv-2409.07064","DOIUrl":"https://doi.org/arxiv-2409.07064","url":null,"abstract":"Automated speaking assessment in conversation tests (ASAC) aims to evaluate\u0000the overall speaking proficiency of an L2 (second-language) speaker in a\u0000setting where an interlocutor interacts with one or more candidates. Although\u0000prior ASAC approaches have shown promising performance on their respective\u0000datasets, there is still a dearth of research specifically focused on\u0000incorporating the coherence of the logical flow within a conversation into the\u0000grading model. To address this critical challenge, we propose a hierarchical\u0000graph model that aptly incorporates both broad inter-response interactions\u0000(e.g., discourse relations) and nuanced semantic information (e.g., semantic\u0000words and speaker intents), which is subsequently fused with contextual\u0000information for the final prediction. Extensive experimental results on the\u0000NICT-JLE benchmark dataset suggest that our proposed modeling approach can\u0000yield considerable improvements in prediction accuracy with respect to various\u0000assessment metrics, as compared to some strong baselines. This also sheds light\u0000on the importance of investigating coherence-related facets of spoken responses\u0000in ASAC.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zero-Shot Machine-Generated Text Detection Using Mixture of Large Language Models
Matthieu Dubois, François Yvon, Pablo Piantanida
The dissemination of Large Language Models (LLMs), trained at scale and endowed with powerful text-generating abilities, has vastly increased the threats posed by generative AI technologies by reducing the cost of producing harmful, toxic, fake, or forged content. In response, various proposals have been made to automatically discriminate artificially generated from human-written texts, typically framing the problem as a classification task. Most approaches evaluate an input document with a well-chosen detector LLM, assuming that low perplexity scores reliably signal machine-made content. Because relying on a single detector makes performance brittle, we instead consider several detectors and derive a new, theoretically grounded approach to combining their respective strengths. Our experiments, using a variety of generator LLMs, suggest that our method effectively increases the robustness of detection.
{"title":"Zero-Shot Machine-Generated Text Detection Using Mixture of Large Language Models","authors":"Matthieu Dubois, François Yvon, Pablo Piantanida","doi":"arxiv-2409.07615","DOIUrl":"https://doi.org/arxiv-2409.07615","url":null,"abstract":"The dissemination of Large Language Models (LLMs), trained at scale, and\u0000endowed with powerful text-generating abilities has vastly increased the\u0000threats posed by generative AI technologies by reducing the cost of producing\u0000harmful, toxic, faked or forged content. In response, various proposals have\u0000been made to automatically discriminate artificially generated from\u0000human-written texts, typically framing the problem as a classification problem.\u0000Most approaches evaluate an input document by a well-chosen detector LLM,\u0000assuming that low-perplexity scores reliably signal machine-made content. As\u0000using one single detector can induce brittleness of performance, we instead\u0000consider several and derive a new, theoretically grounded approach to combine\u0000their respective strengths. Our experiments, using a variety of generator LLMs,\u0000suggest that our method effectively increases the robustness of detection.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"70 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Beyond IID: Optimizing Instruction Learning from the Perspective of Instruction Interaction and Dependency
Hanyu Zhao, Li Du, Yiming Ju, Chengwei Wu, Tengfei Pan
With the availability of various instruction datasets, a pivotal challenge is how to effectively select and integrate these instructions to fine-tune large language models (LLMs). Previous research has mainly focused on selecting individual high-quality instructions, overlooking the joint interactions and dependencies between different categories of instructions and thus arriving at suboptimal selection strategies. Moreover, the nature of these interaction patterns remains largely unexplored, and no existing work optimizes the instruction set with regard to them. To fill these gaps, in this paper we: (1) systematically investigate interaction and dependency patterns between different categories of instructions, and (2) optimize the instruction set with respect to these interaction patterns using a linear-programming-based method, and optimize the SFT learning schema using curriculum learning guided by an instruction-dependency taxonomy. Experimental results across different LLMs demonstrate improved performance over strong baselines on widely adopted benchmarks.
{"title":"Beyond IID: Optimizing Instruction Learning from the Perspective of Instruction Interaction and Dependency","authors":"Hanyu Zhao, Li Du, Yiming Ju, Chengwei Wu, Tengfei Pan","doi":"arxiv-2409.07045","DOIUrl":"https://doi.org/arxiv-2409.07045","url":null,"abstract":"With the availability of various instruction datasets, a pivotal challenge is\u0000how to effectively select and integrate these instructions to fine-tune large\u0000language models (LLMs). Previous research mainly focuses on selecting\u0000individual high-quality instructions. However, these works overlooked the joint\u0000interactions and dependencies between different categories of instructions,\u0000leading to suboptimal selection strategies. Moreover, the nature of these\u0000interaction patterns remains largely unexplored, let alone optimize the\u0000instruction set with regard to them. To fill these gaps, in this paper, we: (1)\u0000systemically investigate interaction and dependency patterns between different\u0000categories of instructions, (2) manage to optimize the instruction set\u0000concerning the interaction patterns using a linear programming-based method,\u0000and optimize the learning schema of SFT using an instruction dependency\u0000taxonomy guided curriculum learning. Experimental results across different LLMs\u0000demonstrate improved performance over strong baselines on widely adopted\u0000benchmarks.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"106 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications
Praveen K Kanithi, Clément Christophe, Marco AF Pimentel, Tathagata Raha, Nada Saadi, Hamza Javed, Svetlana Maslenkova, Nasir Hayat, Ronnie Rajan, Shadab Khan
The rapid development of Large Language Models (LLMs) for healthcare applications has spurred calls for holistic evaluation beyond frequently cited benchmarks like USMLE, to better reflect real-world performance. While real-world assessments are valuable indicators of utility, they often lag behind the pace of LLM evolution, likely rendering findings obsolete upon deployment. This temporal disconnect necessitates a comprehensive upfront evaluation that can guide model selection for specific clinical applications. We introduce MEDIC, a framework assessing LLMs across five critical dimensions of clinical competence: medical reasoning, ethics and bias, data and language understanding, in-context learning, and clinical safety. MEDIC features a novel cross-examination framework quantifying LLM performance across areas like coverage and hallucination detection, without requiring reference outputs. We apply MEDIC to evaluate LLMs on medical question-answering, safety, summarization, note generation, and other tasks. Our results show performance disparities across model sizes and between baseline and medically fine-tuned models, with implications for model selection in applications requiring specific strengths, such as low hallucination rates or low inference cost. MEDIC's multifaceted evaluation reveals these performance trade-offs, bridging the gap between theoretical capabilities and practical implementation in healthcare settings and helping ensure that the most promising models are identified and adapted for diverse healthcare applications.
{"title":"MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications","authors":"Praveen K Kanithi, Clément Christophe, Marco AF Pimentel, Tathagata Raha, Nada Saadi, Hamza Javed, Svetlana Maslenkova, Nasir Hayat, Ronnie Rajan, Shadab Khan","doi":"arxiv-2409.07314","DOIUrl":"https://doi.org/arxiv-2409.07314","url":null,"abstract":"The rapid development of Large Language Models (LLMs) for healthcare\u0000applications has spurred calls for holistic evaluation beyond frequently-cited\u0000benchmarks like USMLE, to better reflect real-world performance. While\u0000real-world assessments are valuable indicators of utility, they often lag\u0000behind the pace of LLM evolution, likely rendering findings obsolete upon\u0000deployment. This temporal disconnect necessitates a comprehensive upfront\u0000evaluation that can guide model selection for specific clinical applications.\u0000We introduce MEDIC, a framework assessing LLMs across five critical dimensions\u0000of clinical competence: medical reasoning, ethics and bias, data and language\u0000understanding, in-context learning, and clinical safety. MEDIC features a novel\u0000cross-examination framework quantifying LLM performance across areas like\u0000coverage and hallucination detection, without requiring reference outputs. We\u0000apply MEDIC to evaluate LLMs on medical question-answering, safety,\u0000summarization, note generation, and other tasks. Our results show performance\u0000disparities across model sizes, baseline vs medically finetuned models, and\u0000have implications on model selection for applications requiring specific model\u0000strengths, such as low hallucination or lower cost of inference. MEDIC's\u0000multifaceted evaluation reveals these performance trade-offs, bridging the gap\u0000between theoretical capabilities and practical implementation in healthcare\u0000settings, ensuring that the most promising models are identified and adapted\u0000for diverse healthcare applications.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"54 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}