In this paper, we evaluate the creative fiction writing abilities of a fine-tuned small language model (SLM), BART Large, and compare its performance to humans and two large language models (LLMs): GPT-3.5 and GPT-4o. Our evaluation consists of two experiments: (i) a human evaluation where readers assess the stories generated by the SLM compared to human-written stories, and (ii) a qualitative linguistic analysis comparing the textual characteristics of the stories generated by the different models. In the first experiment, we asked 68 participants to rate short stories generated by the models and humans along dimensions such as grammaticality, relevance, creativity, and attractiveness. BART Large outperformed human writers in most aspects, except creativity, with an overall score of 2.11 compared to 1.85 for human-written texts -- a 14% improvement. In the second experiment, the qualitative analysis revealed that, while GPT-4o exhibited near-perfect internal and external coherence, it tended to produce more predictable narratives, with only 3% of its stories seen as novel. In contrast, 15% of BART's stories were considered novel, indicating a higher degree of creativity despite its smaller model size. This study provides both quantitative and qualitative insights into how model size and fine-tuning influence the balance between creativity, fluency, and coherence in creative writing tasks.
"Small Language Models can Outperform Humans in Short Creative Writing: A Study Comparing SLMs with Humans and LLMs" (Guillermo Marco, Luz Rello, Julio Gonzalo). arXiv:2409.11547, 2024-09-17.
The development of unbiased large language models is widely recognized as crucial, yet existing benchmarks fall short in detecting biases due to limited scope, contamination, and the lack of a fairness baseline. SAGED(-Bias) is the first holistic benchmarking pipeline to address these problems. The pipeline encompasses five core stages: scraping materials, assembling benchmarks, generating responses, extracting numeric features, and diagnosing with disparity metrics. SAGED includes metrics for max disparity, such as impact ratio, and for bias concentration, such as Max Z-scores. Because assessment-tool bias and contextual bias in prompts can distort evaluation, SAGED implements counterfactual branching and baseline calibration to mitigate them. For demonstration, we apply SAGED to the G20 countries with popular 8B-level models, including Gemma2, Llama3.1, Mistral, and Qwen2. With sentiment analysis, we find that while Mistral and Qwen2 show lower max disparity and higher bias concentration than Gemma2 and Llama3.1, all models are notably biased against countries such as Russia and (except for Qwen2) China. In further experiments in which models role-play U.S. (vice-/former-) presidents, we observe that bias amplifies and shifts in heterogeneous directions. Moreover, Qwen2 and Mistral do not engage in role-playing, while Llama3.1 and Gemma2 role-play Trump notably more intensively than Biden and Harris, indicating role-playing performance bias in these models.
"SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with Customisable Fairness Calibration" (Xin Guan, Nathaniel Demchak, Saloni Gupta, Ze Wang, Ediz Ertekin Jr., Adriano Koshiyama, Emre Kazim, Zekun Wu). arXiv:2409.11149, 2024-09-17.
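As a rough illustration of the two metric families named in the SAGED abstract above, the sketch below computes an impact ratio and a max |z|-score over hypothetical per-country sentiment means. The group names, scores, and exact formulas are assumptions, not the pipeline's own implementation.

```python
# Hedged sketch: disparity metrics over per-group mean sentiment scores.
# The country names and scores are illustrative, not SAGED's actual outputs.
import statistics

sentiment_by_country = {          # hypothetical mean sentiment per group
    "Brazil": 0.62, "China": 0.41, "Germany": 0.58,
    "Russia": 0.33, "United States": 0.60,
}
scores = list(sentiment_by_country.values())

# Max disparity as an impact ratio: lowest group mean over highest group mean.
impact_ratio = min(scores) / max(scores)

# Bias concentration as a max |z|-score: how far the most extreme group sits
# from the cross-group mean, in units of the cross-group standard deviation.
mu, sigma = statistics.mean(scores), statistics.stdev(scores)
max_z = max(abs(s - mu) / sigma for s in scores)

print(f"impact ratio = {impact_ratio:.3f}, max |z| = {max_z:.3f}")
```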
Contextualized embeddings vary by context, even for the same token, and form a distribution in the embedding space. To analyze this distribution, we focus on the norm of the mean embedding and the variance of the embeddings. In this study, we first demonstrate that these values follow the well-known formula for variance in statistics and provide an efficient sequential computation method. Then, by observing embeddings from intermediate layers of several Transformer models, we find a strong trade-off relationship between the norm and the variance: as the mean embedding moves closer to the origin, the variance increases. This trade-off is likely influenced by the layer normalization mechanism used in Transformer models. Furthermore, when the sets of token embeddings are treated as clusters, we show that the variance of the entire embedding set can theoretically be decomposed into the within-cluster variance and the between-cluster variance. We find experimentally that as the layers of Transformer models deepen, the embeddings move farther from the origin, the between-cluster variance relatively decreases, and the within-cluster variance relatively increases. These results are consistent with existing studies on the anisotropy of the embedding spaces across layers.
"Norm of Mean Contextualized Embeddings Determines their Variance" (Hiroaki Yamagiwa, Hidetoshi Shimodaira). arXiv:2409.11253, 2024-09-17.
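The variance identities referenced in the abstract above can be checked numerically. The sketch below uses random vectors in place of contextualized embeddings and equal-sized clusters in place of tokens, so it illustrates the decomposition itself rather than the paper's measurements.

```python
# Hedged sketch: numerically verify the variance identities the abstract refers to.
# Random vectors stand in for embeddings; equal-sized clusters stand in for tokens.
import numpy as np

rng = np.random.default_rng(0)
clusters = [rng.normal(loc=rng.normal(size=16), scale=1.0, size=(50, 16))
            for _ in range(5)]                       # 5 "tokens", 50 embeddings each
X = np.vstack(clusters)

mu = X.mean(axis=0)
total_var = np.mean(np.sum((X - mu) ** 2, axis=1))   # E[||x - mu||^2]

# Identity 1: total variance = E[||x||^2] - ||mu||^2 (norm of the mean enters directly).
assert np.isclose(total_var, np.mean(np.sum(X ** 2, axis=1)) - np.sum(mu ** 2))

# Identity 2 (equal-sized clusters): total = within-cluster + between-cluster variance.
within = np.mean([np.mean(np.sum((C - C.mean(0)) ** 2, axis=1)) for C in clusters])
between = np.mean([np.sum((C.mean(0) - mu) ** 2) for C in clusters])
assert np.isclose(total_var, within + between)

print(total_var, within, between)
```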
Recent advances in large language models (LLMs) have spurred interest in speech-text multimodal foundation models, which achieve strong performance on instruction-based speech translation (ST). However, expanding the language pairs of an existing instruction-tuned ST system is costly because it requires re-training on a combination of new and previous datasets. We propose to expand to new language pairs by merging a model trained on the new language pairs with the existing model using task arithmetic. We find that directly applying task arithmetic to ST causes the merged model to fail to follow instructions and thus generate translations in the wrong language. To eliminate this language confusion, we propose an augmented task arithmetic method that additionally merges a language control model trained to generate the correct target-language token according to the instructions. Our experiments demonstrate that the proposed language control model achieves language expansion by eliminating language confusion. In our MuST-C and CoVoST-2 experiments, it improves BLEU by up to 4.66 and 4.92 points, respectively. In addition, we demonstrate that our task arithmetic framework can extend to a language pair for which neither paired ST training data nor a pre-trained ST model is available: we first synthesize an ST system from machine translation (MT) systems via task analogy, then merge the synthesized ST system into the existing ST model.
"Task Arithmetic for Language Expansion in Speech Translation" (Yao-Fei Cheng, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Wen Shen Teo, Siddhant Arora, Shinji Watanabe). arXiv:2409.11274, 2024-09-17.
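As a minimal sketch of the plain task-arithmetic step underlying the method above, the code below builds task vectors as parameter deltas from a shared base and adds them with a scaling coefficient. The model names, the coefficient lam, and the toy parameters are assumptions; the paper's language control model and task-analogy synthesis are not shown.

```python
# Hedged sketch of plain task arithmetic on parameter dictionaries.
from typing import Dict, List
import torch

def task_vector(finetuned: Dict[str, torch.Tensor],
                base: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    """Parameter delta of a fine-tuned model relative to the shared base."""
    return {k: finetuned[k] - base[k] for k in base}

def merge(base: Dict[str, torch.Tensor],
          task_vectors: List[Dict[str, torch.Tensor]],
          lam: float = 0.5) -> Dict[str, torch.Tensor]:
    """Add scaled task vectors to the base parameters."""
    merged = {k: v.clone() for k, v in base.items()}
    for tv in task_vectors:
        for k in merged:
            merged[k] += lam * tv[k]
    return merged

# Toy usage with random 2x2 "parameters" standing in for real checkpoints.
base = {"w": torch.zeros(2, 2)}
st_existing = {"w": torch.ones(2, 2)}        # existing ST model
st_new_pair = {"w": 2 * torch.ones(2, 2)}    # model trained on the new language pair
merged = merge(base, [task_vector(st_existing, base), task_vector(st_new_pair, base)])
print(merged["w"])
```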
Rebecca M. M. Hicke, Yuri Bizzoni, Pascale Feldkamp, Ross Deans Kristensen-McLachlan
Focalization, the perspective through which narrative is presented, is encoded via a wide range of lexico-grammatical features and is subject to reader interpretation. Moreover, trained readers regularly disagree on interpretations, suggesting that this problem may be computationally intractable. In this paper, we provide experiments to test how well contemporary Large Language Models (LLMs) perform when annotating literary texts for focalization mode. Despite the challenging nature of the task, LLMs show comparable performance to trained human annotators in our experiments. We provide a case study working with the novels of Stephen King to demonstrate the usefulness of this approach for computational literary studies, illustrating how focalization can be studied at scale.
"Says Who? Effective Zero-Shot Annotation of Focalization". arXiv:2409.11390, 2024-09-17.
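A minimal sketch of the kind of zero-shot annotation setup described above: build a classification prompt and map the model's reply to a focalization label. The label set, prompt wording, and the call_llm stub are illustrative placeholders rather than the authors' protocol.

```python
# Hedged sketch of zero-shot focalization annotation with an LLM.
# Labels, prompt text, and call_llm are hypothetical placeholders.
LABELS = ["internal", "external", "zero"]   # assumed focalization modes

def build_prompt(passage: str) -> str:
    return (
        "Classify the focalization of the following literary passage as one of "
        f"{', '.join(LABELS)}. Answer with a single label.\n\nPassage:\n{passage}"
    )

def annotate(passage: str, call_llm) -> str:
    """call_llm: any function mapping a prompt string to the model's text reply."""
    reply = call_llm(build_prompt(passage)).strip().lower()
    return next((label for label in LABELS if label in reply), "unknown")

# Toy usage with a stub standing in for a real model call.
print(annotate("He did not know the letter had already arrived.", lambda p: "internal"))
```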
Jiahui Gao, Renjie Pi, Tianyang Han, Han Wu, Lanqing Hong, Lingpeng Kong, Xin Jiang, Zhenguo Li
The deployment of multimodal large language models (MLLMs) has demonstrated remarkable success in conversations involving visual inputs, thanks to the superior power of large language models (LLMs). These MLLMs are typically built on top of LLMs, with an image encoder that maps images into the LLM's token embedding space. However, the integration of the visual modality introduces a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs and prone to generating sensitive or harmful responses, even though the LLM has been trained on textual datasets to align with human values. In this paper, we first raise the question: "Do MLLMs possess safety awareness against malicious image inputs?" We find that after adding a principle specifying the safety requirement to the MLLM's input, the model's safety awareness is boosted. This phenomenon verifies that MLLMs do possess safety awareness toward image inputs; it is merely weakened by the modality gap. We then introduce a simple yet effective technique, termed CoCA, which amplifies the safety awareness of the MLLM by calibrating its output distribution. Our proposed strategy helps the model reclaim its original safety awareness without losing its original capabilities. We verify the effectiveness of our approach on both multimodal safety and understanding benchmarks.
"CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration". arXiv:2409.11365, 2024-09-17.
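The abstract above describes boosting safety awareness by adding a safety principle to the input and calibrating the output distribution. One plausible reading is a contrastive adjustment of next-token logits, sketched below; the adjustment rule and the coefficient alpha are assumptions, not the paper's exact calibration.

```python
# Hedged sketch: amplify the logit shift induced by prepending a safety principle.
# The adjustment rule and alpha are illustrative assumptions about "calibration".
import torch

def calibrated_logits(logits_plain: torch.Tensor,
                      logits_with_principle: torch.Tensor,
                      alpha: float = 1.0) -> torch.Tensor:
    """Push the next-token distribution further in the direction the safety
    principle already moves it: plain + (1 + alpha) * (principled - plain)."""
    return logits_plain + (1.0 + alpha) * (logits_with_principle - logits_plain)

# Toy usage over a 5-token vocabulary.
plain = torch.tensor([2.0, 1.0, 0.5, 0.0, -1.0])
with_principle = torch.tensor([1.5, 1.2, 0.5, 0.4, -1.0])
print(torch.softmax(calibrated_logits(plain, with_principle, alpha=0.5), dim=-1))
```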
Simon Yu, Liangyu Chen, Sara Ahmadian, Marzieh Fadaee
Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities. As instruction datasets proliferate, selecting optimal data for effective training becomes increasingly important. This work addresses the question: How can we determine the optimal subset of data for effective training? While existing research often emphasizes local criteria like instance quality for subset selection, we argue that a global approach focused on data diversity is more critical. Our method employs k-means clustering to ensure the selected subset effectively represents the full dataset. We propose an iterative refinement method inspired by active learning techniques to resample instances from clusters, reassessing each cluster's importance and sampling weight in every training iteration. This approach reduces the effect of outliers and automatically filters out clusters containing low-quality data. Through extensive evaluation across natural language reasoning, general world knowledge, code and math reasoning tasks, and by fine-tuning models from various families, we observe consistent improvements, achieving a 7% increase over random selection and a 3.8% improvement over state-of-the-art sampling methods. Our work highlights the significance of diversity-first sampling when finetuning LLMs to enhance performance across a broad array of evaluation tasks. Our code is available at https://github.com/for-ai/iterative-data-selection.
"Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement". arXiv:2409.11378, 2024-09-17.
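A minimal sketch of the diversity-first selection loop described above, assuming precomputed instance embeddings: cluster with k-means, sample per-cluster quotas, and reweight clusters across iterations. The reweighting rule here is a placeholder for the training-feedback signal the paper uses, and no quality filtering is shown.

```python
# Hedged sketch of diversity-centric data selection with iterative reweighting.
# Only the overall loop shape follows the abstract; the reweighting is a placeholder.
import numpy as np
from sklearn.cluster import KMeans

def select_diverse(embeddings: np.ndarray, budget: int, k: int = 8,
                   iterations: int = 3, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=k, random_state=seed, n_init="auto").fit(embeddings)
    labels = km.labels_
    weights = np.full(k, 1.0 / k)                    # start with uniform cluster weights
    selected = np.array([], dtype=int)
    for _ in range(iterations):
        quota = np.maximum(1, np.round(weights / weights.sum() * budget).astype(int))
        picks = [rng.choice(np.flatnonzero(labels == c),
                            size=min(quota[c], int(np.sum(labels == c))), replace=False)
                 for c in range(k)]
        selected = np.concatenate(picks)[:budget]
        # Placeholder importance update: clusters whose members sit far from their
        # centroid get more weight next round; the paper instead reassesses clusters
        # from training feedback and filters out low-quality ones.
        dists = np.linalg.norm(embeddings - km.cluster_centers_[labels], axis=1)
        weights = np.array([dists[labels == c].mean() for c in range(k)])
    return selected

# Toy usage on random embeddings.
subset = select_diverse(np.random.default_rng(1).normal(size=(500, 32)), budget=100)
print(len(subset))
```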
The surge of digital documents in various formats, including less standardized documents such as business reports and environmental assessments, underscores the growing importance of Document Understanding. While Large Language Models (LLMs) have showcased prowess across diverse natural language processing tasks, their direct application to Document Understanding remains a challenge. Previous research has demonstrated the utility of LLMs in this domain, yet their significant computational demands make them challenging to deploy effectively. Additionally, proprietary black-box LLMs often outperform their open-source counterparts, posing a barrier to widespread accessibility. In this paper, we leverage distillation methods to harness the power of large LLMs for document understanding while accommodating computational limitations. Specifically, we present a novel approach in which we distill document understanding knowledge from the proprietary LLM ChatGPT into FLAN-T5. Our methodology integrates labeling and curriculum-learning mechanisms to facilitate efficient knowledge transfer. This work contributes to the advancement of document understanding methodologies by offering a scalable solution that bridges the gap between resource-intensive LLMs and practical applications. Our findings underscore the potential of distillation techniques in facilitating the deployment of sophisticated language models in real-world scenarios, thereby fostering advancements in natural language processing and document comprehension.
"Leveraging Distillation Techniques for Document Understanding: A Case Study with FLAN-T5" (Marcel Lamott, Muhammad Armaghan Shakir). arXiv:2409.11282, 2024-09-17.
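A minimal sketch of the data side of such a distillation setup: label documents with a teacher, order the examples by a difficulty proxy, and fine-tune the student on progressively harder prefixes. The teacher_label, difficulty, and finetune_student callables are hypothetical stubs standing in for ChatGPT labeling and FLAN-T5 fine-tuning; the paper's actual labeling scheme and curriculum schedule are not specified here.

```python
# Hedged sketch of teacher labeling plus curriculum-ordered student fine-tuning.
from typing import Callable, Dict, List

def build_curriculum(documents: List[str],
                     teacher_label: Callable[[str], str],
                     difficulty: Callable[[str], float]) -> List[Dict[str, str]]:
    """Label each document with the teacher, then order examples easy-to-hard."""
    examples = [{"input": d, "target": teacher_label(d)} for d in documents]
    return sorted(examples, key=lambda ex: difficulty(ex["input"]))

def staged_finetune(examples: List[Dict[str, str]],
                    finetune_student: Callable[[List[Dict[str, str]]], None],
                    stages: int = 3) -> None:
    """Feed progressively larger, harder prefixes of the curriculum to the student."""
    for s in range(1, stages + 1):
        cutoff = len(examples) * s // stages
        finetune_student(examples[:cutoff])

# Toy usage with stubs in place of the teacher and student models.
docs = ["short report", "a somewhat longer business report",
        "a very long environmental assessment with many sections"]
curriculum = build_curriculum(docs, teacher_label=lambda d: d.upper(),
                              difficulty=lambda d: len(d))
staged_finetune(curriculum, finetune_student=lambda batch: print(len(batch), "examples"))
```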
Peizhuo Liu, Li Wang, Renqiang He, Haorui He, Lei Wang, Huadi Zheng, Jie Shi, Tong Xiao, Zhizheng Wu
In recent years, speech generation technology has advanced rapidly, fueled by generative models and large-scale training techniques. While these developments have enabled the production of high-quality synthetic speech, they have also raised concerns about the misuse of this technology, particularly for generating synthetic misinformation. Current research primarily focuses on distinguishing machine-generated speech from human-produced speech, but the more urgent challenge is detecting misinformation within spoken content. This task requires a thorough analysis of factors such as speaker identity, topic, and synthesis. To address this need, we conduct an initial investigation into synthetic spoken misinformation detection by introducing an open-source dataset, SpMis. SpMis includes speech synthesized from over 1,000 speakers across five common topics, utilizing state-of-the-art text-to-speech systems. Although our results show promising detection capabilities, they also reveal substantial challenges for practical implementation, underscoring the importance of ongoing research in this critical area.
"SpMis: An Investigation of Synthetic Spoken Misinformation Detection". arXiv:2409.11308, 2024-09-17.
Maojia Song, Shang Hong Sim, Rishabh Bhardwaj, Hai Leong Chieu, Navonil Majumder, Soujanya Poria
LLMs are an integral part of retrieval-augmented generation (RAG) systems. While many studies focus on evaluating the quality of end-to-end RAG systems, there is a lack of research on understanding the suitability of an LLM for the RAG task. Thus, we introduce a new metric, Trust-Score, that provides a holistic evaluation of the trustworthiness of LLMs in a RAG framework. We show that various prompting methods, such as in-context learning, fail to adapt LLMs effectively to the RAG task. Thus, we propose Trust-Align, a framework to align LLMs for a higher Trust-Score. LLaMA-3-8b, aligned with our method, significantly outperforms open-source LLMs of comparable size on ASQA (up 10.7 points), QAMPARI (up 29.2 points), and ELI5 (up 14.9 points). We release our code at: https://github.com/declare-lab/trust-align.
"Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse". arXiv:2409.11242, 2024-09-17.
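The abstract above does not define Trust-Score, so the sketch below only illustrates the kind of grounded-attribution and calibrated-refusal signals such a metric could aggregate; every function name and scoring rule here is hypothetical, not the paper's formulation.

```python
# Hypothetical illustration only: Trust-Score's actual definition is not given in the
# abstract, so this toy combines two plausible signals (grounding and calibrated refusal).
from typing import List

def is_grounded(answer: str, cited_passages: List[str]) -> bool:
    """Toy grounding check: every answer sentence shares words with some cited passage."""
    sentences = [s for s in answer.split(".") if s.strip()]
    return all(any(set(s.lower().split()) & set(p.lower().split()) for p in cited_passages)
               for s in sentences)

def toy_trust_signal(answer: str, cited_passages: List[str], answerable: bool) -> float:
    refused = answer.strip().lower().startswith("i cannot answer")
    if not answerable:
        return 1.0 if refused else 0.0           # reward refusing unanswerable questions
    if refused:
        return 0.0                               # penalize refusing answerable ones
    return 1.0 if is_grounded(answer, cited_passages) else 0.5

print(toy_trust_signal("Paris is the capital of France.",
                       ["France's capital is Paris."], answerable=True))
```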