In this paper, we evaluate the creative fiction writing abilities of a fine-tuned small language model (SLM), BART Large, and compare its performance to humans and two large language models (LLMs): GPT-3.5 and GPT-4o. Our evaluation consists of two experiments: (i) a human evaluation where readers assess the stories generated by the SLM compared to human-written stories, and (ii) a qualitative linguistic analysis comparing the textual characteristics of the stories generated by the different models. In the first experiment, we asked 68 participants to rate short stories generated by the models and humans along dimensions such as grammaticality, relevance, creativity, and attractiveness. BART Large outperformed human writers in most aspects, except creativity, with an overall score of 2.11 compared to 1.85 for human-written texts -- a 14% improvement. In the second experiment, the qualitative analysis revealed that, while GPT-4o exhibited near-perfect internal and external coherence, it tended to produce more predictable narratives, with only 3% of its stories seen as novel. In contrast, 15% of BART's stories were considered novel, indicating a higher degree of creativity despite its smaller model size. This study provides both quantitative and qualitative insights into how model size and fine-tuning influence the balance between creativity, fluency, and coherence in creative writing tasks.
"Small Language Models can Outperform Humans in Short Creative Writing: A Study Comparing SLMs with Humans and LLMs" (Guillermo Marco, Luz Rello, Julio Gonzalo). arXiv:2409.11547, 2024-09-17.
The development of unbiased large language models is widely recognized as crucial, yet existing benchmarks fall short in detecting biases due to limited scope, contamination, and the lack of a fairness baseline. SAGED(-Bias) is the first holistic benchmarking pipeline to address these problems. The pipeline encompasses five core stages: scraping materials, assembling benchmarks, generating responses, extracting numeric features, and diagnosing with disparity metrics. SAGED includes metrics for max disparity, such as impact ratio, and for bias concentration, such as Max Z-scores. Because assessment-tool bias and contextual bias in prompts can distort evaluation, SAGED implements counterfactual branching and baseline calibration to mitigate them. For demonstration, we apply SAGED to the G20 countries with popular 8B-level models, including Gemma2, Llama3.1, Mistral, and Qwen2. With sentiment analysis, we find that while Mistral and Qwen2 show lower max disparity and higher bias concentration than Gemma2 and Llama3.1, all models are notably biased against countries such as Russia and (except for Qwen2) China. In further experiments in which models role-play U.S. (vice-/former-) presidents, we observe that bias amplifies and shifts in heterogeneous directions. Moreover, Qwen2 and Mistral do not engage in role-playing, while Llama3.1 and Gemma2 role-play Trump notably more intensively than Biden and Harris, indicating role-playing performance bias in these models.
"SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with Customisable Fairness Calibration" (Xin Guan, Nathaniel Demchak, Saloni Gupta, Ze Wang, Ediz Ertekin Jr., Adriano Koshiyama, Emre Kazim, Zekun Wu). arXiv:2409.11149, 2024-09-17.
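As a rough illustration of the two metric families named in the SAGED abstract above, the sketch below computes an impact ratio and a max |z|-score over hypothetical per-country sentiment means. The group names, scores, and exact formulas are assumptions, not the pipeline's own implementation.

```python
# Hedged sketch: disparity metrics over per-group mean sentiment scores.
# The country names and scores are illustrative, not SAGED's actual outputs.
import statistics

sentiment_by_country = {          # hypothetical mean sentiment per group
    "Brazil": 0.62, "China": 0.41, "Germany": 0.58,
    "Russia": 0.33, "United States": 0.60,
}
scores = list(sentiment_by_country.values())

# Max disparity as an impact ratio: lowest group mean over highest group mean.
impact_ratio = min(scores) / max(scores)

# Bias concentration as a max |z|-score: how far the most extreme group sits
# from the cross-group mean, in units of the cross-group standard deviation.
mu, sigma = statistics.mean(scores), statistics.stdev(scores)
max_z = max(abs(s - mu) / sigma for s in scores)

print(f"impact ratio = {impact_ratio:.3f}, max |z| = {max_z:.3f}")
```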
Contextualized embeddings vary by context, even for the same token, and form a distribution in the embedding space. To analyze this distribution, we focus on the norm of the mean embedding and the variance of the embeddings. In this study, we first demonstrate that these values follow the well-known formula for variance in statistics and provide an efficient sequential computation method. Then, by observing embeddings from intermediate layers of several Transformer models, we find a strong trade-off relationship between the norm and the variance: as the mean embedding moves closer to the origin, the variance increases. This trade-off is likely influenced by the layer normalization mechanism used in Transformer models. Furthermore, when the sets of token embeddings are treated as clusters, we show that the variance of the entire embedding set can theoretically be decomposed into the within-cluster variance and the between-cluster variance. We find experimentally that as the layers of Transformer models deepen, the embeddings move farther from the origin, the between-cluster variance relatively decreases, and the within-cluster variance relatively increases. These results are consistent with existing studies on the anisotropy of the embedding spaces across layers.
"Norm of Mean Contextualized Embeddings Determines their Variance" (Hiroaki Yamagiwa, Hidetoshi Shimodaira). arXiv:2409.11253, 2024-09-17.
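The variance identities referenced in the abstract above can be checked numerically. The sketch below uses random vectors in place of contextualized embeddings and equal-sized clusters in place of tokens, so it illustrates the decomposition itself rather than the paper's measurements.

```python
# Hedged sketch: numerically verify the variance identities the abstract refers to.
# Random vectors stand in for embeddings; equal-sized clusters stand in for tokens.
import numpy as np

rng = np.random.default_rng(0)
clusters = [rng.normal(loc=rng.normal(size=16), scale=1.0, size=(50, 16))
            for _ in range(5)]                       # 5 "tokens", 50 embeddings each
X = np.vstack(clusters)

mu = X.mean(axis=0)
total_var = np.mean(np.sum((X - mu) ** 2, axis=1))   # E[||x - mu||^2]

# Identity 1: total variance = E[||x||^2] - ||mu||^2 (norm of the mean enters directly).
assert np.isclose(total_var, np.mean(np.sum(X ** 2, axis=1)) - np.sum(mu ** 2))

# Identity 2 (equal-sized clusters): total = within-cluster + between-cluster variance.
within = np.mean([np.mean(np.sum((C - C.mean(0)) ** 2, axis=1)) for C in clusters])
between = np.mean([np.sum((C.mean(0) - mu) ** 2) for C in clusters])
assert np.isclose(total_var, within + between)

print(total_var, within, between)
```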
Recent advances in large language models (LLMs) have spurred interest in speech-text multimodal foundation models, which achieve strong performance on instruction-based speech translation (ST). However, expanding the language pairs of an existing instruction-tuned ST system is costly because it requires re-training on a combination of new and previous datasets. We propose to expand to new language pairs by merging a model trained on the new language pairs with the existing model using task arithmetic. We find that directly applying task arithmetic to ST causes the merged model to fail to follow instructions and thus generate translations in the wrong language. To eliminate this language confusion, we propose an augmented task arithmetic method that additionally merges a language control model trained to generate the correct target-language token according to the instructions. Our experiments demonstrate that the proposed language control model achieves language expansion by eliminating language confusion. In our MuST-C and CoVoST-2 experiments, it improves BLEU by up to 4.66 and 4.92 points, respectively. In addition, we demonstrate that our task arithmetic framework can extend to a language pair for which neither paired ST training data nor a pre-trained ST model is available: we first synthesize an ST system from machine translation (MT) systems via task analogy, then merge the synthesized ST system into the existing ST model.
"Task Arithmetic for Language Expansion in Speech Translation" (Yao-Fei Cheng, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Wen Shen Teo, Siddhant Arora, Shinji Watanabe). arXiv:2409.11274, 2024-09-17.
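As a minimal sketch of the plain task-arithmetic step underlying the method above, the code below builds task vectors as parameter deltas from a shared base and adds them with a scaling coefficient. The model names, the coefficient lam, and the toy parameters are assumptions; the paper's language control model and task-analogy synthesis are not shown.

```python
# Hedged sketch of plain task arithmetic on parameter dictionaries.
from typing import Dict, List
import torch

def task_vector(finetuned: Dict[str, torch.Tensor],
                base: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    """Parameter delta of a fine-tuned model relative to the shared base."""
    return {k: finetuned[k] - base[k] for k in base}

def merge(base: Dict[str, torch.Tensor],
          task_vectors: List[Dict[str, torch.Tensor]],
          lam: float = 0.5) -> Dict[str, torch.Tensor]:
    """Add scaled task vectors to the base parameters."""
    merged = {k: v.clone() for k, v in base.items()}
    for tv in task_vectors:
        for k in merged:
            merged[k] += lam * tv[k]
    return merged

# Toy usage with random 2x2 "parameters" standing in for real checkpoints.
base = {"w": torch.zeros(2, 2)}
st_existing = {"w": torch.ones(2, 2)}        # existing ST model
st_new_pair = {"w": 2 * torch.ones(2, 2)}    # model trained on the new language pair
merged = merge(base, [task_vector(st_existing, base), task_vector(st_new_pair, base)])
print(merged["w"])
```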
Rebecca M. M. Hicke, Yuri Bizzoni, Pascale Feldkamp, Ross Deans Kristensen-McLachlan
Focalization, the perspective through which narrative is presented, is encoded via a wide range of lexico-grammatical features and is subject to reader interpretation. Moreover, trained readers regularly disagree on interpretations, suggesting that this problem may be computationally intractable. In this paper, we provide experiments to test how well contemporary Large Language Models (LLMs) perform when annotating literary texts for focalization mode. Despite the challenging nature of the task, LLMs show comparable performance to trained human annotators in our experiments. We provide a case study working with the novels of Stephen King to demonstrate the usefulness of this approach for computational literary studies, illustrating how focalization can be studied at scale.
"Says Who? Effective Zero-Shot Annotation of Focalization". arXiv:2409.11390, 2024-09-17.
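A minimal sketch of the kind of zero-shot annotation setup described above: build a classification prompt and map the model's reply to a focalization label. The label set, prompt wording, and the call_llm stub are illustrative placeholders rather than the authors' protocol.

```python
# Hedged sketch of zero-shot focalization annotation with an LLM.
# Labels, prompt text, and call_llm are hypothetical placeholders.
LABELS = ["internal", "external", "zero"]   # assumed focalization modes

def build_prompt(passage: str) -> str:
    return (
        "Classify the focalization of the following literary passage as one of "
        f"{', '.join(LABELS)}. Answer with a single label.\n\nPassage:\n{passage}"
    )

def annotate(passage: str, call_llm) -> str:
    """call_llm: any function mapping a prompt string to the model's text reply."""
    reply = call_llm(build_prompt(passage)).strip().lower()
    return next((label for label in LABELS if label in reply), "unknown")

# Toy usage with a stub standing in for a real model call.
print(annotate("He did not know the letter had already arrived.", lambda p: "internal"))
```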
Jiahui Gao, Renjie Pi, Tianyang Han, Han Wu, Lanqing Hong, Lingpeng Kong, Xin Jiang, Zhenguo Li
The deployment of multimodal large language models (MLLMs) has demonstrated remarkable success in conversations involving visual inputs, thanks to the superior power of large language models (LLMs). These MLLMs are typically built on top of LLMs, with an image encoder that maps images into the LLM's token embedding space. However, the integration of the visual modality introduces a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs and prone to generating sensitive or harmful responses, even though the LLM has been trained on textual datasets to align with human values. In this paper, we first raise the question: "Do MLLMs possess safety awareness against malicious image inputs?" We find that after adding a principle specifying the safety requirement to the MLLM's input, the model's safety awareness is boosted. This phenomenon verifies that MLLMs do possess safety awareness toward image inputs; it is merely weakened by the modality gap. We then introduce a simple yet effective technique, termed CoCA, which amplifies the safety awareness of the MLLM by calibrating its output distribution. Our proposed strategy helps the model reclaim its original safety awareness without losing its original capabilities. We verify the effectiveness of our approach on both multimodal safety and understanding benchmarks.
"CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration". arXiv:2409.11365, 2024-09-17.
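The abstract above describes boosting safety awareness by adding a safety principle to the input and calibrating the output distribution. One plausible reading is a contrastive adjustment of next-token logits, sketched below; the adjustment rule and the coefficient alpha are assumptions, not the paper's exact calibration.

```python
# Hedged sketch: amplify the logit shift induced by prepending a safety principle.
# The adjustment rule and alpha are illustrative assumptions about "calibration".
import torch

def calibrated_logits(logits_plain: torch.Tensor,
                      logits_with_principle: torch.Tensor,
                      alpha: float = 1.0) -> torch.Tensor:
    """Push the next-token distribution further in the direction the safety
    principle already moves it: plain + (1 + alpha) * (principled - plain)."""
    return logits_plain + (1.0 + alpha) * (logits_with_principle - logits_plain)

# Toy usage over a 5-token vocabulary.
plain = torch.tensor([2.0, 1.0, 0.5, 0.0, -1.0])
with_principle = torch.tensor([1.5, 1.2, 0.5, 0.4, -1.0])
print(torch.softmax(calibrated_logits(plain, with_principle, alpha=0.5), dim=-1))
```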
Simon Yu, Liangyu Chen, Sara Ahmadian, Marzieh Fadaee
Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities. As instruction datasets proliferate, selecting optimal data for effective training becomes increasingly important. This work addresses the question: How can we determine the optimal subset of data for effective training? While existing research often emphasizes local criteria like instance quality for subset selection, we argue that a global approach focused on data diversity is more critical. Our method employs k-means clustering to ensure the selected subset effectively represents the full dataset. We propose an iterative refinement method inspired by active learning techniques to resample instances from clusters, reassessing each cluster's importance and sampling weight in every training iteration. This approach reduces the effect of outliers and automatically filters out clusters containing low-quality data. Through extensive evaluation across natural language reasoning, general world knowledge, code and math reasoning tasks, and by fine-tuning models from various families, we observe consistent improvements, achieving a 7% increase over random selection and a 3.8% improvement over state-of-the-art sampling methods. Our work highlights the significance of diversity-first sampling when finetuning LLMs to enhance performance across a broad array of evaluation tasks. Our code is available at https://github.com/for-ai/iterative-data-selection.
"Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement". arXiv:2409.11378, 2024-09-17.
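A minimal sketch of the diversity-first selection loop described above, assuming precomputed instance embeddings: cluster with k-means, sample per-cluster quotas, and reweight clusters across iterations. The reweighting rule here is a placeholder for the training-feedback signal the paper uses, and no quality filtering is shown.

```python
# Hedged sketch of diversity-centric data selection with iterative reweighting.
# Only the overall loop shape follows the abstract; the reweighting is a placeholder.
import numpy as np
from sklearn.cluster import KMeans

def select_diverse(embeddings: np.ndarray, budget: int, k: int = 8,
                   iterations: int = 3, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=k, random_state=seed, n_init="auto").fit(embeddings)
    labels = km.labels_
    weights = np.full(k, 1.0 / k)                    # start with uniform cluster weights
    selected = np.array([], dtype=int)
    for _ in range(iterations):
        quota = np.maximum(1, np.round(weights / weights.sum() * budget).astype(int))
        picks = [rng.choice(np.flatnonzero(labels == c),
                            size=min(quota[c], int(np.sum(labels == c))), replace=False)
                 for c in range(k)]
        selected = np.concatenate(picks)[:budget]
        # Placeholder importance update: clusters whose members sit far from their
        # centroid get more weight next round; the paper instead reassesses clusters
        # from training feedback and filters out low-quality ones.
        dists = np.linalg.norm(embeddings - km.cluster_centers_[labels], axis=1)
        weights = np.array([dists[labels == c].mean() for c in range(k)])
    return selected

# Toy usage on random embeddings.
subset = select_diverse(np.random.default_rng(1).normal(size=(500, 32)), budget=100)
print(len(subset))
```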
The surge of digital documents in various formats, including less standardized documents such as business reports and environmental assessments, underscores the growing importance of Document Understanding. While Large Language Models (LLMs) have showcased prowess across diverse natural language processing tasks, their direct application to Document Understanding remains a challenge. Previous research has demonstrated the utility of LLMs in this domain, yet their significant computational demands make them challenging to deploy effectively. Additionally, proprietary black-box LLMs often outperform their open-source counterparts, posing a barrier to widespread accessibility. In this paper, we leverage distillation methods to harness the power of large LLMs for document understanding while accommodating computational limitations. Specifically, we present a novel approach in which we distill document understanding knowledge from the proprietary LLM ChatGPT into FLAN-T5. Our methodology integrates labeling and curriculum-learning mechanisms to facilitate efficient knowledge transfer. This work contributes to the advancement of document understanding methodologies by offering a scalable solution that bridges the gap between resource-intensive LLMs and practical applications. Our findings underscore the potential of distillation techniques in facilitating the deployment of sophisticated language models in real-world scenarios, thereby fostering advancements in natural language processing and document comprehension.
"Leveraging Distillation Techniques for Document Understanding: A Case Study with FLAN-T5" (Marcel Lamott, Muhammad Armaghan Shakir). arXiv:2409.11282, 2024-09-17.
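A minimal sketch of the data side of such a distillation setup: label documents with a teacher, order the examples by a difficulty proxy, and fine-tune the student on progressively harder prefixes. The teacher_label, difficulty, and finetune_student callables are hypothetical stubs standing in for ChatGPT labeling and FLAN-T5 fine-tuning; the paper's actual labeling scheme and curriculum schedule are not specified here.

```python
# Hedged sketch of teacher labeling plus curriculum-ordered student fine-tuning.
from typing import Callable, Dict, List

def build_curriculum(documents: List[str],
                     teacher_label: Callable[[str], str],
                     difficulty: Callable[[str], float]) -> List[Dict[str, str]]:
    """Label each document with the teacher, then order examples easy-to-hard."""
    examples = [{"input": d, "target": teacher_label(d)} for d in documents]
    return sorted(examples, key=lambda ex: difficulty(ex["input"]))

def staged_finetune(examples: List[Dict[str, str]],
                    finetune_student: Callable[[List[Dict[str, str]]], None],
                    stages: int = 3) -> None:
    """Feed progressively larger, harder prefixes of the curriculum to the student."""
    for s in range(1, stages + 1):
        cutoff = len(examples) * s // stages
        finetune_student(examples[:cutoff])

# Toy usage with stubs in place of the teacher and student models.
docs = ["short report", "a somewhat longer business report",
        "a very long environmental assessment with many sections"]
curriculum = build_curriculum(docs, teacher_label=lambda d: d.upper(),
                              difficulty=lambda d: len(d))
staged_finetune(curriculum, finetune_student=lambda batch: print(len(batch), "examples"))
```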
Peizhuo Liu, Li Wang, Renqiang He, Haorui He, Lei Wang, Huadi Zheng, Jie Shi, Tong Xiao, Zhizheng Wu
In recent years, speech generation technology has advanced rapidly, fueled by generative models and large-scale training techniques. While these developments have enabled the production of high-quality synthetic speech, they have also raised concerns about the misuse of this technology, particularly for generating synthetic misinformation. Current research primarily focuses on distinguishing machine-generated speech from human-produced speech, but the more urgent challenge is detecting misinformation within spoken content. This task requires a thorough analysis of factors such as speaker identity, topic, and synthesis. To address this need, we conduct an initial investigation into synthetic spoken misinformation detection by introducing an open-source dataset, SpMis. SpMis includes speech synthesized from over 1,000 speakers across five common topics, utilizing state-of-the-art text-to-speech systems. Although our results show promising detection capabilities, they also reveal substantial challenges for practical implementation, underscoring the importance of ongoing research in this critical area.
"SpMis: An Investigation of Synthetic Spoken Misinformation Detection". arXiv:2409.11308, 2024-09-17.
Maojia Song, Shang Hong Sim, Rishabh Bhardwaj, Hai Leong Chieu, Navonil Majumder, Soujanya Poria
LLMs are an integral part of retrieval-augmented generation (RAG) systems. While many studies focus on evaluating the quality of end-to-end RAG systems, there is a lack of research on understanding the suitability of an LLM for the RAG task. Thus, we introduce a new metric, Trust-Score, that provides a holistic evaluation of the trustworthiness of LLMs in a RAG framework. We show that various prompting methods, such as in-context learning, fail to adapt LLMs effectively to the RAG task. Thus, we propose Trust-Align, a framework to align LLMs for a higher Trust-Score. LLaMA-3-8b, aligned with our method, significantly outperforms open-source LLMs of comparable size on ASQA (up 10.7 points), QAMPARI (up 29.2 points), and ELI5 (up 14.9 points). We release our code at: https://github.com/declare-lab/trust-align.
"Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse". arXiv:2409.11242, 2024-09-17.
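The abstract above does not define Trust-Score, so the sketch below only illustrates the kind of grounded-attribution and calibrated-refusal signals such a metric could aggregate; every function name and scoring rule here is hypothetical, not the paper's formulation.

```python
# Hypothetical illustration only: Trust-Score's actual definition is not given in the
# abstract, so this toy combines two plausible signals (grounding and calibrated refusal).
from typing import List

def is_grounded(answer: str, cited_passages: List[str]) -> bool:
    """Toy grounding check: every answer sentence shares words with some cited passage."""
    sentences = [s for s in answer.split(".") if s.strip()]
    return all(any(set(s.lower().split()) & set(p.lower().split()) for p in cited_passages)
               for s in sentences)

def toy_trust_signal(answer: str, cited_passages: List[str], answerable: bool) -> float:
    refused = answer.strip().lower().startswith("i cannot answer")
    if not answerable:
        return 1.0 if refused else 0.0           # reward refusing unanswerable questions
    if refused:
        return 0.0                               # penalize refusing answerable ones
    return 1.0 if is_grounded(answer, cited_passages) else 0.5

print(toy_trust_signal("Paris is the capital of France.",
                       ["France's capital is Paris."], answerable=True))
```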