THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models (arXiv:2409.11353)
Mengfei Liang, Archish Arun, Zekun Wu, Cristian Munoz, Jonathan Lutch, Emre Kazim, Adriano Koshiyama, Philip Treleaven
Hallucination, the generation of factually incorrect content, is a growing challenge in Large Language Models (LLMs). Existing detection and mitigation methods are often isolated and insufficient for domain-specific needs, lacking a standardized pipeline. This paper introduces THaMES (Tool for Hallucination Mitigations and EvaluationS), an integrated framework and library addressing this gap. THaMES offers an end-to-end solution for evaluating and mitigating hallucinations in LLMs, featuring automated test set generation, multifaceted benchmarking, and adaptable mitigation strategies. It automates test set creation from any corpus, ensuring high data quality, diversity, and cost-efficiency through techniques like batch processing, weighted sampling, and counterfactual validation. THaMES assesses a model's ability to detect and reduce hallucinations across various tasks, including text generation and binary classification, applying optimal mitigation strategies like In-Context Learning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient Fine-tuning (PEFT). Evaluations of state-of-the-art LLMs using a knowledge base of academic papers, political news, and Wikipedia reveal that commercial models like GPT-4o benefit more from RAG than ICL, while open-weight models like Llama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT significantly enhances the performance of Llama-3.1-8B-Instruct in both evaluation tasks.
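To make the ICL/RAG distinction that THaMES evaluates concrete, here is a minimal sketch of the two prompting styles; the toy keyword retriever, prompt wording, and knowledge snippets are illustrative assumptions, not the THaMES API.

```python
# Minimal sketch of ICL vs. RAG prompt construction for hallucination
# mitigation. The retrieval scheme (keyword overlap) and prompt wording
# are illustrative assumptions, not the THaMES implementation.

KNOWLEDGE_BASE = [
    "THaMES generates test sets from any corpus using weighted sampling.",
    "Counterfactual validation filters low-quality generated questions.",
]

FEW_SHOT_EXAMPLES = [
    ("Q: What is 2+2?", "A: 4"),
    ("Q: Who wrote Hamlet?", "A: William Shakespeare"),
]

def retrieve(query: str, kb: list[str], k: int = 1) -> list[str]:
    """Toy retriever: rank KB passages by word overlap with the query."""
    words = set(query.lower().split())
    return sorted(kb, key=lambda p: -len(words & set(p.lower().split())))[:k]

def icl_prompt(question: str) -> str:
    """ICL: prepend worked examples so the model imitates their form."""
    shots = "\n".join(f"{q}\n{a}" for q, a in FEW_SHOT_EXAMPLES)
    return f"{shots}\nQ: {question}\nA:"

def rag_prompt(question: str) -> str:
    """RAG: prepend retrieved evidence and ask the model to stick to it."""
    evidence = "\n".join(retrieve(question, KNOWLEDGE_BASE))
    return f"Context:\n{evidence}\n\nAnswer using only the context.\nQ: {question}\nA:"

question = "How does THaMES build its test sets?"
print(icl_prompt(question))
print(rag_prompt(question))
```

The paper's finding is then about which kind of prefix, worked examples or retrieved evidence, a given model family exploits better.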
{"title":"THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models","authors":"Mengfei Liang, Archish Arun, Zekun Wu, Cristian Munoz, Jonathan Lutch, Emre Kazim, Adriano Koshiyama, Philip Treleaven","doi":"arxiv-2409.11353","DOIUrl":"https://doi.org/arxiv-2409.11353","url":null,"abstract":"Hallucination, the generation of factually incorrect content, is a growing\u0000challenge in Large Language Models (LLMs). Existing detection and mitigation\u0000methods are often isolated and insufficient for domain-specific needs, lacking\u0000a standardized pipeline. This paper introduces THaMES (Tool for Hallucination\u0000Mitigations and EvaluationS), an integrated framework and library addressing\u0000this gap. THaMES offers an end-to-end solution for evaluating and mitigating\u0000hallucinations in LLMs, featuring automated test set generation, multifaceted\u0000benchmarking, and adaptable mitigation strategies. It automates test set\u0000creation from any corpus, ensuring high data quality, diversity, and\u0000cost-efficiency through techniques like batch processing, weighted sampling,\u0000and counterfactual validation. THaMES assesses a model's ability to detect and\u0000reduce hallucinations across various tasks, including text generation and\u0000binary classification, applying optimal mitigation strategies like In-Context\u0000Learning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient\u0000Fine-tuning (PEFT). Evaluations of state-of-the-art LLMs using a knowledge base\u0000of academic papers, political news, and Wikipedia reveal that commercial models\u0000like GPT-4o benefit more from RAG than ICL, while open-weight models like\u0000Llama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT\u0000significantly enhances the performance of Llama-3.1-8B-Instruct in both\u0000evaluation tasks.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Diversity-grounded Channel Prototypical Learning for Out-of-Distribution Intent Detection (arXiv:2409.11114)
Bo Liu, Liming Zhan, Yujie Feng, Zexin Lu, Chengqiang Xie, Lei Xue, Xiao-Ming Wu, Albert Y. S. Lam
In the realm of task-oriented dialogue systems, a robust intent detection mechanism must effectively handle malformed utterances encountered in real-world scenarios. This study presents a novel fine-tuning framework for large language models (LLMs) aimed at enhancing in-distribution (ID) intent classification and out-of-distribution (OOD) intent detection, which utilizes semantic matching with prototypes derived from ID class names. By harnessing the highly distinguishable representations of LLMs, we construct semantic prototypes for each ID class using a diversity-grounded prompt tuning approach. We rigorously test our framework in a challenging OOD context, where ID and OOD classes are semantically close yet distinct, referred to as "near" OOD detection. For a thorough assessment, we benchmark our method against the prevalent fine-tuning approaches. The experimental findings reveal that our method demonstrates superior performance in both few-shot ID intent classification and near-OOD intent detection tasks.
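For intuition, the sketch below shows prototype-based near-OOD scoring in its generic form: one prototype per ID class, with an utterance flagged as OOD when its best cosine similarity falls below a threshold. The random toy embeddings and the threshold are assumptions; the paper instead derives prototypes from ID class names via diversity-grounded prompt tuning.

```python
import numpy as np

# Generic prototype-based OOD scoring: an utterance is OOD when its
# embedding is insufficiently similar to every ID class prototype.
# Embeddings and the threshold are toy stand-ins.

rng = np.random.default_rng(0)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy "LLM embeddings": 3 ID intent classes, 5 labelled examples each.
classes = ["book_flight", "check_balance", "play_music"]
id_examples = {c: normalize(rng.normal(size=(5, 8))) for c in classes}

# One prototype per ID class: the normalized mean of its examples.
prototypes = {c: normalize(v.mean(axis=0)) for c, v in id_examples.items()}

def classify(embedding, threshold=0.5):
    sims = {c: float(embedding @ p) for c, p in prototypes.items()}
    best = max(sims, key=sims.get)
    label = best if sims[best] >= threshold else "OOD"
    return label, sims[best]

print(classify(normalize(rng.normal(size=8))))
```

Near-OOD is hard precisely because OOD embeddings land close to some prototype, which is why the paper invests in making the prototypes diverse and well separated.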
LOLA -- An Open-Source Massively Multilingual Large Language Model (arXiv:2409.11272)
Nikit Srivastava, Denis Kuchelev, Tatiana Moteu, Kshitij Shetty, Michael Roeder, Diego Moussallem, Hamada Zahera, Axel-Cyrille Ngonga Ngomo
This paper presents LOLA, a massively multilingual large language model trained on more than 160 languages using a sparse Mixture-of-Experts Transformer architecture. Our architectural and implementation choices address the challenge of harnessing linguistic diversity while maintaining efficiency and avoiding the common pitfalls of multilinguality. Our analysis of the evaluation results shows competitive performance in natural language generation and understanding tasks. Additionally, we demonstrate how the learned expert-routing mechanism exploits implicit phylogenetic linguistic patterns to potentially alleviate the curse of multilinguality. We provide an in-depth look at the training process, an analysis of the datasets, and a balanced exploration of the model's strengths and limitations. As an open-source model, LOLA promotes reproducibility and serves as a robust foundation for future research. Our findings enable the development of compute-efficient multilingual models with strong, scalable performance across languages.
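The routing mechanism can be illustrated with a minimal top-k Mixture-of-Experts layer; the dimensions, k=2, and plain ReLU-MLP experts below are assumptions for illustration, not LOLA's actual configuration.

```python
import numpy as np

# Minimal sketch of sparse top-k Mixture-of-Experts routing: each token is
# processed by only k of n experts, selected by a learned gate.

rng = np.random.default_rng(0)
d_model, n_experts, k = 16, 8, 2

W_gate = rng.normal(size=(d_model, n_experts)) * 0.02
experts = [  # each expert: a tiny two-layer ReLU MLP
    (rng.normal(size=(d_model, 4 * d_model)) * 0.02,
     rng.normal(size=(4 * d_model, d_model)) * 0.02)
    for _ in range(n_experts)
]

def moe_layer(x):
    """Route each token to its top-k experts; mix outputs by gate weight."""
    logits = x @ W_gate                        # (tokens, n_experts)
    top = np.argsort(-logits, axis=-1)[:, :k]  # the k best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = np.exp(logits[t, top[t]])
        gates /= gates.sum()                   # softmax over the selected k
        for gate, e in zip(gates, top[t]):
            w1, w2 = experts[e]
            out[t] += gate * (np.maximum(x[t] @ w1, 0.0) @ w2)
    return out

print(moe_layer(rng.normal(size=(4, d_model))).shape)  # (4, 16)
```

The paper's observation is that such a gate, trained over more than 160 languages, learns routes that correlate with phylogenetic language families.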
{"title":"LOLA -- An Open-Source Massively Multilingual Large Language Model","authors":"Nikit Srivastava, Denis Kuchelev, Tatiana Moteu, Kshitij Shetty, Michael Roeder, Diego Moussallem, Hamada Zahera, Axel-Cyrille Ngonga Ngomo","doi":"arxiv-2409.11272","DOIUrl":"https://doi.org/arxiv-2409.11272","url":null,"abstract":"This paper presents LOLA, a massively multilingual large language model\u0000trained on more than 160 languages using a sparse Mixture-of-Experts\u0000Transformer architecture. Our architectural and implementation choices address\u0000the challenge of harnessing linguistic diversity while maintaining efficiency\u0000and avoiding the common pitfalls of multilinguality. Our analysis of the\u0000evaluation results shows competitive performance in natural language generation\u0000and understanding tasks. Additionally, we demonstrate how the learned\u0000expert-routing mechanism exploits implicit phylogenetic linguistic patterns to\u0000potentially alleviate the curse of multilinguality. We provide an in-depth look\u0000at the training process, an analysis of the datasets, and a balanced\u0000exploration of the model's strengths and limitations. As an open-source model,\u0000LOLA promotes reproducibility and serves as a robust foundation for future\u0000research. Our findings enable the development of compute-efficient multilingual\u0000models with strong, scalable performance across languages.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs (arXiv:2409.11404)
Basel Mousi, Nadir Durrani, Fatema Ahmad, Md. Arid Hasan, Maram Hasanain, Tameem Kabbani, Fahim Dalvi, Shammur Absar Chowdhury, Firoj Alam
Arabic, with its rich diversity of dialects, remains significantly underrepresented in Large Language Models, particularly in dialectal variations. We address this gap by introducing seven synthetic dialectal datasets alongside Modern Standard Arabic (MSA), created using Machine Translation (MT) combined with human post-editing. We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation. We evaluate LLMs on dialect comprehension and generation, focusing specifically on low-resource Arabic dialects. Additionally, we introduce the first-ever fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions, providing a novel dimension to LLM evaluation. Our findings demonstrate that while Arabic-specific models like Jais and AceGPT outperform multilingual models on dialectal tasks, significant challenges persist in dialect identification, generation, and translation. This work contributes ~45K post-edited samples and a cultural benchmark, and it highlights the importance of tailored training for improving LLM performance in capturing the nuances of diverse Arabic dialects and cultural contexts. We will release the dialectal translation models and benchmarks curated in this study.
{"title":"AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs","authors":"Basel Mousi, Nadir Durrani, Fatema Ahmad, Md. Arid Hasan, Maram Hasanain, Tameem Kabbani, Fahim Dalvi, Shammur Absar Chowdhury, Firoj Alam","doi":"arxiv-2409.11404","DOIUrl":"https://doi.org/arxiv-2409.11404","url":null,"abstract":"Arabic, with its rich diversity of dialects, remains significantly\u0000underrepresented in Large Language Models, particularly in dialectal\u0000variations. We address this gap by introducing seven synthetic datasets in\u0000dialects alongside Modern Standard Arabic (MSA), created using Machine\u0000Translation (MT) combined with human post-editing. We present AraDiCE, a\u0000benchmark for Arabic Dialect and Cultural Evaluation. We evaluate LLMs on\u0000dialect comprehension and generation, focusing specifically on low-resource\u0000Arabic dialects. Additionally, we introduce the first-ever fine-grained\u0000benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and\u0000Levant regions, providing a novel dimension to LLM evaluation. Our findings\u0000demonstrate that while Arabic-specific models like Jais and AceGPT outperform\u0000multilingual models on dialectal tasks, significant challenges persist in\u0000dialect identification, generation, and translation. This work contributes ~45K\u0000post-edited samples, a cultural benchmark, and highlights the importance of\u0000tailored training to improve LLM performance in capturing the nuances of\u0000diverse Arabic dialects and cultural contexts. We will release the dialectal\u0000translation models and benchmarks curated in this study.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ProSLM: A Prolog Synergized Language Model for Explainable Domain-Specific Knowledge-Based Question Answering (arXiv:2409.11589)
Priyesh Vakharia, Abigail Kufeldt, Max Meyers, Ian Lane, Leilani Gilpin
Neurosymbolic approaches can add robustness to opaque neural systems by incorporating explainable symbolic representations. However, previous approaches have not used formal logic to contextualize queries to, and validate outputs of, large language models (LLMs). We propose ProSLM, a novel neurosymbolic framework, to improve the robustness and reliability of LLMs in question-answering tasks. We provide ProSLM with a domain-specific knowledge base, a logical reasoning system, and an integration with an existing LLM. This framework has two capabilities: (1) context gathering, generating explainable and relevant context for a given query, and (2) validation, confirming the factual accuracy of a statement against a knowledge base (KB). Our work opens a new area of neurosymbolic generative AI text validation and user personalization.
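The two capabilities map naturally onto a logic knowledge base. Below is a rough sketch using a Prolog-style fact store written in plain Python; the fact schema and the single derivation rule are illustrative assumptions, not ProSLM's actual Prolog integration.

```python
# Sketch of (1) context gathering and (2) validation against a KB.
# Facts are (relation, subject, object) triples; grandparent/2 is derived
# the way a Prolog rule would derive it.

FACTS = {
    ("parent", "alice", "bob"),
    ("parent", "bob", "carol"),
}

def derive(kb):
    """Close the KB under one rule: grandparent(A,C) :- parent(A,B), parent(B,C)."""
    parents = {(a, b) for (rel, a, b) in kb if rel == "parent"}
    derived = {("grandparent", a, c)
               for (a, b) in parents for (b2, c) in parents if b == b2}
    return kb | derived

def gather_context(entity, kb):
    """Capability 1: collect explainable facts mentioning the queried entity."""
    return [f for f in derive(kb) if entity in f[1:]]

def validate(triple, kb):
    """Capability 2: accept a statement only if it is in, or derivable from, the KB."""
    return triple in derive(kb)

print(gather_context("alice", FACTS))
print(validate(("grandparent", "alice", "carol"), FACTS))  # True
print(validate(("grandparent", "carol", "alice"), FACTS))  # False
```

Because every accepted statement traces back to explicit facts and rules, the validation verdict is explainable by construction.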
{"title":"ProSLM : A Prolog Synergized Language Model for explainable Domain Specific Knowledge Based Question Answering","authors":"Priyesh Vakharia, Abigail Kufeldt, Max Meyers, Ian Lane, Leilani Gilpin","doi":"arxiv-2409.11589","DOIUrl":"https://doi.org/arxiv-2409.11589","url":null,"abstract":"Neurosymbolic approaches can add robustness to opaque neural systems by\u0000incorporating explainable symbolic representations. However, previous\u0000approaches have not used formal logic to contextualize queries to and validate\u0000outputs of large language models (LLMs). We propose systemname{}, a novel\u0000neurosymbolic framework, to improve the robustness and reliability of LLMs in\u0000question-answering tasks. We provide systemname{} with a domain-specific\u0000knowledge base, a logical reasoning system, and an integration to an existing\u0000LLM. This framework has two capabilities (1) context gathering: generating\u0000explainable and relevant context for a given query, and (2) validation:\u0000confirming and validating the factual accuracy of a statement in accordance\u0000with a knowledge base (KB). Our work opens a new area of neurosymbolic\u0000generative AI text validation and user personalization.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring ChatGPT-based Augmentation Strategies for Contrastive Aspect-based Sentiment Analysis (arXiv:2409.11218)
Lingling Xu, Haoran Xie, S. Joe Qin, Fu Lee Wang, Xiaohui Tao
Aspect-based sentiment analysis (ABSA) involves identifying sentiment towards specific aspect terms in a sentence, allowing us to uncover nuanced perspectives and attitudes on particular aspects of a product, service, or topic. However, the scarcity of labeled data poses a significant challenge to training high-quality models. To address this issue, we explore the potential of data augmentation using ChatGPT, a well-performing large language model (LLM), to enhance sentiment classification performance towards aspect terms. Specifically, we explore three ChatGPT-based data augmentation strategies: context-focused, aspect-focused, and context-aspect data augmentation. Context-focused data augmentation rephrases the context words in the sentence while keeping the aspect terms unchanged. In contrast, aspect-focused data augmentation replaces the aspect terms while keeping the context words unchanged. Context-aspect data augmentation combines the two to generate augmented samples. Furthermore, we incorporate contrastive learning into the ABSA task to improve performance. Extensive experiments show that all three data augmentation techniques lead to performance improvements, with the context-aspect strategy performing best and surpassing the baseline models.
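One plausible way to realize the three strategies is as instruction templates sent to ChatGPT; the prompt wording below is an assumption, and only the definition of what is held fixed versus rewritten comes from the paper.

```python
# Sketch of the three augmentation strategies as prompt templates.
# Responses from the LLM would become augmented ABSA training samples.

SENTENCE = "The battery life is great but the screen is dim."
ASPECTS = ["battery life", "screen"]

def context_focused(sentence, aspects):
    return (f"Rewrite the sentence, paraphrasing everything EXCEPT the "
            f"aspect terms {aspects}, which must appear verbatim:\n{sentence}")

def aspect_focused(sentence, aspects):
    return (f"Rewrite the sentence, replacing the aspect terms {aspects} "
            f"with comparable aspects while keeping all other words "
            f"unchanged:\n{sentence}")

def context_aspect(sentence, aspects):
    return (f"Rewrite the sentence, paraphrasing the context AND replacing "
            f"the aspect terms {aspects}, keeping the sentiment toward "
            f"each aspect the same:\n{sentence}")

for build in (context_focused, aspect_focused, context_aspect):
    print(build(SENTENCE, ASPECTS), end="\n\n")
```

The augmented/original pairs also provide natural positives for the contrastive-learning objective mentioned above.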
{"title":"Exploring ChatGPT-based Augmentation Strategies for Contrastive Aspect-based Sentiment Analysis","authors":"Lingling Xu, Haoran Xie, S. Joe Qin, Fu Lee Wang, Xiaohui Tao","doi":"arxiv-2409.11218","DOIUrl":"https://doi.org/arxiv-2409.11218","url":null,"abstract":"Aspect-based sentiment analysis (ABSA) involves identifying sentiment towards\u0000specific aspect terms in a sentence and allows us to uncover nuanced\u0000perspectives and attitudes on particular aspects of a product, service, or\u0000topic. However, the scarcity of labeled data poses a significant challenge to\u0000training high-quality models. To address this issue, we explore the potential\u0000of data augmentation using ChatGPT, a well-performing large language model\u0000(LLM), to enhance the sentiment classification performance towards aspect\u0000terms. Specifically, we explore three data augmentation strategies based on\u0000ChatGPT: context-focused, aspect-focused, and context-aspect data augmentation\u0000techniques. Context-focused data augmentation focuses on changing the word\u0000expression of context words in the sentence while keeping aspect terms\u0000unchanged. In contrast, aspect-focused data augmentation aims to change aspect\u0000terms but keep context words unchanged. Context-Aspect data augmentation\u0000integrates the above two data augmentations to generate augmented samples.\u0000Furthermore, we incorporate contrastive learning into the ABSA tasks to improve\u0000performance. Extensive experiments show that all three data augmentation\u0000techniques lead to performance improvements, with the context-aspect data\u0000augmentation strategy performing best and surpassing the performance of the\u0000baseline models.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zero-resource Hallucination Detection for Text Generation via Graph-based Contextual Knowledge Triples Modeling (arXiv:2409.11283)
Xinyue Fang, Zhen Huang, Zhiliang Tian, Minghui Fang, Ziyi Pan, Quntian Fang, Zhihua Wen, Hengyue Pan, Dongsheng Li
LLMs obtain remarkable performance but suffer from hallucinations. Most research on hallucination detection focuses on questions with short, concrete correct answers whose faithfulness is easy to check; detecting hallucinations in text generation with open-ended answers is more challenging. Some researchers use external knowledge to detect hallucinations in generated texts, but external resources for specific scenarios are hard to access. Recent studies on detecting hallucinations in long text without external resources compare the consistency of multiple sampled outputs. To handle long texts, they split each text into multiple facts and compare the consistency of each pair of facts individually. However, these methods (1) hardly achieve alignment among multiple facts and (2) overlook dependencies between contextual facts. In this paper, we propose graph-based context-aware (GCA) hallucination detection for text generation, which aligns knowledge facts and considers the dependencies between contextual knowledge triples in consistency comparison. In particular, to align multiple facts, we conduct a triple-oriented response segmentation to extract multiple knowledge triples. To model dependencies among contextual knowledge triples (facts), we construct the contextual triples into a graph and enhance the triples' interactions via message passing and aggregation with an RGCN. To avoid omitting knowledge triples in long text, we conduct an LLM-based reverse verification by reconstructing the knowledge triples. Experiments show that our model enhances hallucination detection and outperforms all baselines.
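Stripped of the graph machinery, the triple-level consistency idea looks like the sketch below; triple extraction is stubbed with canned outputs, and the RGCN message passing and LLM-based reverse verification are omitted.

```python
# Sketch of consistency comparison over knowledge triples: the answer's
# triples are checked against triples from independently sampled outputs.

def extract_triples(text_id):
    """Stub for triple-oriented response segmentation; in practice an LLM
    or IE model extracts (subject, relation, object) triples."""
    canned = {
        "answer":  {("Paris", "capital_of", "France"),
                    ("Paris", "population", "2.1M")},
        "sample1": {("Paris", "capital_of", "France"),
                    ("Paris", "population", "2.1M")},
        "sample2": {("Paris", "capital_of", "France"),
                    ("Paris", "population", "11M")},
    }
    return canned[text_id]

def consistency_score(answer_id, sample_ids):
    """Fraction of the answer's triples corroborated by each sample,
    averaged over samples; a low score suggests hallucination."""
    answer = extract_triples(answer_id)
    rates = [len(answer & extract_triples(s)) / len(answer)
             for s in sample_ids]
    return sum(rates) / len(rates)

print(consistency_score("answer", ["sample1", "sample2"]))  # 0.75
```

The paper's contribution is precisely what this sketch leaves out: aligning triples that are phrased differently, and letting neighbouring triples in the context graph inform each comparison.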
{"title":"Zero-resource Hallucination Detection for Text Generation via Graph-based Contextual Knowledge Triples Modeling","authors":"Xinyue Fang, Zhen Huang, Zhiliang Tian, Minghui Fang, Ziyi Pan, Quntian Fang, Zhihua Wen, Hengyue Pan, Dongsheng Li","doi":"arxiv-2409.11283","DOIUrl":"https://doi.org/arxiv-2409.11283","url":null,"abstract":"LLMs obtain remarkable performance but suffer from hallucinations. Most\u0000research on detecting hallucination focuses on the questions with short and\u0000concrete correct answers that are easy to check the faithfulness. Hallucination\u0000detections for text generation with open-ended answers are more challenging.\u0000Some researchers use external knowledge to detect hallucinations in generated\u0000texts, but external resources for specific scenarios are hard to access. Recent\u0000studies on detecting hallucinations in long text without external resources\u0000conduct consistency comparison among multiple sampled outputs. To handle long\u0000texts, researchers split long texts into multiple facts and individually\u0000compare the consistency of each pairs of facts. However, these methods (1)\u0000hardly achieve alignment among multiple facts; (2) overlook dependencies\u0000between multiple contextual facts. In this paper, we propose a graph-based\u0000context-aware (GCA) hallucination detection for text generations, which aligns\u0000knowledge facts and considers the dependencies between contextual knowledge\u0000triples in consistency comparison. Particularly, to align multiple facts, we\u0000conduct a triple-oriented response segmentation to extract multiple knowledge\u0000triples. To model dependencies among contextual knowledge triple (facts), we\u0000construct contextual triple into a graph and enhance triples' interactions via\u0000message passing and aggregating via RGCN. To avoid the omission of knowledge\u0000triples in long text, we conduct a LLM-based reverse verification via\u0000reconstructing the knowledge triples. Experiments show that our model enhances\u0000hallucination detection and excels all baselines.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chain-of-Thought Prompting for Speech Translation (arXiv:2409.11538)
Ke Hu, Zhehuai Chen, Chao-Han Huck Yang, Piotr Żelasko, Oleksii Hrinchuk, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg
Large language models (LLMs) have demonstrated remarkable advancements in language understanding and generation. Building on the success of text-based LLMs, recent research has adapted these models to use speech embeddings for prompting, resulting in Speech-LLM models that exhibit strong performance in automatic speech recognition (ASR) and automatic speech translation (AST). In this work, we propose a novel approach to leverage ASR transcripts as prompts for AST in a Speech-LLM built on an encoder-decoder text LLM. The Speech-LLM model consists of a speech encoder and a Megatron-T5 encoder-decoder. By first decoding speech to generate ASR transcripts and subsequently using these transcripts along with encoded speech for prompting, we guide the speech translation in a two-step process akin to chain-of-thought (CoT) prompting. Low-rank adaptation (LoRA) is used to adapt the T5 LLM and shows superior performance to full-model fine-tuning. Experimental results show that the proposed CoT prompting significantly improves AST performance, achieving an average increase of 2.4 BLEU points across 6 En->X or X->En AST tasks compared to speech prompting alone. Additionally, compared to a related CoT prediction method that predicts a concatenated sequence of ASR and AST transcripts, our method performs better by an average of 2 BLEU points.
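The two-step flow can be sketched as follows; the prompt format and the speech_llm stub are assumptions, and in the actual system the model consumes speech-encoder embeddings alongside the text prompt rather than a plain feature list.

```python
# Sketch of two-step CoT prompting for speech translation: step 1 produces
# an ASR transcript, step 2 translates while conditioning on both the
# speech and that transcript.

def speech_llm(prompt, speech_features):
    """Stand-in for the Speech-LLM (speech encoder + Megatron-T5)."""
    canned = {
        "transcribe": "wie ist das wetter heute",
        "translate": "what is the weather today",
    }
    return canned["translate" if "Translate" in prompt else "transcribe"]

speech = [0.1, -0.3, 0.7]  # placeholder for encoded speech features

# Step 1: decode the speech once to obtain an ASR transcript.
transcript = speech_llm("Transcribe the audio.", speech)

# Step 2: prompt again with the transcript as the intermediate "thought".
translation = speech_llm(
    f"Transcript: {transcript}\nTranslate the audio to English.", speech)

print(transcript)   # wie ist das wetter heute
print(translation)  # what is the weather today
```

Committing to a transcript before translating is what distinguishes this from the related method that predicts a concatenated ASR+AST sequence in a single pass.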
{"title":"Chain-of-Thought Prompting for Speech Translation","authors":"Ke Hu, Zhehuai Chen, Chao-Han Huck Yang, Piotr Żelasko, Oleksii Hrinchuk, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg","doi":"arxiv-2409.11538","DOIUrl":"https://doi.org/arxiv-2409.11538","url":null,"abstract":"Large language models (LLMs) have demonstrated remarkable advancements in\u0000language understanding and generation. Building on the success of text-based\u0000LLMs, recent research has adapted these models to use speech embeddings for\u0000prompting, resulting in Speech-LLM models that exhibit strong performance in\u0000automatic speech recognition (ASR) and automatic speech translation (AST). In\u0000this work, we propose a novel approach to leverage ASR transcripts as prompts\u0000for AST in a Speech-LLM built on an encoder-decoder text LLM. The Speech-LLM\u0000model consists of a speech encoder and an encoder-decoder structure\u0000Megatron-T5. By first decoding speech to generate ASR transcripts and\u0000subsequently using these transcripts along with encoded speech for prompting,\u0000we guide the speech translation in a two-step process like chain-of-thought\u0000(CoT) prompting. Low-rank adaptation (LoRA) is used for the T5 LLM for model\u0000adaptation and shows superior performance to full model fine-tuning.\u0000Experimental results show that the proposed CoT prompting significantly\u0000improves AST performance, achieving an average increase of 2.4 BLEU points\u0000across 6 En->X or X->En AST tasks compared to speech prompting alone.\u0000Additionally, compared to a related CoT prediction method that predicts a\u0000concatenated sequence of ASR and AST transcripts, our method performs better by\u0000an average of 2 BLEU points.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semformer: Transformer Language Models with Semantic Planning (arXiv:2409.11143)
Yongjing Yin, Junran Ding, Kai Song, Yue Zhang
Next-token prediction serves as the dominant component in current neural language models. During the training phase, the model employs teacher forcing, predicting tokens based on all preceding ground-truth tokens. However, this approach has been found to create shortcuts, utilizing the revealed prefix to spuriously fit future tokens and potentially compromising the accuracy of the next-token predictor. In this paper, we introduce Semformer, a novel method of training a Transformer language model that explicitly models the semantic planning of the response. Specifically, we incorporate a sequence of planning tokens into the prefix, guiding the planning-token representations to predict the latent semantic representations of the response, which are induced by an autoencoder. In a minimal planning task (graph path-finding), our model exhibits near-perfect performance and effectively mitigates shortcut learning, a feat that standard training methods and baseline models have been unable to accomplish. Furthermore, we pretrain Semformer from scratch with 125M parameters, demonstrating its efficacy through measures of perplexity, in-context learning, and fine-tuning on summarization tasks.
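One simple reading of the training signal is a standard next-token loss plus a regression of the planning-token hidden states onto the autoencoder's latents for the response; the dimensions, the MSE form, and the loss weighting below are illustrative assumptions.

```python
import numpy as np

# Sketch of the Semformer objective: hidden states at the planning tokens
# (inserted into the prefix) are pushed toward latent codes of the response
# produced by an autoencoder. All shapes and values are toy stand-ins.

rng = np.random.default_rng(0)
n_plan, d = 4, 16  # number of planning tokens, hidden size

plan_hidden = rng.normal(size=(n_plan, d))      # LM states at planning tokens
response_latent = rng.normal(size=(n_plan, d))  # autoencoder latents of response

def semformer_loss(lm_loss, plan_hidden, response_latent, alpha=1.0):
    """Next-token loss plus the planning regression term."""
    plan_loss = float(np.mean((plan_hidden - response_latent) ** 2))
    return lm_loss + alpha * plan_loss

print(semformer_loss(lm_loss=2.31, plan_hidden=plan_hidden,
                     response_latent=response_latent))
```

The point of the extra term is that the prefix must already encode a plan of the whole response, so the model cannot lean on shortcut continuations of the revealed ground truth.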
{"title":"Semformer: Transformer Language Models with Semantic Planning","authors":"Yongjing Yin, Junran Ding, Kai Song, Yue Zhang","doi":"arxiv-2409.11143","DOIUrl":"https://doi.org/arxiv-2409.11143","url":null,"abstract":"Next-token prediction serves as the dominant component in current neural\u0000language models. During the training phase, the model employs teacher forcing,\u0000which predicts tokens based on all preceding ground truth tokens. However, this\u0000approach has been found to create shortcuts, utilizing the revealed prefix to\u0000spuriously fit future tokens, potentially compromising the accuracy of the\u0000next-token predictor. In this paper, we introduce Semformer, a novel method of\u0000training a Transformer language model that explicitly models the semantic\u0000planning of response. Specifically, we incorporate a sequence of planning\u0000tokens into the prefix, guiding the planning token representations to predict\u0000the latent semantic representations of the response, which are induced by an\u0000autoencoder. In a minimal planning task (i.e., graph path-finding), our model\u0000exhibits near-perfect performance and effectively mitigates shortcut learning,\u0000a feat that standard training methods and baseline models have been unable to\u0000accomplish. Furthermore, we pretrain Semformer from scratch with 125M\u0000parameters, demonstrating its efficacy through measures of perplexity,\u0000in-context learning, and fine-tuning on summarization tasks.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Small Language Models can Outperform Humans in Short Creative Writing: A Study Comparing SLMs with Humans and LLMs (arXiv:2409.11547)
Guillermo Marco, Luz Rello, Julio Gonzalo
In this paper, we evaluate the creative fiction writing abilities of a fine-tuned small language model (SLM), BART Large, and compare its performance to humans and two large language models (LLMs): GPT-3.5 and GPT-4o. Our evaluation consists of two experiments: (i) a human evaluation where readers assess the stories generated by the SLM compared to human-written stories, and (ii) a qualitative linguistic analysis comparing the textual characteristics of the stories generated by the different models. In the first experiment, we asked 68 participants to rate short stories generated by the models and humans along dimensions such as grammaticality, relevance, creativity, and attractiveness. BART Large outperformed human writers in most aspects, except creativity, with an overall score of 2.11 compared to 1.85 for human-written texts -- a 14% improvement. In the second experiment, the qualitative analysis revealed that, while GPT-4o exhibited near-perfect internal and external coherence, it tended to produce more predictable narratives, with only 3% of its stories seen as novel. In contrast, 15% of BART's stories were considered novel, indicating a higher degree of creativity despite its smaller model size. This study provides both quantitative and qualitative insights into how model size and fine-tuning influence the balance between creativity, fluency, and coherence in creative writing tasks.