
Latest publications in arXiv - CS - Computation and Language

Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models
Pub Date : 2024-09-17 DOI: arxiv-2409.11233
Bishwash Khanal, Jeffery M. Capone
Large language models (LLMs) offer powerful capabilities but incur substantial computational costs, driving the need for efficient compression techniques. This study evaluates the impact of popular compression methods - Magnitude Pruning, SparseGPT, and Wanda - on the LLaMA-2-7B model, focusing on the trade-offs between model size reduction, downstream task performance, and the role of calibration data. Our findings reveal that while SparseGPT and Wanda preserve perplexity even at 50% sparsity, they suffer significant degradation on downstream tasks, highlighting the inadequacy of perplexity as the sole evaluation metric. To address this, we introduce Jensen-Shannon (JS) Divergence as a more comprehensive metric that captures nuanced changes in model behavior post-compression. We further demonstrate that task-specific calibration data significantly enhances the downstream performance of compressed models compared to general calibration data. This research underscores the necessity for diverse evaluation metrics and careful calibration data selection to fully understand the complexities of LLM compression and its implications for practical applications.
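The abstract does not include an implementation, but the Jensen-Shannon divergence it proposes is straightforward to compute from the next-token distributions of the original and compressed models. A minimal NumPy sketch (the toy distributions below are illustrative, not taken from the paper; in practice the divergence would be averaged over many token positions):

```python
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence between two discrete distributions.
    Base-2 logs are used, so the value is bounded in [0, 1]."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log2((p + eps) / (m + eps)))
    kl_qm = np.sum(q * np.log2((q + eps) / (m + eps)))
    return 0.5 * (kl_pm + kl_qm)

# Toy next-token distributions over a 5-token vocabulary: the original model
# vs. a pruned model that has drifted on the same context.
p_original = np.array([0.70, 0.15, 0.08, 0.05, 0.02])
p_pruned   = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
print(f"JS divergence: {js_divergence(p_original, p_pruned):.4f}")
```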
Citations: 0
LOLA -- An Open-Source Massively Multilingual Large Language Model
Pub Date : 2024-09-17 DOI: arxiv-2409.11272
Nikit Srivastava, Denis Kuchelev, Tatiana Moteu, Kshitij Shetty, Michael Roeder, Diego Moussallem, Hamada Zahera, Axel-Cyrille Ngonga Ngomo
This paper presents LOLA, a massively multilingual large language model trained on more than 160 languages using a sparse Mixture-of-Experts Transformer architecture. Our architectural and implementation choices address the challenge of harnessing linguistic diversity while maintaining efficiency and avoiding the common pitfalls of multilinguality. Our analysis of the evaluation results shows competitive performance in natural language generation and understanding tasks. Additionally, we demonstrate how the learned expert-routing mechanism exploits implicit phylogenetic linguistic patterns to potentially alleviate the curse of multilinguality. We provide an in-depth look at the training process, an analysis of the datasets, and a balanced exploration of the model's strengths and limitations. As an open-source model, LOLA promotes reproducibility and serves as a robust foundation for future research. Our findings enable the development of compute-efficient multilingual models with strong, scalable performance across languages.
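The abstract does not detail the architecture beyond naming a sparse Mixture-of-Experts Transformer; the generic top-k expert-routing layer such models build on can be sketched as follows (a PyTorch illustration with arbitrary toy sizes, not LOLA's actual configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Generic sparse Mixture-of-Experts feed-forward layer with top-k token routing."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)          # router: scores every expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (tokens, d_model)
        scores = self.gate(x)                               # (tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)            # renormalise over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                          # send each token to its k experts
            for e in range(len(self.experts)):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

tokens = torch.randn(8, 64)                                 # 8 tokens, d_model=64 (toy sizes)
print(TopKMoE(d_model=64, d_ff=256, n_experts=4, k=2)(tokens).shape)
```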
Citations: 0
AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs
Pub Date : 2024-09-17 DOI: arxiv-2409.11404
Basel Mousi, Nadir Durrani, Fatema Ahmad, Md. Arid Hasan, Maram Hasanain, Tameem Kabbani, Fahim Dalvi, Shammur Absar Chowdhury, Firoj Alam
Arabic, with its rich diversity of dialects, remains significantly underrepresented in Large Language Models, particularly in dialectal variations. We address this gap by introducing seven synthetic datasets in dialects alongside Modern Standard Arabic (MSA), created using Machine Translation (MT) combined with human post-editing. We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation. We evaluate LLMs on dialect comprehension and generation, focusing specifically on low-resource Arabic dialects. Additionally, we introduce the first-ever fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions, providing a novel dimension to LLM evaluation. Our findings demonstrate that while Arabic-specific models like Jais and AceGPT outperform multilingual models on dialectal tasks, significant challenges persist in dialect identification, generation, and translation. This work contributes ~45K post-edited samples, a cultural benchmark, and highlights the importance of tailored training to improve LLM performance in capturing the nuances of diverse Arabic dialects and cultural contexts. We will release the dialectal translation models and benchmarks curated in this study.
Citations: 0
Diversity-grounded Channel Prototypical Learning for Out-of-Distribution Intent Detection
Pub Date : 2024-09-17 DOI: arxiv-2409.11114
Bo Liu, Liming Zhan, Yujie Feng, Zexin Lu, Chengqiang Xie, Lei Xue, Xiao-Ming Wu, Albert Y. S. Lam
In the realm of task-oriented dialogue systems, a robust intent detection mechanism must effectively handle malformed utterances encountered in real-world scenarios. This study presents a novel fine-tuning framework for large language models (LLMs) aimed at enhancing in-distribution (ID) intent classification and out-of-distribution (OOD) intent detection, which utilizes semantic matching with prototypes derived from ID class names. By harnessing the highly distinguishable representations of LLMs, we construct semantic prototypes for each ID class using a diversity-grounded prompt tuning approach. We rigorously test our framework in a challenging OOD context, where ID and OOD classes are semantically close yet distinct, referred to as "near" OOD detection. For a thorough assessment, we benchmark our method against the prevalent fine-tuning approaches. The experimental findings reveal that our method demonstrates superior performance in both few-shot ID intent classification and near-OOD intent detection tasks.
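As a rough illustration of the prototype-matching idea (not the authors' code): embed each in-distribution class name, treat the embedding as that class's prototype, and flag an utterance as OOD when its best cosine similarity to any prototype falls below a threshold. The toy vectors and threshold below are placeholders for real LLM representations:

```python
import numpy as np

def near_ood_detect(utterance_emb, prototypes, labels, threshold=0.7):
    """Assign the nearest ID-class prototype by cosine similarity,
    or return 'OOD' if the best match falls below the threshold."""
    u = utterance_emb / np.linalg.norm(utterance_emb)
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = P @ u                       # cosine similarity to every class prototype
    best = int(np.argmax(sims))
    return (labels[best] if sims[best] >= threshold else "OOD"), float(sims[best])

# Toy 4-dim embeddings standing in for LLM representations of class names.
prototypes = np.array([[0.9, 0.1, 0.0, 0.1],    # "book_flight"
                       [0.1, 0.9, 0.1, 0.0]])   # "cancel_booking"
labels = ["book_flight", "cancel_booking"]
print(near_ood_detect(np.array([0.8, 0.2, 0.1, 0.1]), prototypes, labels))  # ID
print(near_ood_detect(np.array([0.1, 0.1, 0.9, 0.3]), prototypes, labels))  # OOD
```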
Citations: 0
ProSLM : A Prolog Synergized Language Model for explainable Domain Specific Knowledge Based Question Answering
Pub Date : 2024-09-17 DOI: arxiv-2409.11589
Priyesh Vakharia, Abigail Kufeldt, Max Meyers, Ian Lane, Leilani Gilpin
Neurosymbolic approaches can add robustness to opaque neural systems by incorporating explainable symbolic representations. However, previous approaches have not used formal logic to contextualize queries to and validate outputs of large language models (LLMs). We propose ProSLM, a novel neurosymbolic framework, to improve the robustness and reliability of LLMs in question-answering tasks. We provide ProSLM with a domain-specific knowledge base, a logical reasoning system, and an integration to an existing LLM. This framework has two capabilities: (1) context gathering: generating explainable and relevant context for a given query, and (2) validation: confirming and validating the factual accuracy of a statement in accordance with a knowledge base (KB). Our work opens a new area of neurosymbolic generative AI text validation and user personalization.
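A minimal sketch of the validation capability, with a Python set of ground facts standing in for the Prolog knowledge base; the facts and the pre-parsed triple format are purely illustrative:

```python
# Toy knowledge base of (subject, relation, object) ground facts.
KB = {
    ("aspirin", "treats", "headache"),
    ("aspirin", "class_of", "nsaid"),
}

def validate(triple: tuple[str, str, str]) -> str:
    """Check a candidate statement against the knowledge base."""
    return "supported" if triple in KB else "unverified"

# Hypothetical triples already parsed out of an LLM answer.
print(validate(("aspirin", "treats", "headache")))   # -> supported
print(validate(("aspirin", "treats", "fracture")))   # -> unverified
```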
Citations: 0
Exploring ChatGPT-based Augmentation Strategies for Contrastive Aspect-based Sentiment Analysis
Pub Date : 2024-09-17 DOI: arxiv-2409.11218
Lingling Xu, Haoran Xie, S. Joe Qin, Fu Lee Wang, Xiaohui Tao
Aspect-based sentiment analysis (ABSA) involves identifying sentiment towards specific aspect terms in a sentence and allows us to uncover nuanced perspectives and attitudes on particular aspects of a product, service, or topic. However, the scarcity of labeled data poses a significant challenge to training high-quality models. To address this issue, we explore the potential of data augmentation using ChatGPT, a well-performing large language model (LLM), to enhance the sentiment classification performance towards aspect terms. Specifically, we explore three data augmentation strategies based on ChatGPT: context-focused, aspect-focused, and context-aspect data augmentation techniques. Context-focused data augmentation focuses on changing the word expression of context words in the sentence while keeping aspect terms unchanged. In contrast, aspect-focused data augmentation aims to change aspect terms but keep context words unchanged. Context-aspect data augmentation integrates the above two data augmentations to generate augmented samples. Furthermore, we incorporate contrastive learning into the ABSA tasks to improve performance. Extensive experiments show that all three data augmentation techniques lead to performance improvements, with the context-aspect data augmentation strategy performing best and surpassing the performance of the baseline models.
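The prompts themselves are not given in the abstract; the two single-focus strategies could be phrased along the following lines (the wording is a guess, and the actual ChatGPT call is omitted):

```python
# Illustrative prompt templates for the two single-focus augmentation strategies.
CONTEXT_FOCUSED = (
    "Rewrite the sentence by paraphrasing the context words while keeping the "
    "aspect term '{aspect}' exactly as it is:\n{sentence}"
)
ASPECT_FOCUSED = (
    "Rewrite the sentence by replacing the aspect term '{aspect}' with a "
    "different but plausible aspect term, keeping all other words unchanged:\n{sentence}"
)

def build_prompts(sentence: str, aspect: str) -> dict[str, str]:
    """Return both augmentation prompts for one labeled (sentence, aspect) pair."""
    return {
        "context_focused": CONTEXT_FOCUSED.format(aspect=aspect, sentence=sentence),
        "aspect_focused": ASPECT_FOCUSED.format(aspect=aspect, sentence=sentence),
    }

prompts = build_prompts("The battery life is amazing but the screen is dim.", "battery life")
for name, prompt in prompts.items():
    print(f"--- {name} ---\n{prompt}\n")
```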
Citations: 0
Zero-resource Hallucination Detection for Text Generation via Graph-based Contextual Knowledge Triples Modeling
Pub Date : 2024-09-17 DOI: arxiv-2409.11283
Xinyue Fang, Zhen Huang, Zhiliang Tian, Minghui Fang, Ziyi Pan, Quntian Fang, Zhihua Wen, Hengyue Pan, Dongsheng Li
LLMs obtain remarkable performance but suffer from hallucinations. Most research on detecting hallucination focuses on questions with short and concrete correct answers whose faithfulness is easy to check. Hallucination detection for text generation with open-ended answers is more challenging. Some researchers use external knowledge to detect hallucinations in generated texts, but external resources for specific scenarios are hard to access. Recent studies on detecting hallucinations in long text without external resources conduct consistency comparisons among multiple sampled outputs. To handle long texts, researchers split them into multiple facts and individually compare the consistency of each pair of facts. However, these methods (1) hardly achieve alignment among multiple facts; (2) overlook dependencies between multiple contextual facts. In this paper, we propose graph-based context-aware (GCA) hallucination detection for text generation, which aligns knowledge facts and considers the dependencies between contextual knowledge triples in consistency comparison. In particular, to align multiple facts, we conduct a triple-oriented response segmentation to extract multiple knowledge triples. To model dependencies among contextual knowledge triples (facts), we construct the contextual triples into a graph and enhance the triples' interactions via message passing and aggregation with an RGCN. To avoid the omission of knowledge triples in long text, we conduct an LLM-based reverse verification by reconstructing the knowledge triples. Experiments show that our model enhances hallucination detection and outperforms all baselines.
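A heavily simplified sketch of the consistency-comparison step only: score each knowledge triple from the response by how often it recurs in triples extracted from independently sampled outputs. The graph construction, RGCN message passing, and LLM-based reverse verification described above are omitted, and the triples are invented for the demo:

```python
def support_score(fact: tuple[str, str, str],
                  sampled_triples: list[set[tuple[str, str, str]]]) -> float:
    """Fraction of sampled outputs whose extracted triples contain this fact."""
    return sum(fact in s for s in sampled_triples) / len(sampled_triples)

# Triples extracted from the response to be checked (second one is wrong).
response_triples = {("marie curie", "won", "nobel prize"),
                    ("marie curie", "born_in", "paris")}
# Triples extracted from three independently sampled responses.
samples = [{("marie curie", "won", "nobel prize"), ("marie curie", "born_in", "warsaw")},
           {("marie curie", "won", "nobel prize")},
           {("marie curie", "born_in", "warsaw")}]

for fact in sorted(response_triples):
    score = support_score(fact, samples)
    flag = "likely hallucination" if score < 0.5 else "consistent"
    print(f"{fact}: support={score:.2f} -> {flag}")
```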
Citations: 0
Chain-of-Thought Prompting for Speech Translation
Pub Date : 2024-09-17 DOI: arxiv-2409.11538
Ke Hu, Zhehuai Chen, Chao-Han Huck Yang, Piotr Żelasko, Oleksii Hrinchuk, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg
Large language models (LLMs) have demonstrated remarkable advancements in language understanding and generation. Building on the success of text-based LLMs, recent research has adapted these models to use speech embeddings for prompting, resulting in Speech-LLM models that exhibit strong performance in automatic speech recognition (ASR) and automatic speech translation (AST). In this work, we propose a novel approach to leverage ASR transcripts as prompts for AST in a Speech-LLM built on an encoder-decoder text LLM. The Speech-LLM model consists of a speech encoder and an encoder-decoder structure, Megatron-T5. By first decoding speech to generate ASR transcripts and subsequently using these transcripts along with encoded speech for prompting, we guide the speech translation in a two-step process like chain-of-thought (CoT) prompting. Low-rank adaptation (LoRA) is used for the T5 LLM for model adaptation and shows superior performance to full model fine-tuning. Experimental results show that the proposed CoT prompting significantly improves AST performance, achieving an average increase of 2.4 BLEU points across 6 En->X or X->En AST tasks compared to speech prompting alone. Additionally, compared to a related CoT prediction method that predicts a concatenated sequence of ASR and AST transcripts, our method performs better by an average of 2 BLEU points.
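A text-only sketch of the two-step flow described above: first obtain an ASR transcript, then prompt for the translation with that transcript included. In the paper both steps run inside one Speech-LLM that also consumes the encoded audio; here ask_llm() is a placeholder returning canned strings, and the prompt wording is not the paper's:

```python
def ask_llm(prompt: str) -> str:
    """Stand-in for a Speech-LLM / LLM call; returns canned text for the demo."""
    canned = {
        "asr": "ich habe heute keine zeit",
        "ast": "I do not have time today",
    }
    return canned["asr"] if "Transcribe" in prompt else canned["ast"]

def translate_with_cot(audio_tag: str, target_lang: str = "English") -> str:
    transcript = ask_llm(f"Transcribe the speech: {audio_tag}")          # step 1: ASR
    return ask_llm(                                                      # step 2: AST, conditioned on step 1
        f"Speech: {audio_tag}\nTranscript: {transcript}\n"
        f"Translate the transcript into {target_lang}:"
    )

print(translate_with_cot("<audio_000123>"))
```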
Citations: 0
Semformer: Transformer Language Models with Semantic Planning
Pub Date : 2024-09-17 DOI: arxiv-2409.11143
Yongjing Yin, Junran Ding, Kai Song, Yue Zhang
Next-token prediction serves as the dominant component in current neural language models. During the training phase, the model employs teacher forcing, which predicts tokens based on all preceding ground truth tokens. However, this approach has been found to create shortcuts, utilizing the revealed prefix to spuriously fit future tokens, potentially compromising the accuracy of the next-token predictor. In this paper, we introduce Semformer, a novel method of training a Transformer language model that explicitly models the semantic planning of the response. Specifically, we incorporate a sequence of planning tokens into the prefix, guiding the planning token representations to predict the latent semantic representations of the response, which are induced by an autoencoder. In a minimal planning task (i.e., graph path-finding), our model exhibits near-perfect performance and effectively mitigates shortcut learning, a feat that standard training methods and baseline models have been unable to accomplish. Furthermore, we pretrain Semformer from scratch with 125M parameters, demonstrating its efficacy through measures of perplexity, in-context learning, and fine-tuning on summarization tasks.
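The training objective sketched in the abstract can be written schematically as the usual next-token cross-entropy plus a regression term that pushes the planning-token states toward the autoencoder's latent representation of the response. The tensor shapes, the MSE choice, and the weighting below are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def semformer_style_loss(token_logits, target_ids, plan_hidden, latent_targets, alpha=1.0):
    """Schematic combined objective: next-token cross-entropy on ordinary tokens
    plus a regression term aligning planning-token states with autoencoder latents."""
    lm_loss = F.cross_entropy(token_logits.flatten(0, 1), target_ids.flatten())
    plan_loss = F.mse_loss(plan_hidden, latent_targets)   # predict the response's latent plan
    return lm_loss + alpha * plan_loss

# Toy shapes: batch=2, seq=5, vocab=11, 3 planning tokens with hidden size 8.
logits = torch.randn(2, 5, 11)
targets = torch.randint(0, 11, (2, 5))
plan_states = torch.randn(2, 3, 8)
latents = torch.randn(2, 3, 8)          # stand-in for latents produced by the autoencoder
print(semformer_style_loss(logits, targets, plan_states, latents))
```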
Citations: 0
Improving the Efficiency of Visually Augmented Language Models
Pub Date : 2024-09-17 DOI: arxiv-2409.11148
Paula Ontalvilla, Aitor Ormazabal, Gorka Azkune
Despite the impressive performance of autoregressive Language Models (LM), it has been shown that, due to reporting bias, LMs lack visual knowledge, i.e. they do not know much about the visual world and its properties. To augment LMs with visual knowledge, existing solutions often rely on explicit images, requiring time-consuming retrieval or image generation systems. This paper shows that explicit images are not necessary to visually augment an LM. Instead, we use visually-grounded text representations obtained from the well-known CLIP multimodal system. For a fair comparison, we modify VALM, a visually-augmented LM which uses image retrieval and representation, to work directly with visually-grounded text representations. We name this new model BLIND-VALM. We show that BLIND-VALM performs on par with VALM for Visual Language Understanding (VLU), Natural Language Understanding (NLU) and Language Modeling tasks, despite being significantly more efficient and simpler. We also show that, when scaling up our model within the compute budget of VALM by increasing either the model or the pre-training corpus size, we outperform VALM on all the evaluation tasks.
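One way to obtain such visually-grounded text representations is CLIP's text encoder via Hugging Face transformers; the checkpoint below is a common public CLIP model, not necessarily the one used for BLIND-VALM, and how the embeddings are wired into the LM is not shown:

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

name = "openai/clip-vit-base-patch32"          # common public CLIP checkpoint
model = CLIPModel.from_pretrained(name)
tokenizer = CLIPTokenizer.from_pretrained(name)

sentences = ["a red apple on a wooden table", "bananas are yellow"]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    text_feats = model.get_text_features(**inputs)   # (batch, 512) grounded text embeddings

print(text_feats.shape)
```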
Citations: 0