Yi Lu, Jing Nathan Yan, Songlin Yang, Justin T. Chiu, Siyu Ren, Fei Yuan, Wenting Zhao, Zhiyong Wu, Alexander M. Rush
Broad textual understanding and in-context learning require language models that utilize full document contexts. Due to the implementation challenges associated with directly training long-context models, many methods have been proposed for extending models to handle long contexts. However, owing to differences in data and model classes, it has been challenging to compare these approaches, leading to uncertainty as to how to evaluate long-context performance and whether it differs from standard evaluation. We implement a controlled protocol for extension methods with a standardized evaluation, utilizing consistent base models and extension data. Our study yields several insights into long-context behavior. First, we reaffirm the critical role of perplexity as a general-purpose performance indicator even in longer-context tasks. Second, we find that current approximate attention methods systematically underperform across long-context tasks. Finally, we confirm that exact fine-tuning based methods are generally effective within the range of their extension, whereas extrapolation remains challenging. All codebases, models, and checkpoints will be made available open-source, promoting transparency and facilitating further research in this critical area of AI development.
{"title":"A Controlled Study on Long Context Extension and Generalization in LLMs","authors":"Yi Lu, Jing Nathan Yan, Songlin Yang, Justin T. Chiu, Siyu Ren, Fei Yuan, Wenting Zhao, Zhiyong Wu, Alexander M. Rush","doi":"arxiv-2409.12181","DOIUrl":"https://doi.org/arxiv-2409.12181","url":null,"abstract":"Broad textual understanding and in-context learning require language models\u0000that utilize full document contexts. Due to the implementation challenges\u0000associated with directly training long-context models, many methods have been\u0000proposed for extending models to handle long contexts. However, owing to\u0000differences in data and model classes, it has been challenging to compare these\u0000approaches, leading to uncertainty as to how to evaluate long-context\u0000performance and whether it differs from standard evaluation. We implement a\u0000controlled protocol for extension methods with a standardized evaluation,\u0000utilizing consistent base models and extension data. Our study yields several\u0000insights into long-context behavior. First, we reaffirm the critical role of\u0000perplexity as a general-purpose performance indicator even in longer-context\u0000tasks. Second, we find that current approximate attention methods\u0000systematically underperform across long-context tasks. Finally, we confirm that\u0000exact fine-tuning based methods are generally effective within the range of\u0000their extension, whereas extrapolation remains challenging. 
All codebases,\u0000models, and checkpoints will be made available open-source, promoting\u0000transparency and facilitating further research in this critical area of AI\u0000development.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hideo Kobayashi, Wuwei Lan, Peng Shi, Shuaichen Chang, Jiang Guo, Henghui Zhu, Zhiguo Wang, Patrick Ng
While significant progress has been made on the text-to-SQL task, recent solutions repeatedly encode the same database schema for every question, resulting in unnecessarily high inference cost and often overlooking crucial database knowledge. To address these issues, we propose You Only Read Once (YORO), a novel paradigm that directly internalizes database knowledge into the parametric knowledge of a text-to-SQL model during training and eliminates the need for schema encoding during inference. YORO significantly reduces the input token length by 66%-98%. Despite its shorter inputs, our empirical results demonstrate that YORO performs competitively with traditional systems on three benchmarks and significantly outperforms them on large databases. Furthermore, YORO excels in handling questions with challenging value retrievals, such as abbreviations.
{"title":"You Only Read Once (YORO): Learning to Internalize Database Knowledge for Text-to-SQL","authors":"Hideo Kobayashi, Wuwei Lan, Peng Shi, Shuaichen Chang, Jiang Guo, Henghui Zhu, Zhiguo Wang, Patrick Ng","doi":"arxiv-2409.12172","DOIUrl":"https://doi.org/arxiv-2409.12172","url":null,"abstract":"While significant progress has been made on the text-to-SQL task, recent\u0000solutions repeatedly encode the same database schema for every question,\u0000resulting in unnecessary high inference cost and often overlooking crucial\u0000database knowledge. To address these issues, we propose You Only Read Once\u0000(YORO), a novel paradigm that directly internalizes database knowledge into the\u0000parametric knowledge of a text-to-SQL model during training and eliminates the\u0000need for schema encoding during inference. YORO significantly reduces the input\u0000token length by 66%-98%. Despite its shorter inputs, our empirical results\u0000demonstrate YORO's competitive performances with traditional systems on three\u0000benchmarks as well as its significant outperformance on large databases.\u0000Furthermore, YORO excels in handling questions with challenging value\u0000retrievals such as abbreviation.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"45 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yumeng Yang, Peter Krusche, Kristyn Pantoja, Cheng Shi, Ethan Ludmir, Kirk Roberts, Gen Zhu
Tables, figures, and listings (TFLs) are essential tools for summarizing clinical trial data. Creation of TFLs for reporting activities is often a time-consuming task encountered routinely during the execution of clinical trials. This study explored the use of large language models (LLMs) to automate the generation of TFLs through prompt engineering and few-shot transfer learning. Using public clinical trial data in ADaM format, our results demonstrated that LLMs can efficiently generate TFLs with prompt instructions, showcasing their potential in this domain. Furthermore, we developed a conversational agent named Clinical Trial TFL Generation Agent, an app that matches user queries to predefined prompts that produce customized programs to generate specific predefined TFLs.
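The query-to-prompt matching step described above can be illustrated with a minimal sketch. The abstract does not say how the agent performs the match, so the keyword-overlap rule below is an assumption, and every name in it (`PREDEFINED_PROMPTS`, `KEYWORDS`, `match_prompt`) as well as the prompt texts are hypothetical.

```python
import re

# Hypothetical predefined prompts; the actual prompts used by the agent are not given.
PREDEFINED_PROMPTS = {
    "demographics_table": "Write a program that builds a baseline demographics table.",
    "adverse_events_table": "Write a program that tabulates adverse events by treatment arm.",
    "survival_figure": "Write a program that plots a Kaplan-Meier survival figure.",
}

# Hypothetical keyword sets used to route a query to one prompt.
KEYWORDS = {
    "demographics_table": {"demographics", "baseline", "age", "sex"},
    "adverse_events_table": {"adverse", "events", "safety"},
    "survival_figure": {"survival", "kaplan", "meier", "curve"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def match_prompt(query):
    """Return the predefined prompt whose keywords overlap most with the query.

    Returns None when no keyword matches, so the caller can ask the user
    to rephrase instead of guessing.
    """
    tokens = tokenize(query)
    best_key, best_score = None, 0
    for key, keywords in KEYWORDS.items():
        score = len(tokens & keywords)
        if score > best_score:
            best_key, best_score = key, score
    return PREDEFINED_PROMPTS[best_key] if best_key else None


if __name__ == "__main__":
    print(match_prompt("Show me a Kaplan-Meier survival curve"))
```

Returning `None` for an unmatched query keeps the agent restricted to the predefined TFLs, which is the behavior the abstract describes; a real system might use embeddings or an LLM classifier for the match instead.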
{"title":"Using Large Language Models to Generate Clinical Trial Tables and Figures","authors":"Yumeng Yang, Peter Krusche, Kristyn Pantoja, Cheng Shi, Ethan Ludmir, Kirk Roberts, Gen Zhu","doi":"arxiv-2409.12046","DOIUrl":"https://doi.org/arxiv-2409.12046","url":null,"abstract":"Tables, figures, and listings (TFLs) are essential tools for summarizing\u0000clinical trial data. Creation of TFLs for reporting activities is often a\u0000time-consuming task encountered routinely during the execution of clinical\u0000trials. This study explored the use of large language models (LLMs) to automate\u0000the generation of TFLs through prompt engineering and few-shot transfer\u0000learning. Using public clinical trial data in ADaM format, our results\u0000demonstrated that LLMs can efficiently generate TFLs with prompt instructions,\u0000showcasing their potential in this domain. Furthermore, we developed a\u0000conservational agent named Clinical Trial TFL Generation Agent: An app that\u0000matches user queries to predefined prompts that produce customized programs to\u0000generate specific predefined TFLs.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xuanchang Zhang, Wei Xiong, Lichang Chen, Tianyi Zhou, Heng Huang, Tong Zhang
In this paper, we study format biases in reinforcement learning from human feedback (RLHF). We observe that many widely-used preference models, including human evaluators, GPT-4, and top-ranking models on the RewardBench benchmark, exhibit strong biases towards specific format patterns, such as lists, links, bold text, and emojis. Furthermore, large language models (LLMs) can exploit these biases to achieve higher rankings on popular benchmarks like AlpacaEval and LMSYS Chatbot Arena. One notable example of this is verbosity bias, where current preference models favor longer responses that appear more comprehensive, even when their quality is equal to or lower than shorter, competing responses. However, format biases beyond verbosity remain largely underexplored in the literature. In this work, we extend the study of biases in preference learning beyond the commonly recognized length bias, offering a comprehensive analysis of a wider range of format biases. Additionally, we show that with a small amount of biased data (less than 1%), we can inject significant bias into the reward model. Moreover, these format biases can also be easily exploited by downstream alignment algorithms, such as best-of-n sampling and online iterative DPO, as it is usually easier to manipulate the format than to improve the quality of responses. Our findings emphasize the need to disentangle format and content both for designing alignment algorithms and evaluating models.
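To make the exploitation mechanism concrete, here is a toy sketch of best-of-n sampling against a reward model that carries a format bias. It is not the paper's reward model or experimental setup: `toy_biased_reward`, the bonus values, and the candidate responses are all hypothetical, chosen only to show how a lower-quality but heavily formatted response can win the selection.

```python
import re


def toy_biased_reward(response, quality):
    """Hypothetical reward: the true quality plus bonuses for format patterns.

    The bonus sizes are made up for illustration.
    """
    bonus = 0.0
    if re.search(r"^\s*[-*] ", response, flags=re.MULTILINE):
        bonus += 0.3  # response contains list items
    if "**" in response:
        bonus += 0.2  # response contains bold text
    if "http" in response:
        bonus += 0.2  # response contains a link
    return quality + bonus


def best_of_n(candidates):
    """Pick the (response, quality) pair that the biased reward scores highest."""
    return max(candidates, key=lambda c: toy_biased_reward(c[0], c[1]))


# Two hypothetical candidates: a plain, accurate answer and a
# heavily formatted answer of lower underlying quality.
candidates = [
    ("Paris is the capital of France.", 0.9),
    ("- **Paris**\n- see http://example.com", 0.6),
]

chosen = best_of_n(candidates)

if __name__ == "__main__":
    print(chosen)
```

In this toy setting the formatted response is selected even though its quality score is lower, which mirrors the abstract's point that manipulating format is usually easier than improving content.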
{"title":"From Lists to Emojis: How Format Bias Affects Model Alignment","authors":"Xuanchang Zhang, Wei Xiong, Lichang Chen, Tianyi Zhou, Heng Huang, Tong Zhang","doi":"arxiv-2409.11704","DOIUrl":"https://doi.org/arxiv-2409.11704","url":null,"abstract":"In this paper, we study format biases in reinforcement learning from human\u0000feedback (RLHF). We observe that many widely-used preference models, including\u0000human evaluators, GPT-4, and top-ranking models on the RewardBench benchmark,\u0000exhibit strong biases towards specific format patterns, such as lists, links,\u0000bold text, and emojis. Furthermore, large language models (LLMs) can exploit\u0000these biases to achieve higher rankings on popular benchmarks like AlpacaEval\u0000and LMSYS Chatbot Arena. One notable example of this is verbosity bias, where\u0000current preference models favor longer responses that appear more\u0000comprehensive, even when their quality is equal to or lower than shorter,\u0000competing responses. However, format biases beyond verbosity remain largely\u0000underexplored in the literature. In this work, we extend the study of biases in\u0000preference learning beyond the commonly recognized length bias, offering a\u0000comprehensive analysis of a wider range of format biases. Additionally, we show\u0000that with a small amount of biased data (less than 1%), we can inject\u0000significant bias into the reward model. Moreover, these format biases can also\u0000be easily exploited by downstream alignment algorithms, such as best-of-n\u0000sampling and online iterative DPO, as it is usually easier to manipulate the\u0000format than to improve the quality of responses. 
Our findings emphasize the\u0000need to disentangle format and content both for designing alignment algorithms\u0000and evaluating models.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiale Wang, Junhui Yu, Huanyong Liu, Chenanran Kong
Hierarchical and complex Mathematical Expression Recognition (MER) is challenging because a formula can have multiple possible interpretations, which complicates both parsing and evaluation. In this paper, we introduce the Hierarchical Detail-Focused Recognition dataset (HDR), the first dataset specifically designed to address these issues. It consists of a large-scale training set, HDR-100M, offering unprecedented scale and diversity with one hundred million training instances. The test set, HDR-Test, includes multiple interpretations of complex hierarchical formulas for comprehensive model performance evaluation. Additionally, the parsing of complex formulas often suffers from errors in fine-grained details. To address this, we propose the Hierarchical Detail-Focused Recognition Network (HDNet), an innovative framework that incorporates a hierarchical sub-formula module, focusing on the precise handling of formula details, thereby significantly enhancing MER performance. Experimental results demonstrate that HDNet outperforms existing MER models across various datasets.
{"title":"Enhancing Complex Formula Recognition with Hierarchical Detail-Focused Network","authors":"Jiale Wang, Junhui Yu, Huanyong Liu, Chenanran Kong","doi":"arxiv-2409.11677","DOIUrl":"https://doi.org/arxiv-2409.11677","url":null,"abstract":"Hierarchical and complex Mathematical Expression Recognition (MER) is\u0000challenging due to multiple possible interpretations of a formula, complicating\u0000both parsing and evaluation. In this paper, we introduce the Hierarchical\u0000Detail-Focused Recognition dataset (HDR), the first dataset specifically\u0000designed to address these issues. It consists of a large-scale training set,\u0000HDR-100M, offering an unprecedented scale and diversity with one hundred\u0000million training instances. And the test set, HDR-Test, includes multiple\u0000interpretations of complex hierarchical formulas for comprehensive model\u0000performance evaluation. Additionally, the parsing of complex formulas often\u0000suffers from errors in fine-grained details. To address this, we propose the\u0000Hierarchical Detail-Focused Recognition Network (HDNet), an innovative\u0000framework that incorporates a hierarchical sub-formula module, focusing on the\u0000precise handling of formula details, thereby significantly enhancing MER\u0000performance. Experimental results demonstrate that HDNet outperforms existing\u0000MER models across various datasets.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Eduardo Sánchez, Belen Alastruey, Christophe Ropers, Pontus Stenetorp, Mikel Artetxe, Marta R. Costa-jussà
We propose a new benchmark to measure a language model's linguistic reasoning skills without relying on pre-existing language-specific knowledge. The test covers 894 questions grouped in 160 problems across 75 (mostly) extremely low-resource languages, extracted from the International Linguistic Olympiad corpus. To attain high accuracy on this benchmark, models don't need previous knowledge of the tested language, as all the information needed to solve the linguistic puzzle is presented in the context. We find that, while all analyzed models rank below 25% accuracy, there is a significant gap between open and closed models, with the best-performing proprietary model at 24.05% and the best-performing open model at 8.84%.
{"title":"Linguini: A benchmark for language-agnostic linguistic reasoning","authors":"Eduardo Sánchez, Belen Alastruey, Christophe Ropers, Pontus Stenetorp, Mikel Artetxe, Marta R. Costa-jussà","doi":"arxiv-2409.12126","DOIUrl":"https://doi.org/arxiv-2409.12126","url":null,"abstract":"We propose a new benchmark to measure a language model's linguistic reasoning\u0000skills without relying on pre-existing language-specific knowledge. The test\u0000covers 894 questions grouped in 160 problems across 75 (mostly) extremely\u0000low-resource languages, extracted from the International Linguistic Olympiad\u0000corpus. To attain high accuracy on this benchmark, models don't need previous\u0000knowledge of the tested language, as all the information needed to solve the\u0000linguistic puzzle is presented in the context. We find that, while all analyzed\u0000models rank below 25% accuracy, there is a significant gap between open and\u0000closed models, with the best-performing proprietary model at 24.05% and the\u0000best-performing open model at 8.84%.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"91 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pengfei Hu, Zhenrong Zhang, Jiefeng Ma, Shuhang Liu, Jun Du, Jianshu Zhang
In recent years, visually-rich document understanding has attracted increasing attention. Transformer-based pre-trained models have become the mainstream approach, yielding significant performance gains in this field. However, the self-attention mechanism's quadratic computational complexity hinders their efficiency and ability to process long documents. In this paper, we present DocMamba, a novel framework based on the state space model. It is designed to reduce computational complexity to linear while preserving global modeling capabilities. To further enhance its effectiveness in document processing, we introduce the Segment-First Bidirectional Scan (SFBS) to capture contiguous semantic information. Experimental results demonstrate that DocMamba achieves new state-of-the-art results on downstream datasets such as FUNSD, CORD, and SROIE, while significantly improving speed and reducing memory usage. Notably, experiments on HRDoc confirm DocMamba's potential for length extrapolation. The code will be available online.
{"title":"DocMamba: Efficient Document Pre-training with State Space Model","authors":"Pengfei Hu, Zhenrong Zhang, Jiefeng Ma, Shuhang Liu, Jun Du, Jianshu Zhang","doi":"arxiv-2409.11887","DOIUrl":"https://doi.org/arxiv-2409.11887","url":null,"abstract":"In recent years, visually-rich document understanding has attracted\u0000increasing attention. Transformer-based pre-trained models have become the\u0000mainstream approach, yielding significant performance gains in this field.\u0000However, the self-attention mechanism's quadratic computational complexity\u0000hinders their efficiency and ability to process long documents. In this paper,\u0000we present DocMamba, a novel framework based on the state space model. It is\u0000designed to reduce computational complexity to linear while preserving global\u0000modeling capabilities. To further enhance its effectiveness in document\u0000processing, we introduce the Segment-First Bidirectional Scan (SFBS) to capture\u0000contiguous semantic information. Experimental results demonstrate that DocMamba\u0000achieves new state-of-the-art results on downstream datasets such as FUNSD,\u0000CORD, and SORIE, while significantly improving speed and reducing memory usage.\u0000Notably, experiments on the HRDoc confirm DocMamba's potential for length\u0000extrapolation. The code will be available online.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Issey Sukeda
The recent success of large language models (LLMs) and the scaling law has led to a widespread adoption of larger models. Particularly in the healthcare industry, there is an increasing demand for locally operated LLMs due to security concerns. However, the majority of high-quality open-source LLMs have a size of 70B parameters, imposing significant financial burdens on users for GPU preparation and operation. To overcome these issues, we present a medical adaptation based on recent 7B models, which enables operation with low computational resources. We compare the performance on medical question-answering benchmarks in two languages (Japanese and English), demonstrating that its scores reach parity with or surpass those of currently existing medical LLMs that are ten times larger. We find that fine-tuning an English-centric base model on a Japanese medical dataset improves the score in both languages, supporting the effect of cross-lingual knowledge transfer. We hope that this study will alleviate financial challenges, serving as a stepping stone for clinical institutions to practically utilize LLMs locally. Our evaluation code is available at https://huggingface.co/stardust-coder/jmedllm-7b-v1.
{"title":"Development and bilingual evaluation of Japanese medical large language model within reasonably low computational resources","authors":"Issey Sukeda","doi":"arxiv-2409.11783","DOIUrl":"https://doi.org/arxiv-2409.11783","url":null,"abstract":"The recent success of large language models (LLMs) and the scaling law has\u0000led to a widespread adoption of larger models. Particularly in the healthcare\u0000industry, there is an increasing demand for locally operated LLMs due to\u0000security concerns. However, the majority of high quality open-source LLMs have\u0000a size of 70B parameters, imposing significant financial burdens on users for\u0000GPU preparation and operation. To overcome these issues, we present a medical\u0000adaptation based on the recent 7B models, which enables the operation in low\u0000computational resources. We compare the performance on medical\u0000question-answering benchmarks in two languages (Japanese and English),\u0000demonstrating that its scores reach parity with or surpass those of currently\u0000existing medical LLMs that are ten times larger. We find that fine-tuning an\u0000English-centric base model on Japanese medical dataset improves the score in\u0000both language, supporting the effect of cross-lingual knowledge transfer. We\u0000hope that this study will alleviate financial challenges, serving as a stepping\u0000stone for clinical institutions to practically utilize LLMs locally. 
Our\u0000evaluation code is available at\u0000https://huggingface.co/stardust-coder/jmedllm-7b-v1.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mahammed Kamruzzaman, Hieu Nguyen, Nazmul Hassan, Gene Louis Kim
As the deployment of large language models (LLMs) expands, there is an increasing demand for personalized LLMs. One method to personalize and guide the outputs of these models is by assigning a persona -- a role that describes the expected behavior of the LLM (e.g., a man, a woman, an engineer). This study investigates whether an LLM's understanding of social norms varies across assigned personas. Ideally, the perception of a social norm should remain consistent regardless of the persona, since acceptability of a social norm should be determined by the region the norm originates from, rather than by individual characteristics such as gender, body size, or race. A norm is universal within its cultural context. In our research, we tested 36 distinct personas from 12 sociodemographic categories (e.g., age, gender, beauty) across four different LLMs. We find that LLMs' cultural norm interpretation varies based on the persona used. The interpretation also varies within a sociodemographic category (e.g., a fat person and a thin person in the physical appearance group): an LLM with the more socially desirable persona (e.g., a thin person) interprets social norms more accurately than with the less socially desirable persona (e.g., a fat person). We also discuss how different types of social biases may contribute to the results that we observe.
{"title":"\"A Woman is More Culturally Knowledgeable than A Man?\": The Effect of Personas on Cultural Norm Interpretation in LLMs","authors":"Mahammed Kamruzzaman, Hieu Nguyen, Nazmul Hassan, Gene Louis Kim","doi":"arxiv-2409.11636","DOIUrl":"https://doi.org/arxiv-2409.11636","url":null,"abstract":"As the deployment of large language models (LLMs) expands, there is an\u0000increasing demand for personalized LLMs. One method to personalize and guide\u0000the outputs of these models is by assigning a persona -- a role that describes\u0000the expected behavior of the LLM (e.g., a man, a woman, an engineer). This\u0000study investigates whether an LLM's understanding of social norms varies across\u0000assigned personas. Ideally, the perception of a social norm should remain\u0000consistent regardless of the persona, since acceptability of a social norm\u0000should be determined by the region the norm originates from, rather than by\u0000individual characteristics such as gender, body size, or race. A norm is\u0000universal within its cultural context. In our research, we tested 36 distinct\u0000personas from 12 sociodemographic categories (e.g., age, gender, beauty) across\u0000four different LLMs. We find that LLMs' cultural norm interpretation varies\u0000based on the persona used and the norm interpretation also varies within a\u0000sociodemographic category (e.g., a fat person and a thin person as in physical\u0000appearance group) where an LLM with the more socially desirable persona (e.g.,\u0000a thin person) interprets social norms more accurately than with the less\u0000socially desirable persona (e.g., a fat person). 
We also discuss how different\u0000types of social biases may contribute to the results that we observe.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tuan-Cuong Vuong, Trang Mai Xuan, Thien Van Luong
In tackling the challenge of Multi-Document Summarization (MDS), numerous methods have been proposed, spanning both extractive and abstractive summarization techniques. However, each approach has its own limitations, making it less effective to rely solely on either one. An emerging and promising strategy involves a synergistic fusion of extractive and abstractive summarization methods. Despite the plethora of studies in this domain, research on the combined methodology remains scarce, particularly in the context of Vietnamese language processing. This paper presents a novel Vietnamese MDS framework leveraging a two-component pipeline architecture that integrates extractive and abstractive techniques. The first component employs an extractive approach to identify key sentences within each document. This is achieved by a modification of the pre-trained BERT network, which derives semantically meaningful phrase embeddings using siamese and triplet network structures. The second component utilizes the VBD-LLaMA2-7B-50b model for abstractive summarization, ultimately generating the final summary document. Our proposed framework performs well, attaining a ROUGE-2 score of 39.6% on the VN-MDS dataset and outperforming the state-of-the-art baselines.
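The extractive first stage of such a pipeline can be sketched as follows. This is only an illustration under stated assumptions: bag-of-words count vectors stand in for the BERT-derived embeddings, and ranking sentences by cosine similarity to the document centroid is an assumed selection rule, not the one reported by the authors. The function names `embed`, `cosine`, and `top_k_sentences` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Stand-in embedding: a bag-of-words count vector (not a BERT embedding)."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_sentences(sentences, k=2):
    """Return the k sentences closest to the document centroid, in document order."""
    vectors = [embed(s) for s in sentences]
    centroid = Counter()
    for vector in vectors:
        centroid.update(vector)  # sum of all sentence vectors
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(vectors[i], centroid),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]


if __name__ == "__main__":
    doc = [
        "The flood damaged many homes in the city.",
        "Rescue teams helped flood victims in the city.",
        "A cat slept.",
    ]
    print(top_k_sentences(doc, k=2))
```

In a two-stage setup like the one described, the selected sentences from each document would then be concatenated and passed to the abstractive model, which writes the final summary.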
{"title":"BERT-VBD: Vietnamese Multi-Document Summarization Framework","authors":"Tuan-Cuong Vuong, Trang Mai Xuan, Thien Van Luong","doi":"arxiv-2409.12134","DOIUrl":"https://doi.org/arxiv-2409.12134","url":null,"abstract":"In tackling the challenge of Multi-Document Summarization (MDS), numerous\u0000methods have been proposed, spanning both extractive and abstractive\u0000summarization techniques. However, each approach has its own limitations,\u0000making it less effective to rely solely on either one. An emerging and\u0000promising strategy involves a synergistic fusion of extractive and abstractive\u0000summarization methods. Despite the plethora of studies in this domain, research\u0000on the combined methodology remains scarce, particularly in the context of\u0000Vietnamese language processing. This paper presents a novel Vietnamese MDS\u0000framework leveraging a two-component pipeline architecture that integrates\u0000extractive and abstractive techniques. The first component employs an\u0000extractive approach to identify key sentences within each document. This is\u0000achieved by a modification of the pre-trained BERT network, which derives\u0000semantically meaningful phrase embeddings using siamese and triplet network\u0000structures. 
The second component utilizes the VBD-LLaMA2-7B-50b model for\u0000abstractive summarization, ultimately generating the final summary document.\u0000Our proposed framework demonstrates a positive performance, attaining ROUGE-2\u0000scores of 39.6% on the VN-MDS dataset and outperforming the state-of-the-art\u0000baselines.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}