Large language models (LLMs) have exhibited remarkable few-shot learning capabilities and unified the paradigm of NLP tasks through the in-context learning (ICL) technique. Despite the success of ICL, the quality of the exemplar demonstrations can significantly influence the LLM's performance. Existing exemplar selection methods mainly focus on the semantic similarity between queries and candidate exemplars, yet the logical connections between reasoning steps can also help depict the problem-solving process. In this paper, we propose a novel method named Reasoning Graph-enhanced Exemplar Retrieval (RGER). RGER first queries the LLM to generate an initial response, then expresses the intermediate problem-solving steps as a graph structure. After that, it employs a graph kernel to select exemplars with both semantic and structural similarity. Extensive experiments demonstrate that structural relationships help align queries with candidate exemplars. The efficacy of RGER on math and logical reasoning tasks showcases its superiority over state-of-the-art retrieval-based approaches. Our code is released at https://github.com/Yukang-Lin/RGER.
{"title":"Reasoning Graph Enhanced Exemplars Retrieval for In-Context Learning","authors":"Yukang Lin, Bingchen Zhong, Shuoran Jiang, Joanna Siebert, Qingcai Chen","doi":"arxiv-2409.11147","DOIUrl":"https://doi.org/arxiv-2409.11147","url":null,"abstract":"Large language models(LLMs) have exhibited remarkable few-shot learning\u0000capabilities and unified the paradigm of NLP tasks through the in-context\u0000learning(ICL) technique. Despite the success of ICL, the quality of the\u0000exemplar demonstrations can significantly influence the LLM's performance.\u0000Existing exemplar selection methods mainly focus on the semantic similarity\u0000between queries and candidate exemplars. On the other hand, the logical\u0000connections between reasoning steps can be beneficial to depict the\u0000problem-solving process as well. In this paper, we proposes a novel method\u0000named Reasoning Graph-enhanced Exemplar Retrieval(RGER). RGER first quires LLM\u0000to generate an initial response, then expresses intermediate problem-solving\u0000steps to a graph structure. After that, it employs graph kernel to select\u0000exemplars with semantic and structural similarity. Extensive experiments\u0000demonstrate the structural relationship is helpful to the alignment of queries\u0000and candidate exemplars. The efficacy of RGER on math and logit reasoning tasks\u0000showcases its superiority over state-of-the-art retrieval-based approaches. Our\u0000code is released at https://github.com/Yukang-Lin/RGER.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Guijin Son, Hyunwoo Ko, Hoyoung Lee, Yewon Kim, Seunghyeok Hong
LLM-as-a-Judge and reward models are widely used alternatives to multiple-choice questions or human annotators for large language model (LLM) evaluation. Their efficacy shines in evaluating long-form responses, serving a critical role as evaluators of leaderboards and as proxies to align LLMs via reinforcement learning. However, despite their popularity, their effectiveness outside of English remains largely unexplored. In this paper, we conduct a comprehensive analysis of automated evaluators, reporting key findings on their behavior in a non-English environment. First, we discover that English evaluation capabilities significantly influence language-specific capabilities, often more than language proficiency itself, enabling evaluators trained in English to easily transfer their skills to other languages. Second, we identify critical shortcomings, where LLMs fail to detect and penalize errors, such as factual inaccuracies, cultural misrepresentations, and the presence of unwanted language. Finally, we release Kudge, the first non-English meta-evaluation dataset containing 5,012 human annotations in Korean.
{"title":"LLM-as-a-Judge & Reward Model: What They Can and Cannot Do","authors":"Guijin Son, Hyunwoo Ko, Hoyoung Lee, Yewon Kim, Seunghyeok Hong","doi":"arxiv-2409.11239","DOIUrl":"https://doi.org/arxiv-2409.11239","url":null,"abstract":"LLM-as-a-Judge and reward models are widely used alternatives of\u0000multiple-choice questions or human annotators for large language model (LLM)\u0000evaluation. Their efficacy shines in evaluating long-form responses, serving a\u0000critical role as evaluators of leaderboards and as proxies to align LLMs via\u0000reinforcement learning. However, despite their popularity, their effectiveness\u0000outside of English remains largely unexplored. In this paper, we conduct a\u0000comprehensive analysis on automated evaluators, reporting key findings on their\u0000behavior in a non-English environment. First, we discover that English\u0000evaluation capabilities significantly influence language-specific capabilities,\u0000often more than the language proficiency itself, enabling evaluators trained in\u0000English to easily transfer their skills to other languages. Second, we identify\u0000critical shortcomings, where LLMs fail to detect and penalize errors, such as\u0000factual inaccuracies, cultural misrepresentations, and the presence of unwanted\u0000language. Finally, we release Kudge, the first non-English meta-evaluation\u0000dataset containing 5,012 human annotations in Korean.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tokenizers act as a bridge between human language and the latent space of language models, influencing how language is represented in these models. Due to the immense popularity of English-Centric Large Language Models (LLMs), efforts are being made to adapt them for other languages. However, we demonstrate that, from a tokenization standpoint, not all tokenizers offer fair representation for complex script languages such as Tamil, Sinhala, and Hindi, primarily due to the choice of pre-tokenization methods. We go further to show that pre-tokenization plays a more critical role than the tokenization algorithm itself in achieving an egalitarian representation of these complex script languages. To address this, we introduce an improvement to the Byte Pair Encoding (BPE) algorithm by incorporating graphemes, which we term Grapheme Pair Encoding (GPE). Our experiments show that grapheme-based character extraction outperforms byte-level tokenizers for complex scripts. We validate this approach through experiments on Tamil, Sinhala, and Hindi.
{"title":"Egalitarian Language Representation in Language Models: It All Begins with Tokenizers","authors":"Menan Velayuthan, Kengatharaiyer Sarveswaran","doi":"arxiv-2409.11501","DOIUrl":"https://doi.org/arxiv-2409.11501","url":null,"abstract":"Tokenizers act as a bridge between human language and the latent space of\u0000language models, influencing how language is represented in these models. Due\u0000to the immense popularity of English-Centric Large Language Models (LLMs),\u0000efforts are being made to adapt them for other languages. However, we\u0000demonstrate that, from a tokenization standpoint, not all tokenizers offer fair\u0000representation for complex script languages such as Tamil, Sinhala, and Hindi,\u0000primarily due to the choice of pre-tokenization methods. We go further to show\u0000that pre-tokenization plays a more critical role than the tokenization\u0000algorithm itself in achieving an egalitarian representation of these complex\u0000script languages. To address this, we introduce an improvement to the Byte Pair\u0000Encoding (BPE) algorithm by incorporating graphemes, which we term Grapheme\u0000Pair Encoding (GPE). Our experiments show that grapheme-based character\u0000extraction outperforms byte-level tokenizers for complex scripts. We validate\u0000this approach through experiments on Tamil, Sinhala, and Hindi.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mengfei Liang, Archish Arun, Zekun Wu, Cristian Munoz, Jonathan Lutch, Emre Kazim, Adriano Koshiyama, Philip Treleaven
Hallucination, the generation of factually incorrect content, is a growing challenge in Large Language Models (LLMs). Existing detection and mitigation methods are often isolated and insufficient for domain-specific needs, lacking a standardized pipeline. This paper introduces THaMES (Tool for Hallucination Mitigations and EvaluationS), an integrated framework and library addressing this gap. THaMES offers an end-to-end solution for evaluating and mitigating hallucinations in LLMs, featuring automated test set generation, multifaceted benchmarking, and adaptable mitigation strategies. It automates test set creation from any corpus, ensuring high data quality, diversity, and cost-efficiency through techniques like batch processing, weighted sampling, and counterfactual validation. THaMES assesses a model's ability to detect and reduce hallucinations across various tasks, including text generation and binary classification, applying optimal mitigation strategies like In-Context Learning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient Fine-tuning (PEFT). Evaluations of state-of-the-art LLMs using a knowledge base of academic papers, political news, and Wikipedia reveal that commercial models like GPT-4o benefit more from RAG than ICL, while open-weight models like Llama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT significantly enhances the performance of Llama-3.1-8B-Instruct in both evaluation tasks.
{"title":"THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models","authors":"Mengfei Liang, Archish Arun, Zekun Wu, Cristian Munoz, Jonathan Lutch, Emre Kazim, Adriano Koshiyama, Philip Treleaven","doi":"arxiv-2409.11353","DOIUrl":"https://doi.org/arxiv-2409.11353","url":null,"abstract":"Hallucination, the generation of factually incorrect content, is a growing\u0000challenge in Large Language Models (LLMs). Existing detection and mitigation\u0000methods are often isolated and insufficient for domain-specific needs, lacking\u0000a standardized pipeline. This paper introduces THaMES (Tool for Hallucination\u0000Mitigations and EvaluationS), an integrated framework and library addressing\u0000this gap. THaMES offers an end-to-end solution for evaluating and mitigating\u0000hallucinations in LLMs, featuring automated test set generation, multifaceted\u0000benchmarking, and adaptable mitigation strategies. It automates test set\u0000creation from any corpus, ensuring high data quality, diversity, and\u0000cost-efficiency through techniques like batch processing, weighted sampling,\u0000and counterfactual validation. THaMES assesses a model's ability to detect and\u0000reduce hallucinations across various tasks, including text generation and\u0000binary classification, applying optimal mitigation strategies like In-Context\u0000Learning (ICL), Retrieval Augmented Generation (RAG), and Parameter-Efficient\u0000Fine-tuning (PEFT). Evaluations of state-of-the-art LLMs using a knowledge base\u0000of academic papers, political news, and Wikipedia reveal that commercial models\u0000like GPT-4o benefit more from RAG than ICL, while open-weight models like\u0000Llama-3.1-8B-Instruct and Mistral-Nemo gain more from ICL. Additionally, PEFT\u0000significantly enhances the performance of Llama-3.1-8B-Instruct in both\u0000evaluation tasks.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Samee Arif, Taimoor Arif, Aamina Jamal Khan, Muhammad Saad Haroon, Agha Ali Raza, Awais Athar
This paper introduces the concept of an education tool that utilizes Generative Artificial Intelligence (GenAI) to enhance storytelling for children. The system combines GenAI-driven narrative co-creation, text-to-speech conversion, and text-to-video generation to produce an engaging experience for learners. We describe the co-creation process, the adaptation of narratives into spoken words using text-to-speech models, and the transformation of these narratives into contextually relevant visuals through text-to-video technology. Our evaluation covers the linguistics of the generated stories, the text-to-speech conversion quality, and the accuracy of the generated visuals.
{"title":"The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives","authors":"Samee Arif, Taimoor Arif, Aamina Jamal Khan, Muhammad Saad Haroon, Agha Ali Raza, Awais Athar","doi":"arxiv-2409.11261","DOIUrl":"https://doi.org/arxiv-2409.11261","url":null,"abstract":"This paper introduces the concept of an education tool that utilizes\u0000Generative Artificial Intelligence (GenAI) to enhance storytelling for\u0000children. The system combines GenAI-driven narrative co-creation,\u0000text-to-speech conversion, and text-to-video generation to produce an engaging\u0000experience for learners. We describe the co-creation process, the adaptation of\u0000narratives into spoken words using text-to-speech models, and the\u0000transformation of these narratives into contextually relevant visuals through\u0000text-to-video technology. Our evaluation covers the linguistics of the\u0000generated stories, the text-to-speech conversion quality, and the accuracy of\u0000the generated visuals.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"188 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We introduce a technique for multi-document grounded multi-turn synthetic dialog generation that incorporates three main ideas. First, we control the overall dialog flow using taxonomy-driven user queries that are generated with Chain-of-Thought (CoT) prompting. Second, we support the generation of multi-document grounded dialogs by mimicking real-world use of retrievers to update the grounding documents after every user turn in the dialog. Third, we apply LLM-as-a-Judge to filter out queries with incorrect answers. Human evaluation of the synthetic dialog data suggests that the data is diverse, coherent, and includes mostly correct answers. Both human and automatic evaluations of answerable queries indicate that models fine-tuned on synthetic dialogs consistently outperform those fine-tuned on existing human-generated training data across four publicly available multi-turn, document-grounded benchmark test sets.
{"title":"Multi-Document Grounded Multi-Turn Synthetic Dialog Generation","authors":"Young-Suk Lee, Chulaka Gunasekara, Danish Contractor, Ramón Fernandez Astudillo, Radu Florian","doi":"arxiv-2409.11500","DOIUrl":"https://doi.org/arxiv-2409.11500","url":null,"abstract":"We introduce a technique for multi-document grounded multi-turn synthetic\u0000dialog generation that incorporates three main ideas. First, we control the\u0000overall dialog flow using taxonomy-driven user queries that are generated with\u0000Chain-of-Thought (CoT) prompting. Second, we support the generation of\u0000multi-document grounded dialogs by mimicking real-world use of retrievers to\u0000update the grounding documents after every user-turn in the dialog. Third, we\u0000apply LLM-as-a-Judge to filter out queries with incorrect answers. Human\u0000evaluation of the synthetic dialog data suggests that the data is diverse,\u0000coherent, and includes mostly correct answers. Both human and automatic\u0000evaluations of answerable queries indicate that models fine-tuned on synthetic\u0000dialogs consistently out-perform those fine-tuned on existing human generated\u0000training data across four publicly available multi-turn document grounded\u0000benchmark test sets.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent psycholinguistic research has compared human reading times to surprisal estimates from language models to study the factors shaping human sentence processing difficulty. Previous studies have shown a strong fit between surprisal values from Transformers and reading times. However, standard Transformers work with a lossless representation of the entire previous linguistic context, unlike models of human language processing that include memory decay. To bridge this gap, this paper evaluates a modification of the Transformer model that uses ALiBi (Press et al., 2022), a recency bias added to attention scores. Surprisal estimates with ALiBi show an improved fit to human reading times compared to a standard Transformer baseline. A subsequent analysis of attention heads suggests that ALiBi's mixture of slopes -- which determine the rate of memory decay in each attention head -- may play a role in the improvement by helping models with ALiBi to track different kinds of linguistic dependencies.
{"title":"Linear Recency Bias During Training Improves Transformers' Fit to Reading Times","authors":"Christian Clark, Byung-Doh Oh, William Schuler","doi":"arxiv-2409.11250","DOIUrl":"https://doi.org/arxiv-2409.11250","url":null,"abstract":"Recent psycholinguistic research has compared human reading times to\u0000surprisal estimates from language models to study the factors shaping human\u0000sentence processing difficulty. Previous studies have shown a strong fit\u0000between surprisal values from Transformers and reading times. However, standard\u0000Transformers work with a lossless representation of the entire previous\u0000linguistic context, unlike models of human language processing that include\u0000memory decay. To bridge this gap, this paper evaluates a modification of the\u0000Transformer model that uses ALiBi (Press et al., 2022), a recency bias added to\u0000attention scores. Surprisal estimates with ALiBi show an improved fit to human\u0000reading times compared to a standard Transformer baseline. A subsequent\u0000analysis of attention heads suggests that ALiBi's mixture of slopes -- which\u0000determine the rate of memory decay in each attention head -- may play a role in\u0000the improvement by helping models with ALiBi to track different kinds of\u0000linguistic dependencies.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"1243 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, Philip Treleaven
Stereotypes are generalised assumptions about societal groups, and even state-of-the-art LLMs using in-context learning struggle to identify them accurately. Due to the subjective nature of stereotypes, where what constitutes a stereotype can vary widely depending on cultural, social, and individual perspectives, robust explainability is crucial. Explainable models ensure that these nuanced judgments can be understood and validated by human users, promoting trust and accountability. We address these challenges by introducing HEARTS (Holistic Framework for Explainable, Sustainable, and Robust Text Stereotype Detection), a framework that enhances model performance, minimises carbon footprint, and provides transparent, interpretable explanations. We establish the Expanded Multi-Grain Stereotype Dataset (EMGSD), comprising 57,201 labeled texts across six groups, including under-represented demographics like LGBTQ+ and regional stereotypes. Ablation studies confirm that BERT models fine-tuned on EMGSD outperform those trained on individual components. We then analyse a fine-tuned, carbon-efficient ALBERT-V2 model using SHAP to generate token-level importance values, ensuring alignment with human understanding, and calculate explainability confidence scores by comparing SHAP and LIME outputs. Finally, HEARTS is applied to assess stereotypical bias in 12 LLM outputs, revealing a gradual reduction in bias over time within model families.
{"title":"HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection","authors":"Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, Philip Treleaven","doi":"arxiv-2409.11579","DOIUrl":"https://doi.org/arxiv-2409.11579","url":null,"abstract":"Stereotypes are generalised assumptions about societal groups, and even\u0000state-of-the-art LLMs using in-context learning struggle to identify them\u0000accurately. Due to the subjective nature of stereotypes, where what constitutes\u0000a stereotype can vary widely depending on cultural, social, and individual\u0000perspectives, robust explainability is crucial. Explainable models ensure that\u0000these nuanced judgments can be understood and validated by human users,\u0000promoting trust and accountability. We address these challenges by introducing\u0000HEARTS (Holistic Framework for Explainable, Sustainable, and Robust Text\u0000Stereotype Detection), a framework that enhances model performance, minimises\u0000carbon footprint, and provides transparent, interpretable explanations. We\u0000establish the Expanded Multi-Grain Stereotype Dataset (EMGSD), comprising\u000057,201 labeled texts across six groups, including under-represented\u0000demographics like LGBTQ+ and regional stereotypes. Ablation studies confirm\u0000that BERT models fine-tuned on EMGSD outperform those trained on individual\u0000components. We then analyse a fine-tuned, carbon-efficient ALBERT-V2 model\u0000using SHAP to generate token-level importance values, ensuring alignment with\u0000human understanding, and calculate explainability confidence scores by\u0000comparing SHAP and LIME outputs. Finally, HEARTS is applied to assess\u0000stereotypical bias in 12 LLM outputs, revealing a gradual reduction in bias\u0000over time within model families.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Khaled AlNuaimi, Gautier Marti, Mathieu Ravaut, Abdulla AlKetbi, Andreas Henschel, Raed Jaradat
Enriching datasets with demographic information, such as gender, race, and age from names, is a critical task in fields like healthcare, public policy, and social sciences. Such demographic insights allow for more precise and effective engagement with target populations. Despite previous efforts employing hidden Markov models and recurrent neural networks to predict demographics from names, significant limitations persist: the lack of large-scale, well-curated, unbiased, publicly available datasets, and the lack of an approach robust across datasets. This scarcity has hindered the development of traditional supervised learning approaches. In this paper, we demonstrate that the zero-shot capabilities of Large Language Models (LLMs) can perform as well as, if not better than, bespoke models trained on specialized data. We apply these LLMs to a variety of datasets, including a real-life, unlabelled dataset of licensed financial professionals in Hong Kong, and critically assess the inherent demographic biases in these models. Our work not only advances the state-of-the-art in demographic enrichment but also opens avenues for future research in mitigating biases in LLMs.
{"title":"Enriching Datasets with Demographics through Large Language Models: What's in a Name?","authors":"Khaled AlNuaimi, Gautier Marti, Mathieu Ravaut, Abdulla AlKetbi, Andreas Henschel, Raed Jaradat","doi":"arxiv-2409.11491","DOIUrl":"https://doi.org/arxiv-2409.11491","url":null,"abstract":"Enriching datasets with demographic information, such as gender, race, and\u0000age from names, is a critical task in fields like healthcare, public policy,\u0000and social sciences. Such demographic insights allow for more precise and\u0000effective engagement with target populations. Despite previous efforts\u0000employing hidden Markov models and recurrent neural networks to predict\u0000demographics from names, significant limitations persist: the lack of\u0000large-scale, well-curated, unbiased, publicly available datasets, and the lack\u0000of an approach robust across datasets. This scarcity has hindered the\u0000development of traditional supervised learning approaches. In this paper, we\u0000demonstrate that the zero-shot capabilities of Large Language Models (LLMs) can\u0000perform as well as, if not better than, bespoke models trained on specialized\u0000data. We apply these LLMs to a variety of datasets, including a real-life,\u0000unlabelled dataset of licensed financial professionals in Hong Kong, and\u0000critically assess the inherent demographic biases in these models. Our work not\u0000only advances the state-of-the-art in demographic enrichment but also opens\u0000avenues for future research in mitigating biases in LLMs.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jianing Wang, Yang Zhou, Xiaocheng Zhang, Mengjiao Bao, Peng Yan
Iterative preference optimization has recently become one of the de facto training paradigms for large language models (LLMs), but its performance is still underwhelming due to the large amount of noisy preference data yielded in the loop. To combat this issue, we present an Uncertainty-enhanced Preference Optimization (UPO) framework to make the LLM self-evolve with reliable feedback. The key idea is to mitigate the noisy preference data derived from the current policy and reward models by performing pair-wise uncertainty estimation and judiciously reliable feedback sampling. To reach this goal, we introduce an estimator model, which incorporates Monte Carlo (MC) dropout in a Bayesian neural network (BNN) to perform uncertainty estimation for the preference data derived from the LLM policy. Compared to existing methods that directly filter generated responses based on the reward score, the estimator focuses on the model uncertainty in a pair-wise manner and effectively bypasses the confirmation bias problem of the reward model. Additionally, we propose an uncertainty-enhanced self-evolution algorithm to improve the robustness of preference optimization and encourage the LLM to generate responses with both high reward and certainty. Extensive experiments over multiple benchmarks demonstrate that our framework substantially alleviates the noise problem and improves the performance of iterative preference optimization.
{"title":"Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization","authors":"Jianing Wang, Yang Zhou, Xiaocheng Zhang, Mengjiao Bao, Peng Yan","doi":"arxiv-2409.11212","DOIUrl":"https://doi.org/arxiv-2409.11212","url":null,"abstract":"Iterative preference optimization has recently become one of the de-facto\u0000training paradigms for large language models (LLMs), but the performance is\u0000still underwhelming due to too much noisy preference data yielded in the loop.\u0000To combat this issue, we present an textbf{U}ncertainty-enhanced\u0000textbf{P}reference textbf{O}ptimization (UPO) framework to make the LLM\u0000self-evolve with reliable feedback. The key idea is mitigating the noisy\u0000preference data derived from the current policy and reward models by performing\u0000pair-wise uncertainty estimation and judiciously reliable feedback sampling. To\u0000reach this goal, we thus introduce an estimator model, which incorporates Monte\u0000Carlo (MC) dropout in Bayesian neural network (BNN) to perform uncertainty\u0000estimation for the preference data derived from the LLM policy. Compared to the\u0000existing methods that directly filter generated responses based on the reward\u0000score, the estimator focuses on the model uncertainty in a pair-wise manner and\u0000effectively bypasses the confirmation bias problem of the reward model.\u0000Additionally, we also propose an uncertainty-enhanced self-evolution algorithm\u0000to improve the robustness of preference optimization and encourage the LLM to\u0000generate responses with both high reward and certainty. Extensive experiments\u0000over multiple benchmarks demonstrate that our framework substantially\u0000alleviates the noisy problem and improves the performance of iterative\u0000preference optimization.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"91 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}