HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection

Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, Philip Treleaven
{"title":"HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection","authors":"Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, Philip Treleaven","doi":"arxiv-2409.11579","DOIUrl":null,"url":null,"abstract":"Stereotypes are generalised assumptions about societal groups, and even\nstate-of-the-art LLMs using in-context learning struggle to identify them\naccurately. Due to the subjective nature of stereotypes, where what constitutes\na stereotype can vary widely depending on cultural, social, and individual\nperspectives, robust explainability is crucial. Explainable models ensure that\nthese nuanced judgments can be understood and validated by human users,\npromoting trust and accountability. We address these challenges by introducing\nHEARTS (Holistic Framework for Explainable, Sustainable, and Robust Text\nStereotype Detection), a framework that enhances model performance, minimises\ncarbon footprint, and provides transparent, interpretable explanations. We\nestablish the Expanded Multi-Grain Stereotype Dataset (EMGSD), comprising\n57,201 labeled texts across six groups, including under-represented\ndemographics like LGBTQ+ and regional stereotypes. Ablation studies confirm\nthat BERT models fine-tuned on EMGSD outperform those trained on individual\ncomponents. We then analyse a fine-tuned, carbon-efficient ALBERT-V2 model\nusing SHAP to generate token-level importance values, ensuring alignment with\nhuman understanding, and calculate explainability confidence scores by\ncomparing SHAP and LIME outputs. Finally, HEARTS is applied to assess\nstereotypical bias in 12 LLM outputs, revealing a gradual reduction in bias\nover time within model families.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11579","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Stereotypes are generalised assumptions about societal groups, and even state-of-the-art LLMs using in-context learning struggle to identify them accurately. Due to the subjective nature of stereotypes, where what constitutes a stereotype can vary widely depending on cultural, social, and individual perspectives, robust explainability is crucial. Explainable models ensure that these nuanced judgments can be understood and validated by human users, promoting trust and accountability. We address these challenges by introducing HEARTS (Holistic Framework for Explainable, Sustainable, and Robust Text Stereotype Detection), a framework that enhances model performance, minimises carbon footprint, and provides transparent, interpretable explanations. We establish the Expanded Multi-Grain Stereotype Dataset (EMGSD), comprising 57,201 labeled texts across six groups, including under-represented demographics like LGBTQ+ and regional stereotypes. Ablation studies confirm that BERT models fine-tuned on EMGSD outperform those trained on individual components. We then analyse a fine-tuned, carbon-efficient ALBERT-V2 model using SHAP to generate token-level importance values, ensuring alignment with human understanding, and calculate explainability confidence scores by comparing SHAP and LIME outputs. Finally, HEARTS is applied to assess stereotypical bias in 12 LLM outputs, revealing a gradual reduction in bias over time within model families.
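To make the explainability step concrete, below is a minimal sketch of how token-level SHAP importances and word-level LIME weights could be produced for a fine-tuned stereotype classifier and compared to yield a rough agreement score. The checkpoint name ("albert-base-v2" as a stand-in for the fine-tuned model), the example sentence, the assumption that class index 1 is the "stereotype" label, and the top-k Jaccard overlap used as the confidence metric are all illustrative assumptions, not the paper's exact implementation.

```python
# Sketch: SHAP and LIME attributions for a text-classification pipeline,
# plus a simple agreement ("confidence") score between the two explanations.
import numpy as np
import shap
from lime.lime_text import LimeTextExplainer
from transformers import pipeline

# Stand-in for the fine-tuned ALBERT-V2 stereotype classifier (assumption).
clf = pipeline("text-classification", model="albert-base-v2",
               return_all_scores=True, truncation=True)

text = "Women are bad at mathematics."

# --- SHAP: per-token contributions to the assumed "stereotype" class (index 1) ---
shap_explainer = shap.Explainer(clf)
shap_exp = shap_explainer([text])
tokens = list(shap_exp.data[0])
shap_scores = shap_exp.values[0][:, 1]

# --- LIME: per-word weights from a local surrogate model ---
def predict_proba(texts):
    # Return class probabilities in a fixed label order for every input.
    outputs = clf(list(texts))
    return np.array([[d["score"] for d in sorted(scores, key=lambda d: d["label"])]
                     for scores in outputs])

lime_explainer = LimeTextExplainer(class_names=["non-stereotype", "stereotype"])
lime_exp = lime_explainer.explain_instance(text, predict_proba,
                                           num_features=10, num_samples=500)
lime_weights = dict(lime_exp.as_list(label=1))

# --- Agreement: Jaccard overlap of the top-k most important words under each
# method; one plausible way to turn SHAP/LIME agreement into a confidence score.
k = 3
shap_top = {tokens[i].strip().lower() for i in np.argsort(-np.abs(shap_scores))[:k]}
lime_top = {w.lower() for w, _ in sorted(lime_weights.items(),
                                         key=lambda kv: -abs(kv[1]))[:k]}
confidence = len(shap_top & lime_top) / max(len(shap_top | lime_top), 1)
print(f"Top-{k} SHAP/LIME agreement: {confidence:.2f}")
```

Under this reading, a score near 1 would indicate that both explanation methods highlight the same tokens as driving the stereotype prediction, while a low score would flag an explanation that warrants closer human review.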