In the past decade, social media platforms have been widely used for information dissemination and consumption. While a major portion of the content is posted to promote citizen journalism and public awareness, some content is posted to mislead users. Among different content types such as text, images, and videos, memes (text overlaid on images) are particularly prevalent and can serve as powerful vehicles for propaganda, hate, and humor. In the current literature, there have been efforts to detect such content in memes individually; however, the study of their intersection is very limited. In this study, we explore the intersection between propaganda and hate in memes using a multi-agent LLM-based approach. We extend the propagandistic meme dataset with coarse- and fine-grained hate labels. Our findings suggest that there is an association between propaganda and hate in memes. We provide detailed experimental results that can serve as a baseline for future studies, and we will make the experimental resources publicly available to the community.
{"title":"Propaganda to Hate: A Multimodal Analysis of Arabic Memes with Multi-Agent LLMs","authors":"Firoj Alam, Md. Rafiul Biswas, Uzair Shah, Wajdi Zaghouani, Georgios Mikros","doi":"arxiv-2409.07246","DOIUrl":"https://doi.org/arxiv-2409.07246","url":null,"abstract":"In the past decade, social media platforms have been used for information\u0000dissemination and consumption. While a major portion of the content is posted\u0000to promote citizen journalism and public awareness, some content is posted to\u0000mislead users. Among different content types such as text, images, and videos,\u0000memes (text overlaid on images) are particularly prevalent and can serve as\u0000powerful vehicles for propaganda, hate, and humor. In the current literature,\u0000there have been efforts to individually detect such content in memes. However,\u0000the study of their intersection is very limited. In this study, we explore the\u0000intersection between propaganda and hate in memes using a multi-agent LLM-based\u0000approach. We extend the propagandistic meme dataset with coarse and\u0000fine-grained hate labels. Our finding suggests that there is an association\u0000between propaganda and hate in memes. We provide detailed experimental results\u0000that can serve as a baseline for future studies. We will make the experimental\u0000resources publicly available to the community.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qianli Wang, Tatiana Anikina, Nils Feldhus, Simon Ostermann, Sebastian Möller, Vera Schmitt
Natural language explanations (NLEs) are vital for elucidating the reasoning behind large language model (LLM) decisions. Many techniques have been developed to generate NLEs using LLMs. However, like humans, LLMs might not always produce optimal NLEs on the first attempt. Inspired by human learning processes, we introduce Cross-Refine, which employs role modeling by deploying two LLMs as generator and critic, respectively. The generator outputs a first NLE and then refines this initial explanation using feedback and suggestions provided by the critic. Cross-Refine does not require any supervised training data or additional training. We validate Cross-Refine across three NLP tasks using three state-of-the-art open-source LLMs through automatic and human evaluation. As the baseline, we select Self-Refine (Madaan et al., 2023), which relies only on self-feedback to refine the explanations. Our findings from automatic evaluation and a user study indicate that Cross-Refine outperforms Self-Refine. Moreover, Cross-Refine performs effectively with less powerful LLMs, whereas Self-Refine only yields strong results with ChatGPT. Additionally, we conduct an ablation study to assess the importance of feedback and suggestions; both play an important role in refining explanations. We further evaluate Cross-Refine on a bilingual dataset in English and German.
{"title":"Cross-Refine: Improving Natural Language Explanation Generation by Learning in Tandem","authors":"Qianli Wang, Tatiana Anikina, Nils Feldhus, Simon Ostermann, Sebastian Möller, Vera Schmitt","doi":"arxiv-2409.07123","DOIUrl":"https://doi.org/arxiv-2409.07123","url":null,"abstract":"Natural language explanations (NLEs) are vital for elucidating the reasoning\u0000behind large language model (LLM) decisions. Many techniques have been\u0000developed to generate NLEs using LLMs. However, like humans, LLMs might not\u0000always produce optimal NLEs on first attempt. Inspired by human learning\u0000processes, we introduce Cross-Refine, which employs role modeling by deploying\u0000two LLMs as generator and critic, respectively. The generator outputs a first\u0000NLE and then refines this initial explanation using feedback and suggestions\u0000provided by the critic. Cross-Refine does not require any supervised training\u0000data or additional training. We validate Cross-Refine across three NLP tasks\u0000using three state-of-the-art open-source LLMs through automatic and human\u0000evaluation. We select Self-Refine (Madaan et al., 2023) as the baseline, which\u0000only utilizes self-feedback to refine the explanations. Our findings from\u0000automatic evaluation and a user study indicate that Cross-Refine outperforms\u0000Self-Refine. Meanwhile, Cross-Refine can perform effectively with less powerful\u0000LLMs, whereas Self-Refine only yields strong results with ChatGPT.\u0000Additionally, we conduct an ablation study to assess the importance of feedback\u0000and suggestions. Both of them play an important role in refining explanations.\u0000We further evaluate Cross-Refine on a bilingual dataset in English and German.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large Language Models (LLMs) have revolutionized numerous applications, making them an integral part of our digital ecosystem. However, their reliability becomes critical, especially when these models are exposed to misinformation. We primarily analyze the susceptibility of state-of-the-art LLMs to factual inaccuracies when they encounter false information in a question-answering (QA) scenario, an issue that can lead to a phenomenon we refer to as *knowledge drift*, which significantly undermines the trustworthiness of these models. We evaluate the factuality and the uncertainty of the models' responses using Entropy, Perplexity, and Token Probability metrics. Our experiments reveal that an LLM's uncertainty can increase by up to 56.6% when the question is answered incorrectly due to exposure to false information. At the same time, repeated exposure to the same false information can decrease the model's uncertainty again (-52.8% w.r.t. the answers on the untainted prompts), potentially manipulating the underlying model's beliefs and introducing a drift from its original knowledge. These findings provide insights into LLMs' robustness and vulnerability to adversarial inputs, paving the way for developing more reliable LLM applications across various domains. The code is available at https://github.com/afastowski/knowledge_drift.
{"title":"Understanding Knowledge Drift in LLMs through Misinformation","authors":"Alina Fastowski, Gjergji Kasneci","doi":"arxiv-2409.07085","DOIUrl":"https://doi.org/arxiv-2409.07085","url":null,"abstract":"Large Language Models (LLMs) have revolutionized numerous applications,\u0000making them an integral part of our digital ecosystem. However, their\u0000reliability becomes critical, especially when these models are exposed to\u0000misinformation. We primarily analyze the susceptibility of state-of-the-art\u0000LLMs to factual inaccuracies when they encounter false information in a QnA\u0000scenario, an issue that can lead to a phenomenon we refer to as *knowledge\u0000drift*, which significantly undermines the trustworthiness of these models. We\u0000evaluate the factuality and the uncertainty of the models' responses relying on\u0000Entropy, Perplexity, and Token Probability metrics. Our experiments reveal that\u0000an LLM's uncertainty can increase up to 56.6% when the question is answered\u0000incorrectly due to the exposure to false information. At the same time,\u0000repeated exposure to the same false information can decrease the models\u0000uncertainty again (-52.8% w.r.t. the answers on the untainted prompts),\u0000potentially manipulating the underlying model's beliefs and introducing a drift\u0000from its original knowledge. These findings provide insights into LLMs'\u0000robustness and vulnerability to adversarial inputs, paving the way for\u0000developing more reliable LLM applications across various domains. The code is\u0000available at https://github.com/afastowski/knowledge_drift.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Han Wang, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Knowledge conflict arises from discrepancies between information in the context of a large language model (LLM) and the knowledge stored in its parameters. This can hurt performance when using standard decoding techniques, which tend to ignore the context. Existing test-time contrastive methods seek to address this by comparing the LLM's output distribution with and without the context and adjusting the model according to the contrast between them. However, we find that these methods frequently misjudge the degree of conflict and struggle to handle instances that vary in their amount of conflict, with static methods over-adjusting when conflict is absent. We propose a fine-grained, instance-level approach called AdaCAD, which dynamically infers the weight of adjustment based on the degree of conflict, as measured by the Jensen-Shannon divergence between distributions representing contextual and parametric knowledge. Our experiments across four models on six diverse question-answering (QA) datasets and three summarization tasks demonstrate that our training-free adaptive method consistently outperforms other decoding methods on QA, with average accuracy gains of 14.21% (absolute) over a static contrastive baseline, and improves the factuality of summaries by 5.59 points (AlignScore). Furthermore, our analysis shows that while decoding with contrastive baselines hurts performance when conflict is absent, AdaCAD mitigates these losses, making it more applicable to real-world datasets in which some examples have conflict and others do not.
{"title":"AdaCAD: Adaptively Decoding to Balance Conflicts between Contextual and Parametric Knowledge","authors":"Han Wang, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal","doi":"arxiv-2409.07394","DOIUrl":"https://doi.org/arxiv-2409.07394","url":null,"abstract":"Knowledge conflict arises from discrepancies between information in the\u0000context of a large language model (LLM) and the knowledge stored in its\u0000parameters. This can hurt performance when using standard decoding techniques,\u0000which tend to ignore the context. Existing test-time contrastive methods seek\u0000to address this by comparing the LLM's output distribution with and without the\u0000context and adjust the model according to the contrast between them. However,\u0000we find that these methods frequently misjudge the degree of conflict and\u0000struggle to handle instances that vary in their amount of conflict, with static\u0000methods over-adjusting when conflict is absent. We propose a fine-grained,\u0000instance-level approach called AdaCAD, which dynamically infers the weight of\u0000adjustment based on the degree of conflict, as measured by the Jensen-Shannon\u0000divergence between distributions representing contextual and parametric\u0000knowledge. Our experiments across four models on six diverse question-answering\u0000(QA) datasets and three summarization tasks demonstrate that our training-free\u0000adaptive method consistently outperforms other decoding methods on QA, with\u0000average accuracy gains of 14.21% (absolute) over a static contrastive baseline,\u0000and improves the factuality of summaries by 5.59 (AlignScore). Furthermore, our\u0000analysis shows that while decoding with contrastive baselines hurts performance\u0000when conflict is absent, AdaCAD mitigates these losses, making it more\u0000applicable to real-world datasets in which some examples have conflict and\u0000others do not.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Daehee Kim, Deokhyung Kang, Sangwon Ryu, Gary Geunbae Lee
Knowledge Graph-to-Text (G2T) generation involves verbalizing structured knowledge graphs into natural language text. Recent advancements in Pretrained Language Models (PLMs) have improved G2T performance, but their effectiveness depends on datasets with precise graph-text alignment. However, the scarcity of high-quality, general-domain G2T datasets restricts progress in general-domain G2T generation research. To address this issue, we introduce the Wikipedia Ontology-Free Graph-text dataset (WikiOFGraph), a new large-scale G2T dataset generated using a novel method that leverages a Large Language Model (LLM) and Data-QuestEval. Our new dataset, which contains 5.85M general-domain graph-text pairs, offers high graph-text consistency without relying on external ontologies. Experimental results demonstrate that a PLM fine-tuned on WikiOFGraph outperforms those trained on other datasets across various evaluation metrics. Our method proves to be a scalable and effective solution for generating high-quality G2T data, significantly advancing the field of G2T generation.
{"title":"Ontology-Free General-Domain Knowledge Graph-to-Text Generation Dataset Synthesis using Large Language Model","authors":"Daehee Kim, Deokhyung Kang, Sangwon Ryu, Gary Geunbae Lee","doi":"arxiv-2409.07088","DOIUrl":"https://doi.org/arxiv-2409.07088","url":null,"abstract":"Knowledge Graph-to-Text (G2T) generation involves verbalizing structured\u0000knowledge graphs into natural language text. Recent advancements in Pretrained\u0000Language Models (PLMs) have improved G2T performance, but their effectiveness\u0000depends on datasets with precise graph-text alignment. However, the scarcity of\u0000high-quality, general-domain G2T generation datasets restricts progress in the\u0000general-domain G2T generation research. To address this issue, we introduce\u0000Wikipedia Ontology-Free Graph-text dataset (WikiOFGraph), a new large-scale G2T\u0000dataset generated using a novel method that leverages Large Language Model\u0000(LLM) and Data-QuestEval. Our new dataset, which contains 5.85M general-domain\u0000graph-text pairs, offers high graph-text consistency without relying on\u0000external ontologies. Experimental results demonstrate that PLM fine-tuned on\u0000WikiOFGraph outperforms those trained on other datasets across various\u0000evaluation metrics. Our method proves to be a scalable and effective solution\u0000for generating high-quality G2T data, significantly advancing the field of G2T\u0000generation.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alexandros Koulakos, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou
The surge of state-of-the-art Transformer-based models has undoubtedly pushed the limits of NLP model performance, excelling in a variety of tasks. We cast the spotlight on the underexplored task of Natural Language Inference (NLI), since models trained on popular well-suited datasets are susceptible to adversarial attacks, allowing subtle input interventions to mislead the model. In this work, we validate the use of natural language explanations as a model-agnostic defence strategy through extensive experimentation: merely fine-tuning a classifier on the explanation rather than on the premise-hypothesis inputs achieves robustness under various adversarial attacks compared to explanation-free baselines. Moreover, since there is no standard strategy for testing the semantic validity of the generated explanations, we investigate the correlation of widely used language generation metrics with human perception, so that they can serve as a proxy towards robust NLI models. Our approach is resource-efficient and reproducible without significant computational limitations.
{"title":"Enhancing adversarial robustness in Natural Language Inference using explanations","authors":"Alexandros Koulakos, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou","doi":"arxiv-2409.07423","DOIUrl":"https://doi.org/arxiv-2409.07423","url":null,"abstract":"The surge of state-of-the-art Transformer-based models has undoubtedly pushed\u0000the limits of NLP model performance, excelling in a variety of tasks. We cast\u0000the spotlight on the underexplored task of Natural Language Inference (NLI),\u0000since models trained on popular well-suited datasets are susceptible to\u0000adversarial attacks, allowing subtle input interventions to mislead the model.\u0000In this work, we validate the usage of natural language explanation as a\u0000model-agnostic defence strategy through extensive experimentation: only by\u0000fine-tuning a classifier on the explanation rather than premise-hypothesis\u0000inputs, robustness under various adversarial attacks is achieved in comparison\u0000to explanation-free baselines. Moreover, since there is no standard strategy of\u0000testing the semantic validity of the generated explanations, we research the\u0000correlation of widely used language generation metrics with human perception,\u0000in order for them to serve as a proxy towards robust NLI models. Our approach\u0000is resource-efficient and reproducible without significant computational\u0000limitations.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"2019 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, Graham Neubig
Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks that have complex action trajectories. In contrast, humans can flexibly solve complex tasks by learning reusable task workflows from past experiences and using them to guide future actions. To build agents that can similarly benefit from this process, we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly. We experiment on two major web navigation benchmarks -- Mind2Web and WebArena -- that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena, respectively, while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, cross-website, and cross-domain evaluations, surpassing baselines by 8.9 to 14.0 absolute points as train-test task distribution gaps widen.
{"title":"Agent Workflow Memory","authors":"Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, Graham Neubig","doi":"arxiv-2409.07429","DOIUrl":"https://doi.org/arxiv-2409.07429","url":null,"abstract":"Despite the potential of language model-based agents to solve real-world\u0000tasks such as web navigation, current methods still struggle with long-horizon\u0000tasks with complex action trajectories. In contrast, humans can flexibly solve\u0000complex tasks by learning reusable task workflows from past experiences and\u0000using them to guide future actions. To build agents that can similarly benefit\u0000from this process, we introduce Agent Workflow Memory (AWM), a method for\u0000inducing commonly reused routines, i.e., workflows, and selectively providing\u0000workflows to the agent to guide subsequent generations. AWM flexibly applies to\u0000both offline and online scenarios, where agents induce workflows from training\u0000examples beforehand or from test queries on the fly. We experiment on two major\u0000web navigation benchmarks -- Mind2Web and WebArena -- that collectively cover\u00001000+ tasks from 200+ domains across travel, shopping, and social media, among\u0000others. AWM substantially improves the baseline results by 24.6% and 51.1%\u0000relative success rate on Mind2Web and WebArena while reducing the number of\u0000steps taken to solve WebArena tasks successfully. Furthermore, online AWM\u0000robustly generalizes in cross-task, website, and domain evaluations, surpassing\u0000baselines from 8.9 to 14.0 absolute points as train-test task distribution gaps\u0000widen.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This study introduces InteractEval, a framework that integrates human expertise and Large Language Models (LLMs) using the Think-Aloud (TA) method to generate attributes for checklist-based text evaluation. By combining human flexibility and reasoning with LLM consistency, InteractEval outperforms traditional non-LLM-based and LLM-based baselines across four distinct dimensions: Coherence, Fluency, Consistency, and Relevance. The experiment also investigates the effectiveness of the TA method, showing that it promotes divergent thinking in both humans and LLMs, leading to the generation of a wider range of relevant attributes and enhancing text evaluation performance. Comparative analysis reveals that humans excel at identifying attributes related to internal quality (Coherence and Fluency), whereas LLMs perform better at attributes related to external alignment (Consistency and Relevance). Consequently, leveraging both humans and LLMs together produces the best evaluation outcomes. In short, this study emphasizes the necessity of effectively combining humans and LLMs in an automated checklist-based text evaluation framework. The code is available at https://github.com/BBeeChu/InteractEval.git.
{"title":"Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation","authors":"SeongYeub Chu, JongWoo Kim, MunYong Yi","doi":"arxiv-2409.07355","DOIUrl":"https://doi.org/arxiv-2409.07355","url":null,"abstract":"This study introduces textbf{InteractEval}, a framework that integrates\u0000human expertise and Large Language Models (LLMs) using the Think-Aloud (TA)\u0000method to generate attributes for checklist-based text evaluation. By combining\u0000human flexibility and reasoning with LLM consistency, InteractEval outperforms\u0000traditional non-LLM-based and LLM-based baselines across four distinct\u0000dimensions, consisting of Coherence, Fluency, Consistency, and Relevance. The\u0000experiment also investigates the effectiveness of the TA method, showing that\u0000it promotes divergent thinking in both humans and LLMs, leading to the\u0000generation of a wider range of relevant attributes and enhance text evaluation\u0000performance. Comparative analysis reveals that humans excel at identifying\u0000attributes related to internal quality (Coherence and Fluency), but LLMs\u0000perform better at those attributes related to external alignment (Consistency\u0000and Relevance). Consequently, leveraging both humans and LLMs together produces\u0000the best evaluation outcomes. In other words, this study emphasizes the\u0000necessity of effectively combining humans and LLMs in an automated\u0000checklist-based text evaluation framework. The code is available at\u0000textbf{url{https://github.com/BBeeChu/InteractEval.git}}.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jonathan D. Thomas, Andrea Silvi, Devdatt Dubhashi, Emil Carlsson, Moa Johansson
The emergence of mathematical concepts, such as number systems, is an understudied area in AI for mathematics and reasoning. It has previously been shown (Carlsson et al., 2021) that, by using reinforcement learning (RL), agents can derive simple approximate and exact-restricted numeral systems. However, it is a major challenge to show how more complex recursive numeral systems, similar to the one utilised in English, could arise via a simple learning mechanism such as RL. Here, we introduce an approach towards deriving a mechanistic explanation of the emergence of recursive number systems, in which we consider an RL agent that directly optimizes a lexicon under a given meta-grammar. Utilising a slightly modified version of the seminal meta-grammar of Hurford (1975), we demonstrate that our RL agent can effectively modify the lexicon towards Pareto-optimal configurations comparable to those observed within human numeral systems.
{"title":"Learning Efficient Recursive Numeral Systems via Reinforcement Learning","authors":"Jonathan D. Thomas, Andrea Silvi, Devdatt Dubhashi, Emil Carlsson, Moa Johansson","doi":"arxiv-2409.07170","DOIUrl":"https://doi.org/arxiv-2409.07170","url":null,"abstract":"The emergence of mathematical concepts, such as number systems, is an\u0000understudied area in AI for mathematics and reasoning. It has previously been\u0000shown Carlsson et al. (2021) that by using reinforcement learning (RL), agents\u0000can derive simple approximate and exact-restricted numeral systems. However, it\u0000is a major challenge to show how more complex recursive numeral systems,\u0000similar to the one utilised in English, could arise via a simple learning\u0000mechanism such as RL. Here, we introduce an approach towards deriving a\u0000mechanistic explanation of the emergence of recursive number systems where we\u0000consider an RL agent which directly optimizes a lexicon under a given\u0000meta-grammar. Utilising a slightly modified version of the seminal meta-grammar\u0000of Hurford (1975), we demonstrate that our RL agent can effectively modify the\u0000lexicon towards Pareto-optimal configurations which are comparable to those\u0000observed within human numeral systems.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"102 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mohamed Bayan Kmainasi, Rakif Khan, Ali Ezzat Shahroor, Boushra Bendou, Maram Hasanain, Firoj Alam
Large language models (LLMs) have shown remarkable abilities in different fields, including standard Natural Language Processing (NLP) tasks. Prompts, which consist of natural language instructions, play a key role in eliciting knowledge from LLMs. Most open and closed source LLMs are trained on available labeled and unlabeled resources--digital content such as text, images, audio, and videos. Hence, these models have better knowledge of high-resource languages but struggle with low-resource languages. Since prompts play a crucial role in understanding LLM capabilities, the language used for prompts remains an important research question. Although there has been significant research in this area, it is still limited, and little has been explored for medium- to low-resource languages. In this study, we investigate different prompting strategies (native vs. non-native) on 11 different NLP tasks associated with 12 different Arabic datasets (9.7K data points). In total, we conducted 197 experiments involving 3 LLMs, 12 datasets, and 3 prompting strategies. Our findings suggest that, on average, the non-native prompt performs best, followed by mixed and native prompts.
{"title":"Native vs Non-Native Language Prompting: A Comparative Analysis","authors":"Mohamed Bayan Kmainasi, Rakif Khan, Ali Ezzat Shahroor, Boushra Bendou, Maram Hasanain, Firoj Alam","doi":"arxiv-2409.07054","DOIUrl":"https://doi.org/arxiv-2409.07054","url":null,"abstract":"Large language models (LLMs) have shown remarkable abilities in different\u0000fields, including standard Natural Language Processing (NLP) tasks. To elicit\u0000knowledge from LLMs, prompts play a key role, consisting of natural language\u0000instructions. Most open and closed source LLMs are trained on available labeled\u0000and unlabeled resources--digital content such as text, images, audio, and\u0000videos. Hence, these models have better knowledge for high-resourced languages\u0000but struggle with low-resourced languages. Since prompts play a crucial role in\u0000understanding their capabilities, the language used for prompts remains an\u0000important research question. Although there has been significant research in\u0000this area, it is still limited, and less has been explored for medium to\u0000low-resourced languages. In this study, we investigate different prompting\u0000strategies (native vs. non-native) on 11 different NLP tasks associated with 12\u0000different Arabic datasets (9.7K data points). In total, we conducted 197\u0000experiments involving 3 LLMs, 12 datasets, and 3 prompting strategies. Our\u0000findings suggest that, on average, the non-native prompt performs the best,\u0000followed by mixed and native prompts.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142184356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}