Xincheng Liao, Junwen Duan, Yixi Huang, Jianxin Wang
Unified information extraction (UIE) aims to complete all information extraction tasks using a single model or framework. While previous work has primarily focused on instruction-tuning large language models (LLMs) with constructed datasets, these methods require significant computational resources and struggle to generalize to unseen tasks. To address these limitations, we propose RUIE (Retrieval-based Unified Information Extraction), a framework that leverages in-context learning to enable rapid generalization while reducing computational costs. The key challenge in RUIE is selecting the most beneficial demonstrations for LLMs to effectively handle diverse IE tasks. To achieve this, we integrate LLM preferences for ranking candidate demonstrations and design a keyword-enhanced reward model to capture fine-grained relationships between queries and demonstrations. We then train a bi-encoder retriever for UIE through contrastive learning and knowledge distillation. To the best of our knowledge, RUIE is the first trainable retrieval framework for UIE. Experimental results on 8 held-out datasets demonstrate RUIE's effectiveness in generalizing to unseen tasks, with average F1-score improvements of 19.22 and 3.13 compared to instruction-tuning methods and other retrievers, respectively. Further analysis confirms RUIE's adaptability to LLMs of varying sizes and the importance of its key components.
{"title":"RUIE: Retrieval-based Unified Information Extraction using Large Language Model","authors":"Xincheng Liao, Junwen Duan, Yixi Huang, Jianxin Wang","doi":"arxiv-2409.11673","DOIUrl":"https://doi.org/arxiv-2409.11673","journal":{"name":"arXiv - CS - Computation and Language"},"publicationDate":"2024-09-18"}
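The demonstration-selection step described in the RUIE abstract above can be illustrated with a minimal sketch. It assumes a toy bag-of-words `embed` function in place of the trained bi-encoder, and the names `embed`, `cosine`, and `retrieve_top_k` are hypothetical; the abstract does not specify the encoder, the scoring function, or the training losses in detail, so this only shows the general shape of similarity-based retrieval for in-context learning.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a trained encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query, pool, k=2):
    """Return the k candidate demonstrations most similar to the query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


pool = [
    "extract the person names from the sentence",
    "classify the relation between two entities",
    "translate the sentence into french",
]
best = retrieve_top_k("extract person names", pool, k=1)
```

In a real system the retrieved demonstrations would be prepended to the prompt, and the encoder would be trained so that its ranking agrees with the LLM's preferences rather than with word overlap.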
The reasoning of large language models (LLMs) can be improved using test-time aggregation strategies, i.e., generating multiple samples and voting among them. While these strategies improve performance, they often reach a saturation point. Refinement offers an alternative by using LLM-generated feedback to improve solution quality. However, refinement introduces three key challenges: (1) Excessive refinement: Uniformly refining all instances can over-correct and reduce the overall performance. (2) Inability to localize and address errors: LLMs have a limited ability to self-correct and struggle to identify and correct their own mistakes. (3) Insufficient refinement: Deciding how many iterations of refinement are needed is non-trivial, and stopping too soon could leave errors unaddressed. To tackle these issues, we propose MAgICoRe, which avoids excessive refinement by categorizing problem difficulty as easy or hard, solving easy problems with coarse-grained aggregation and hard ones with fine-grained and iterative multi-agent refinement. To improve error localization, we incorporate external step-wise reward model (RM) scores. Moreover, to ensure effective refinement, we employ a multi-agent loop with three agents: Solver, Reviewer (which generates targeted feedback based on step-wise RM scores), and Refiner (which incorporates feedback). To ensure sufficient refinement, we re-evaluate updated solutions, iteratively initiating further rounds of refinement. We evaluate MAgICoRe on Llama-3-8B and GPT-3.5 and show its effectiveness across 5 math datasets. Even one iteration of MAgICoRe beats Self-Consistency by 3.4%, Best-of-k by 3.2%, and Self-Refine by 4.0% while using less than half the samples. Unlike iterative refinement with baselines, MAgICoRe continues to improve with more iterations. Finally, our ablations highlight the importance of MAgICoRe's RMs and multi-agent communication.
{"title":"MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning","authors":"Justin Chih-Yao Chen, Archiki Prasad, Swarnadeep Saha, Elias Stengel-Eskin, Mohit Bansal","doi":"arxiv-2409.12147","DOIUrl":"https://doi.org/arxiv-2409.12147","journal":{"name":"arXiv - CS - Computation and Language"},"publicationDate":"2024-09-18"}
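The coarse-to-fine idea in the MAgICoRe abstract above (aggregate on easy problems, refine iteratively on hard ones) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' procedure: difficulty is judged here by vote agreement alone, whereas the abstract also relies on step-wise reward model scores, and `refine`, `threshold`, and `max_rounds` are hypothetical names and values.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and the fraction of samples agreeing."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def solve(answers, refine, threshold=0.6, max_rounds=3):
    """Aggregate if the samples agree (easy); otherwise refine iteratively (hard)."""
    answer, agreement = majority_vote(answers)
    if agreement >= threshold:
        return answer, "easy"
    for _ in range(max_rounds):
        # `refine` stands in for the Reviewer + Refiner agents (assumption).
        answers = [refine(a) for a in answers]
        answer, agreement = majority_vote(answers)
        if agreement >= threshold:
            break  # re-evaluation says further refinement is not needed
    return answer, "hard"


easy = solve(["4", "4", "4", "5"], refine=lambda a: a)
hard = solve(["4", "5", "6"], refine=lambda a: "4")
```

The early `return` on the easy branch is what avoids excessive refinement, and the re-evaluation inside the loop is what decides whether another round is warranted.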
The role of large language models (LLMs) in education is an increasing area of interest today, considering the new opportunities they offer for teaching, learning, and assessment. This cutting-edge tutorial provides an overview of the educational applications of NLP and the impact that the recent advances in LLMs have had on this field. We will discuss the key challenges and opportunities presented by LLMs, grounding them in the context of four major educational applications: reading, writing, and speaking skills, and intelligent tutoring systems (ITS). This COLING 2025 tutorial is designed for researchers and practitioners interested in the educational applications of NLP and the role LLMs have to play in this area. It is the first of its kind to address this timely topic.
{"title":"LLMs in Education: Novel Perspectives, Challenges, and Opportunities","authors":"Bashar Alhafni, Sowmya Vajjala, Stefano Bannò, Kaushal Kumar Maurya, Ekaterina Kochmar","doi":"arxiv-2409.11917","DOIUrl":"https://doi.org/arxiv-2409.11917","journal":{"name":"arXiv - CS - Computation and Language"},"publicationDate":"2024-09-18"}
Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett
Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra "thinking" really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy to CoT unless the question or model's response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
{"title":"To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning","authors":"Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett","doi":"arxiv-2409.12183","DOIUrl":"https://doi.org/arxiv-2409.12183","journal":{"name":"arXiv - CS - Computation and Language"},"publicationDate":"2024-09-18"}
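The suggestion in the abstract above that CoT can be applied selectively might look like the following in its crudest form. This is only a sketch built on the equals-sign observation reported for MMLU; the abstract does not propose this exact router, the prompt wording is invented for illustration, and `needs_cot` and `build_prompt` are hypothetical names.

```python
def needs_cot(question):
    """Crude proxy for symbolic content: does the question contain '='?"""
    return "=" in question


def build_prompt(question):
    """Request step-by-step reasoning only when the proxy fires."""
    if needs_cot(question):
        return question + "\nLet's think step by step."
    return question + "\nAnswer directly with the final choice."


symbolic = build_prompt("If x = 2, what is x + 1?")
factual = build_prompt("Which planet is closest to the sun?")
```

Skipping the reasoning trace on questions where the proxy does not fire is where the inference savings would come from; a production router would need a far better signal than a single character.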
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, Junyang Lin
In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes two models: Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B. As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and is further pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general versatility. The model has been evaluated on a wide range of code-related tasks, achieving state-of-the-art (SOTA) performance across more than 10 benchmarks, including code generation, completion, reasoning, and repair, consistently outperforming larger models of the same model size. We believe that the release of the Qwen2.5-Coder series will not only push the boundaries of research in code intelligence but also, through its permissive licensing, encourage broader adoption by developers in real-world applications.
{"title":"Qwen2.5-Coder Technical Report","authors":"Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, Junyang Lin","doi":"arxiv-2409.12186","DOIUrl":"https://doi.org/arxiv-2409.12186","journal":{"name":"arXiv - CS - Computation and Language"},"publicationDate":"2024-09-18"}
Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing, selectively activating only a small subset of expert modules. However, sparse computation challenges traditional training practices, as discrete expert routing hinders standard backpropagation and thus gradient-based optimization, which are the cornerstone of deep learning. To better pursue the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develop a top-2 16×3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data. Extensive evaluations across diverse tasks demonstrate the potential of GRIN to significantly enhance MoE efficacy, achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.
{"title":"GRIN: GRadient-INformed MoE","authors":"Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen","doi":"arxiv-2409.12136","DOIUrl":"https://doi.org/arxiv-2409.12136","journal":{"name":"arXiv - CS - Computation and Language"},"publicationDate":"2024-09-18"}
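To make the "discrete expert routing" mentioned in the GRIN abstract above concrete, here is a minimal sketch of a generic top-2 router forward pass. It shows only the conventional selection step that makes backpropagation awkward (the index choice is not differentiable); it does not implement GRIN's sparse gradient estimation, and the function name `top2_route` and the renormalization of the two gate weights are assumptions, not details taken from the abstract.

```python
import math


def top2_route(logits):
    """Pick the two highest-probability experts and renormalize their weights."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The hard selection below is the discrete, non-differentiable step.
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = probs[top2[0]] + probs[top2[1]]
    return [(i, probs[i] / norm) for i in top2]


routes = top2_route([0.0, 2.0, 1.0, -1.0])
```

Each token's output would then be the weighted sum of the two selected experts' outputs, so only a fraction of the total parameters is active per token.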
Kanishk Gandhi, Zoe Lynch, Jan-Philipp Fränken, Kayla Patterson, Sharon Wambu, Tobias Gerstenberg, Desmond C. Ong, Noah D. Goodman
Understanding emotions is fundamental to human interaction and experience. Humans easily infer emotions from situations or facial expressions, situations from emotions, and do a variety of other affective cognition. How adept is modern AI at these inferences? We introduce an evaluation framework for testing affective cognition in foundation models. Starting from psychological theory, we generate 1,280 diverse scenarios exploring relationships between appraisals, emotions, expressions, and outcomes. We evaluate the abilities of foundation models (GPT-4, Claude-3, Gemini-1.5-Pro) and humans (N = 567) across carefully selected conditions. Our results show foundation models tend to agree with human intuitions, matching or exceeding interparticipant agreement. In some conditions, models are "superhuman": they better predict modal human judgements than the average human. All models benefit from chain-of-thought reasoning. This suggests foundation models have acquired a human-like understanding of emotions and their influence on beliefs and behavior.
{"title":"Human-like Affective Cognition in Foundation Models","authors":"Kanishk Gandhi, Zoe Lynch, Jan-Philipp Fränken, Kayla Patterson, Sharon Wambu, Tobias Gerstenberg, Desmond C. Ong, Noah D. Goodman","doi":"arxiv-2409.11733","DOIUrl":"https://doi.org/arxiv-2409.11733","journal":{"name":"arXiv - CS - Computation and Language"},"publicationDate":"2024-09-18"}
Xinyuan Lu, Liangming Pan, Yubo Ma, Preslav Nakov, Min-Yen Kan
Current Large Language Models (LLMs) exhibit limited ability to understand table structures and to apply precise numerical reasoning, which is crucial for tasks such as table question answering (TQA) and table-based fact verification (TFV). To address these challenges, we introduce our Tool-Augmented Reasoning framework for Tables (TART), which integrates LLMs with specialized tools. TART contains three key components: a table formatter to ensure accurate data representation, a tool maker to develop specific computational tools, and an explanation generator to maintain explainability. We also present the TOOLTAB dataset, a new benchmark designed specifically for training LLMs in table-tool integration. Our experiments indicate that TART achieves substantial improvements over existing methods (e.g., Chain-of-Thought) by improving both the precision of data processing and the clarity of the reasoning process. Notably, TART paired with CodeLlama achieves 90.0% of the accuracy of the closed-source LLM GPT-3.5-turbo, highlighting its robustness in diverse real-world scenarios. All the code and data are available at https://github.com/XinyuanLu00/TART.
{"title":"TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning","authors":"Xinyuan Lu, Liangming Pan, Yubo Ma, Preslav Nakov, Min-Yen Kan","doi":"arxiv-2409.11724","DOIUrl":"https://doi.org/arxiv-2409.11724","journal":{"name":"arXiv - CS - Computation and Language"},"publicationDate":"2024-09-18"}
Large language models (LLMs) have enabled a range of applications in zero-shot and few-shot learning settings, including the generation of synthetic datasets for training and testing. However, to reliably use these synthetic datasets, it is essential to understand how representative they are of real-world data. We investigate this by assessing the effectiveness of generating synthetic data with LLMs and using it as a benchmark for various NLP tasks. Our experiments across six datasets and three different tasks show that while synthetic data can effectively capture the performance of various methods for simpler tasks, such as intent classification, it falls short for more complex tasks like named entity recognition. Additionally, we propose a new metric called the bias factor, which evaluates the biases introduced when the same LLM is used both to generate benchmarking data and to perform the tasks. We find that smaller LLMs exhibit biases towards their own generated data, whereas larger models do not. Overall, our findings suggest that the effectiveness of synthetic data as a benchmark varies depending on the task, and that practitioners should rely on data generated from multiple larger models whenever possible.
{"title":"Efficacy of Synthetic Data as a Benchmark","authors":"Gaurav Maheshwari, Dmitry Ivanov, Kevin El Haddad","doi":"arxiv-2409.11968","DOIUrl":"https://doi.org/arxiv-2409.11968","journal":{"name":"arXiv - CS - Computation and Language"},"publicationDate":"2024-09-18"}
Mahammed Kamruzzaman, Abdullah Al Monsur, Shrabon Das, Enamul Hassan, Gene Louis Kim
This study presents BanStereoSet, a dataset designed to evaluate stereotypical social biases in multilingual LLMs for the Bangla language. In an effort to extend the focus of bias research beyond English-centric datasets, we have localized the content from the StereoSet, IndiBias, and Kamruzzaman et al.'s datasets, producing a resource tailored to capture biases prevalent within the Bangla-speaking community. Our BanStereoSet dataset consists of 1,194 sentences spanning 9 categories of bias: race, profession, gender, ageism, beauty, beauty in profession, region, caste, and religion. This dataset not only serves as a crucial tool for measuring bias in multilingual LLMs but also facilitates the exploration of stereotypical bias across different social categories, potentially guiding the development of more equitable language technologies in Bangladeshi contexts. Our analysis of several language models using this dataset indicates significant biases, reinforcing the necessity for culturally and linguistically adapted datasets to develop more equitable language technologies.
{"title":"BanStereoSet: A Dataset to Measure Stereotypical Social Biases in LLMs for Bangla","authors":"Mahammed Kamruzzaman, Abdullah Al Monsur, Shrabon Das, Enamul Hassan, Gene Louis Kim","doi":"arxiv-2409.11638","DOIUrl":"https://doi.org/arxiv-2409.11638","journal":{"name":"arXiv - CS - Computation and Language"},"publicationDate":"2024-09-18"}