RUIE: Retrieval-based Unified Information Extraction using Large Language Model
Xincheng Liao, Junwen Duan, Yixi Huang, Jianxin Wang
Unified information extraction (UIE) aims to complete all information extraction tasks using a single model or framework. While previous work has primarily focused on instruction-tuning large language models (LLMs) with constructed datasets, these methods require significant computational resources and struggle to generalize to unseen tasks. To address these limitations, we propose RUIE (Retrieval-based Unified Information Extraction), a framework that leverages in-context learning to enable rapid generalization while reducing computational costs. The key challenge in RUIE is selecting the most beneficial demonstrations for LLMs to effectively handle diverse IE tasks. To achieve this, we integrate LLM preferences for ranking candidate demonstrations and design a keyword-enhanced reward model to capture fine-grained relationships between queries and demonstrations. We then train a bi-encoder retriever for UIE through contrastive learning and knowledge distillation. To the best of our knowledge, RUIE is the first trainable retrieval framework for UIE. Experimental results on 8 held-out datasets demonstrate RUIE's effectiveness in generalizing to unseen tasks, with average F1-score improvements of 19.22 and 3.13 compared to instruction-tuning methods and other retrievers, respectively. Further analysis confirms RUIE's adaptability to LLMs of varying sizes and the importance of its key components.
{"title":"RUIE: Retrieval-based Unified Information Extraction using Large Language Model","authors":"Xincheng Liao, Junwen Duan, Yixi Huang, Jianxin Wang","doi":"arxiv-2409.11673","DOIUrl":"https://doi.org/arxiv-2409.11673","url":null,"abstract":"Unified information extraction (UIE) aims to complete all information\u0000extraction tasks using a single model or framework. While previous work has\u0000primarily focused on instruction-tuning large language models (LLMs) with\u0000constructed datasets, these methods require significant computational resources\u0000and struggle to generalize to unseen tasks. To address these limitations, we\u0000propose RUIE (Retrieval-based Unified Information Extraction), a framework that\u0000leverages in-context learning to enable rapid generalization while reducing\u0000computational costs. The key challenge in RUIE is selecting the most beneficial\u0000demonstrations for LLMs to effectively handle diverse IE tasks. To achieve\u0000this, we integrate LLM preferences for ranking candidate demonstrations and\u0000design a keyword-enhanced reward model to capture fine-grained relationships\u0000between queries and demonstrations. We then train a bi-encoder retriever for\u0000UIE through contrastive learning and knowledge distillation. To the best of our\u0000knowledge, RUIE is the first trainable retrieval framework for UIE.\u0000Experimental results on 8 held-out datasets demonstrate RUIE's effectiveness in\u0000generalizing to unseen tasks, with average F1-score improvements of 19.22 and\u00003.13 compared to instruction-tuning methods and other retrievers, respectively.\u0000Further analysis confirms RUIE's adaptability to LLMs of varying sizes and the\u0000importance of its key components.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning
Justin Chih-Yao Chen, Archiki Prasad, Swarnadeep Saha, Elias Stengel-Eskin, Mohit Bansal
Large language model (LLM) reasoning can be improved using test-time aggregation strategies, i.e., generating multiple samples and voting among them. While these improve performance, they often reach a saturation point. Refinement offers an alternative by using LLM-generated feedback to improve solution quality. However, refinement introduces three key challenges: (1) Excessive refinement: uniformly refining all instances can over-correct and reduce overall performance. (2) Inability to localize and address errors: LLMs have a limited ability to self-correct and struggle to identify and fix their own mistakes. (3) Insufficient refinement: deciding how many iterations of refinement are needed is non-trivial, and stopping too soon can leave errors unaddressed. To tackle these issues, we propose MAgICoRe, which avoids excessive refinement by categorizing problem difficulty as easy or hard, solving easy problems with coarse-grained aggregation and hard ones with fine-grained, iterative multi-agent refinement. To improve error localization, we incorporate external step-wise reward model (RM) scores. To ensure effective refinement, we employ a multi-agent loop with three agents: a Solver, a Reviewer (which generates targeted feedback based on the step-wise RM scores), and a Refiner (which incorporates that feedback). To ensure sufficient refinement, we re-evaluate updated solutions and iteratively initiate further rounds of refinement. We evaluate MAgICoRe on Llama-3-8B and GPT-3.5 and show its effectiveness across five math datasets. Even one iteration of MAgICoRe beats Self-Consistency by 3.4%, Best-of-k by 3.2%, and Self-Refine by 4.0% while using less than half the samples. Unlike the iterative-refinement baselines, MAgICoRe continues to improve with more iterations. Finally, our ablations highlight the importance of MAgICoRe's RMs and multi-agent communication.
{"title":"MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning","authors":"Justin Chih-Yao Chen, Archiki Prasad, Swarnadeep Saha, Elias Stengel-Eskin, Mohit Bansal","doi":"arxiv-2409.12147","DOIUrl":"https://doi.org/arxiv-2409.12147","url":null,"abstract":"Large Language Models' (LLM) reasoning can be improved using test-time\u0000aggregation strategies, i.e., generating multiple samples and voting among\u0000generated samples. While these improve performance, they often reach a\u0000saturation point. Refinement offers an alternative by using LLM-generated\u0000feedback to improve solution quality. However, refinement introduces 3 key\u0000challenges: (1) Excessive refinement: Uniformly refining all instances can\u0000over-correct and reduce the overall performance. (2) Inability to localize and\u0000address errors: LLMs have a limited ability to self-correct and struggle to\u0000identify and correct their own mistakes. (3) Insufficient refinement: Deciding\u0000how many iterations of refinement are needed is non-trivial, and stopping too\u0000soon could leave errors unaddressed. To tackle these issues, we propose\u0000MAgICoRe, which avoids excessive refinement by categorizing problem difficulty\u0000as easy or hard, solving easy problems with coarse-grained aggregation and hard\u0000ones with fine-grained and iterative multi-agent refinement. To improve error\u0000localization, we incorporate external step-wise reward model (RM) scores.\u0000Moreover, to ensure effective refinement, we employ a multi-agent loop with\u0000three agents: Solver, Reviewer (which generates targeted feedback based on\u0000step-wise RM scores), and the Refiner (which incorporates feedback). To ensure\u0000sufficient refinement, we re-evaluate updated solutions, iteratively initiating\u0000further rounds of refinement. We evaluate MAgICoRe on Llama-3-8B and GPT-3.5\u0000and show its effectiveness across 5 math datasets. Even one iteration of\u0000MAgICoRe beats Self-Consistency by 3.4%, Best-of-k by 3.2%, and Self-Refine by\u00004.0% while using less than half the samples. Unlike iterative refinement with\u0000baselines, MAgICoRe continues to improve with more iterations. Finally, our\u0000ablations highlight the importance of MAgICoRe's RMs and multi-agent\u0000communication.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LLMs in Education: Novel Perspectives, Challenges, and Opportunities
Bashar Alhafni, Sowmya Vajjala, Stefano Bannò, Kaushal Kumar Maurya, Ekaterina Kochmar
The role of large language models (LLMs) in education is an area of increasing interest, given the new opportunities they offer for teaching, learning, and assessment. This cutting-edge tutorial provides an overview of the educational applications of NLP and the impact that recent advances in LLMs have had on this field. We will discuss the key challenges and opportunities presented by LLMs, grounding them in the context of four major educational applications: reading, writing, and speaking skills, and intelligent tutoring systems (ITS). This COLING 2025 tutorial is designed for researchers and practitioners interested in the educational applications of NLP and the role LLMs have to play in this area. It is the first of its kind to address this timely topic.
{"title":"LLMs in Education: Novel Perspectives, Challenges, and Opportunities","authors":"Bashar Alhafni, Sowmya Vajjala, Stefano Bannò, Kaushal Kumar Maurya, Ekaterina Kochmar","doi":"arxiv-2409.11917","DOIUrl":"https://doi.org/arxiv-2409.11917","url":null,"abstract":"The role of large language models (LLMs) in education is an increasing area\u0000of interest today, considering the new opportunities they offer for teaching,\u0000learning, and assessment. This cutting-edge tutorial provides an overview of\u0000the educational applications of NLP and the impact that the recent advances in\u0000LLMs have had on this field. We will discuss the key challenges and\u0000opportunities presented by LLMs, grounding them in the context of four major\u0000educational applications: reading, writing, and speaking skills, and\u0000intelligent tutoring systems (ITS). This COLING 2025 tutorial is designed for\u0000researchers and practitioners interested in the educational applications of NLP\u0000and the role LLMs have to play in this area. It is the first of its kind to\u0000address this timely topic.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett
Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra "thinking" really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model's response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
{"title":"To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning","authors":"Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett","doi":"arxiv-2409.12183","DOIUrl":"https://doi.org/arxiv-2409.12183","url":null,"abstract":"Chain-of-thought (CoT) via prompting is the de facto method for eliciting\u0000reasoning capabilities from large language models (LLMs). But for what kinds of\u0000tasks is this extra ``thinking'' really helpful? To analyze this, we conducted\u0000a quantitative meta-analysis covering over 100 papers using CoT and ran our own\u0000evaluations of 20 datasets across 14 models. Our results show that CoT gives\u0000strong performance benefits primarily on tasks involving math or logic, with\u0000much smaller gains on other types of tasks. On MMLU, directly generating the\u0000answer without CoT leads to almost identical accuracy as CoT unless the\u0000question or model's response contains an equals sign, indicating symbolic\u0000operations and reasoning. Following this finding, we analyze the behavior of\u0000CoT on these problems by separating planning and execution and comparing\u0000against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic\u0000execution, but it underperforms relative to using a symbolic solver. Our\u0000results indicate that CoT can be applied selectively, maintaining performance\u0000while saving inference costs. Furthermore, they suggest a need to move beyond\u0000prompt-based CoT to new paradigms that better leverage intermediate computation\u0000across the whole range of LLM applications.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qwen2.5-Coder Technical Report
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, Junyang Lin
In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. The series includes two models: Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B. As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continually pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general versatility. The model has been evaluated on a wide range of code-related tasks, achieving state-of-the-art (SOTA) performance across more than 10 benchmarks, including code generation, completion, reasoning, and repair, consistently outperforming even larger models. We believe that the release of the Qwen2.5-Coder series will not only push the boundaries of research in code intelligence but also, through its permissive licensing, encourage broader adoption by developers in real-world applications.
{"title":"Qwen2.5-Coder Technical Report","authors":"Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, Junyang Lin","doi":"arxiv-2409.12186","DOIUrl":"https://doi.org/arxiv-2409.12186","url":null,"abstract":"In this report, we introduce the Qwen2.5-Coder series, a significant upgrade\u0000from its predecessor, CodeQwen1.5. This series includes two models:\u0000Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B. As a code-specific model,\u0000Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretrained\u0000on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning,\u0000scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder\u0000demonstrates impressive code generation capabilities while retaining general\u0000versatility. The model has been evaluated on a wide range of code-related\u0000tasks, achieving state-of-the-art (SOTA) performance across more than 10\u0000benchmarks, including code generation, completion, reasoning, and repair,\u0000consistently outperforming larger models of the same model size. We believe\u0000that the release of the Qwen2.5-Coder series will not only push the boundaries\u0000of research in code intelligence but also, through its permissive licensing,\u0000encourage broader adoption by developers in real-world applications.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GRIN: GRadient-INformed MoE
Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing, selectively activating only a small subset of expert modules. However, sparse computation challenges traditional training practices, as discrete expert routing hinders standard backpropagation and thus gradient-based optimization, the cornerstone of deep learning. To better pursue the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develop a top-2 16×3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data. Extensive evaluations across diverse tasks demonstrate the potential of GRIN to significantly enhance MoE efficacy, achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.
{"title":"GRIN: GRadient-INformed MoE","authors":"Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen","doi":"arxiv-2409.12136","DOIUrl":"https://doi.org/arxiv-2409.12136","url":null,"abstract":"Mixture-of-Experts (MoE) models scale more effectively than dense models due\u0000to sparse computation through expert routing, selectively activating only a\u0000small subset of expert modules. However, sparse computation challenges\u0000traditional training practices, as discrete expert routing hinders standard\u0000backpropagation and thus gradient-based optimization, which are the cornerstone\u0000of deep learning. To better pursue the scaling power of MoE, we introduce GRIN\u0000(GRadient-INformed MoE training), which incorporates sparse gradient estimation\u0000for expert routing and configures model parallelism to avoid token dropping.\u0000Applying GRIN to autoregressive language modeling, we develop a top-2\u000016$times$3.8B MoE model. Our model, with only 6.6B activated parameters,\u0000outperforms a 7B dense model and matches the performance of a 14B dense model\u0000trained on the same data. Extensive evaluations across diverse tasks\u0000demonstrate the potential of GRIN to significantly enhance MoE efficacy,\u0000achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Controlled Study on Long Context Extension and Generalization in LLMs
Yi Lu, Jing Nathan Yan, Songlin Yang, Justin T. Chiu, Siyu Ren, Fei Yuan, Wenting Zhao, Zhiyong Wu, Alexander M. Rush
Broad textual understanding and in-context learning require language models that utilize full document contexts. Due to the implementation challenges associated with directly training long-context models, many methods have been proposed for extending models to handle long contexts. However, owing to differences in data and model classes, it has been challenging to compare these approaches, leading to uncertainty as to how to evaluate long-context performance and whether it differs from standard evaluation. We implement a controlled protocol for extension methods with a standardized evaluation, utilizing consistent base models and extension data. Our study yields several insights into long-context behavior. First, we reaffirm the critical role of perplexity as a general-purpose performance indicator even in longer-context tasks. Second, we find that current approximate attention methods systematically underperform across long-context tasks. Finally, we confirm that exact fine-tuning based methods are generally effective within the range of their extension, whereas extrapolation remains challenging. All codebases, models, and checkpoints will be made available open-source, promoting transparency and facilitating further research in this critical area of AI development.
{"title":"A Controlled Study on Long Context Extension and Generalization in LLMs","authors":"Yi Lu, Jing Nathan Yan, Songlin Yang, Justin T. Chiu, Siyu Ren, Fei Yuan, Wenting Zhao, Zhiyong Wu, Alexander M. Rush","doi":"arxiv-2409.12181","DOIUrl":"https://doi.org/arxiv-2409.12181","url":null,"abstract":"Broad textual understanding and in-context learning require language models\u0000that utilize full document contexts. Due to the implementation challenges\u0000associated with directly training long-context models, many methods have been\u0000proposed for extending models to handle long contexts. However, owing to\u0000differences in data and model classes, it has been challenging to compare these\u0000approaches, leading to uncertainty as to how to evaluate long-context\u0000performance and whether it differs from standard evaluation. We implement a\u0000controlled protocol for extension methods with a standardized evaluation,\u0000utilizing consistent base models and extension data. Our study yields several\u0000insights into long-context behavior. First, we reaffirm the critical role of\u0000perplexity as a general-purpose performance indicator even in longer-context\u0000tasks. Second, we find that current approximate attention methods\u0000systematically underperform across long-context tasks. Finally, we confirm that\u0000exact fine-tuning based methods are generally effective within the range of\u0000their extension, whereas extrapolation remains challenging. All codebases,\u0000models, and checkpoints will be made available open-source, promoting\u0000transparency and facilitating further research in this critical area of AI\u0000development.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
You Only Read Once (YORO): Learning to Internalize Database Knowledge for Text-to-SQL
Hideo Kobayashi, Wuwei Lan, Peng Shi, Shuaichen Chang, Jiang Guo, Henghui Zhu, Zhiguo Wang, Patrick Ng
While significant progress has been made on the text-to-SQL task, recent solutions repeatedly encode the same database schema for every question, resulting in unnecessarily high inference cost and often overlooking crucial database knowledge. To address these issues, we propose You Only Read Once (YORO), a novel paradigm that directly internalizes database knowledge into the parametric knowledge of a text-to-SQL model during training and eliminates the need for schema encoding during inference. YORO significantly reduces the input token length, by 66%-98%. Despite its shorter inputs, our empirical results demonstrate YORO's competitive performance with traditional systems on three benchmarks, as well as its significant outperformance on large databases. Furthermore, YORO excels at handling questions that require challenging value retrieval, such as abbreviations.
{"title":"You Only Read Once (YORO): Learning to Internalize Database Knowledge for Text-to-SQL","authors":"Hideo Kobayashi, Wuwei Lan, Peng Shi, Shuaichen Chang, Jiang Guo, Henghui Zhu, Zhiguo Wang, Patrick Ng","doi":"arxiv-2409.12172","DOIUrl":"https://doi.org/arxiv-2409.12172","url":null,"abstract":"While significant progress has been made on the text-to-SQL task, recent\u0000solutions repeatedly encode the same database schema for every question,\u0000resulting in unnecessary high inference cost and often overlooking crucial\u0000database knowledge. To address these issues, we propose You Only Read Once\u0000(YORO), a novel paradigm that directly internalizes database knowledge into the\u0000parametric knowledge of a text-to-SQL model during training and eliminates the\u0000need for schema encoding during inference. YORO significantly reduces the input\u0000token length by 66%-98%. Despite its shorter inputs, our empirical results\u0000demonstrate YORO's competitive performances with traditional systems on three\u0000benchmarks as well as its significant outperformance on large databases.\u0000Furthermore, YORO excels in handling questions with challenging value\u0000retrievals such as abbreviation.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Human-like Affective Cognition in Foundation Models
Kanishk Gandhi, Zoe Lynch, Jan-Philipp Fränken, Kayla Patterson, Sharon Wambu, Tobias Gerstenberg, Desmond C. Ong, Noah D. Goodman
Understanding emotions is fundamental to human interaction and experience. Humans easily infer emotions from situations or facial expressions, infer situations from emotions, and carry out a variety of other forms of affective cognition. How adept is modern AI at these inferences? We introduce an evaluation framework for testing affective cognition in foundation models. Starting from psychological theory, we generate 1,280 diverse scenarios exploring relationships between appraisals, emotions, expressions, and outcomes. We evaluate the abilities of foundation models (GPT-4, Claude-3, Gemini-1.5-Pro) and humans (N = 567) across carefully selected conditions. Our results show that foundation models tend to agree with human intuitions, matching or exceeding interparticipant agreement. In some conditions, models are "superhuman": they predict modal human judgements better than the average human does. All models benefit from chain-of-thought reasoning. This suggests foundation models have acquired a human-like understanding of emotions and their influence on beliefs and behavior.
{"title":"Human-like Affective Cognition in Foundation Models","authors":"Kanishk Gandhi, Zoe Lynch, Jan-Philipp Fränken, Kayla Patterson, Sharon Wambu, Tobias Gerstenberg, Desmond C. Ong, Noah D. Goodman","doi":"arxiv-2409.11733","DOIUrl":"https://doi.org/arxiv-2409.11733","url":null,"abstract":"Understanding emotions is fundamental to human interaction and experience.\u0000Humans easily infer emotions from situations or facial expressions, situations\u0000from emotions, and do a variety of other emph{affective cognition}. How adept\u0000is modern AI at these inferences? We introduce an evaluation framework for\u0000testing affective cognition in foundation models. Starting from psychological\u0000theory, we generate 1,280 diverse scenarios exploring relationships between\u0000appraisals, emotions, expressions, and outcomes. We evaluate the abilities of\u0000foundation models (GPT-4, Claude-3, Gemini-1.5-Pro) and humans (N = 567) across\u0000carefully selected conditions. Our results show foundation models tend to agree\u0000with human intuitions, matching or exceeding interparticipant agreement. In\u0000some conditions, models are ``superhuman'' -- they better predict modal human\u0000judgements than the average human. All models benefit from chain-of-thought\u0000reasoning. This suggests foundation models have acquired a human-like\u0000understanding of emotions and their influence on beliefs and behavior.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning
Xinyuan Lu, Liangming Pan, Yubo Ma, Preslav Nakov, Min-Yen Kan
Current large language models (LLMs) exhibit limited ability to understand table structures and to apply precise numerical reasoning, which is crucial for tasks such as table question answering (TQA) and table-based fact verification (TFV). To address these challenges, we introduce our Tool-Augmented Reasoning framework for Tables (TART), which integrates LLMs with specialized tools. TART contains three key components: a table formatter to ensure accurate data representation, a tool maker to develop specific computational tools, and an explanation generator to maintain explainability. We also present the TOOLTAB dataset, a new benchmark designed specifically for training LLMs in table-tool integration. Our experiments indicate that TART achieves substantial improvements over existing methods (e.g., Chain-of-Thought) by improving both the precision of data processing and the clarity of the reasoning process. Notably, TART paired with CodeLlama achieves 90.0% of the accuracy of the closed-source LLM GPT-3.5-turbo, highlighting its robustness in diverse real-world scenarios. All code and data are available at https://github.com/XinyuanLu00/TART.
{"title":"TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning","authors":"Xinyuan Lu, Liangming Pan, Yubo Ma, Preslav Nakov, Min-Yen Kan","doi":"arxiv-2409.11724","DOIUrl":"https://doi.org/arxiv-2409.11724","url":null,"abstract":"Current Large Language Models (LLMs) exhibit limited ability to understand\u0000table structures and to apply precise numerical reasoning, which is crucial for\u0000tasks such as table question answering (TQA) and table-based fact verification\u0000(TFV). To address these challenges, we introduce our Tool-Augmented Reasoning\u0000framework for Tables (TART), which integrates LLMs with specialized tools. TART\u0000contains three key components: a table formatter to ensure accurate data\u0000representation, a tool maker to develop specific computational tools, and an\u0000explanation generator to maintain explainability. We also present the TOOLTAB\u0000dataset, a new benchmark designed specifically for training LLMs in table-tool\u0000integration. Our experiments indicate that TART achieves substantial\u0000improvements over existing methods (e.g., Chain-of-Thought) by improving both\u0000the precision of data processing and the clarity of the reasoning process.\u0000Notably, TART paired with CodeLlama achieves 90.0% of the accuracy of the\u0000closed-sourced LLM GPT-3.5-turbo, highlighting its robustness in diverse\u0000real-world scenarios. All the code and data are available at\u0000https://github.com/XinyuanLu00/TART.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}