Can We Count on LLMs? The Fixed-Effect Fallacy and Claims of GPT-4 Capabilities
Thomas Ball, Shuo Chen, Cormac Herley
arXiv:2409.07638 (2024-09-11)

In this paper we explore the evaluation of LLM capabilities. We present measurements of GPT-4 performance on several deterministic tasks; each task involves a basic calculation and takes as input a parameter drawn from a large, well-defined population (e.g., count the elements in a list, multiply two k-digit numbers). We examine several conditions per task and perform enough trials that statistically significant differences can be detected. This allows us to investigate the sensitivity of task accuracy both to query phrasing and to the input parameter population. We find that seemingly trivial modifications to the task prompt or input population can yield differences far larger than sampling effects can explain. For example, performance on a simple list-counting task varies with query phrasing and list length, but also with list composition (i.e., the thing to be counted) and object frequency (e.g., success when an element accounts for roughly 50% of a list differs from when it accounts for roughly 70%). We conclude that efforts to quantify LLM capabilities easily succumb to the language-as-fixed-effect fallacy, in which experimental observations are improperly generalized beyond what the data supports. One consequence appears to be that intuitions formed through interactions with humans are a very unreliable guide to which input modifications should "make no difference" to LLM performance.
{"title":"Can We Count on LLMs? The Fixed-Effect Fallacy and Claims of GPT-4 Capabilities","authors":"Thomas Ball, Shuo Chen, Cormac Herley","doi":"arxiv-2409.07638","DOIUrl":"https://doi.org/arxiv-2409.07638","url":null,"abstract":"In this paper we explore evaluation of LLM capabilities. We present\u0000measurements of GPT-4 performance on several deterministic tasks; each task\u0000involves a basic calculation and takes as input parameter some element drawn\u0000from a large well-defined population (e.g., count elements in a list, multiply\u0000two k-digit numbers, etc). We examine several conditions per-task and perform\u0000enough trials so that statistically significant differences can be detected.\u0000This allows us to investigate the sensitivity of task-accuracy both to query\u0000phrasing and input parameter population. We find that seemingly trivial\u0000modifications in the task-prompt or input population can yield differences far\u0000larger than can be explained by sampling effects. For example, performance on a\u0000simple list-counting task varies with query-phrasing and list-length, but also\u0000with list composition (i.e., the thing-to-be-counted) and object frequency\u0000(e.g., success when an element accounts for $approx$ 50% of a list is\u0000different from when it accounts for $approx$ 70% etc). We conclude that efforts to quantify LLM capabilities easily succumb to the\u0000language-as-fixed-effect fallacy, where experimental observations are\u0000improperly generalized beyond what the data supports. A consequence appears to\u0000be that intuitions that have been formed based on interactions with humans form\u0000a very unreliable guide as to which input modifications should ``make no\u0000difference'' to LLM performance.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"34 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Traceable LLM-based validation of statements in knowledge graphs
Daniel Adam, Tomáš Kliegr
arXiv:2409.07507 (2024-09-11)

This article presents a method for verifying RDF triples using LLMs, with an emphasis on providing traceable arguments. Because LLMs currently cannot reliably identify the origin of the information used to construct their responses, our approach avoids using internal LLM factual knowledge altogether. Instead, the RDF statements to be verified are compared against chunks of external documents retrieved through a web search or from Wikipedia. To assess the applicability of this workflow to biosciences content, we evaluated 1,719 positive statements from the BioRED dataset and an equal number of newly generated negative statements. The resulting precision is 88% and recall is 44%, indicating that the method requires human oversight. We demonstrate the method on Wikidata, where a SPARQL query is used to automatically retrieve statements needing verification. Overall, the results suggest that LLMs could be used for large-scale verification of statements in knowledge graphs (KGs), a task previously unfeasible due to human annotation costs.
{"title":"Traceable LLM-based validation of statements in knowledge graphs","authors":"Daniel Adam, Tomáš Kliegr","doi":"arxiv-2409.07507","DOIUrl":"https://doi.org/arxiv-2409.07507","url":null,"abstract":"This article presents a method for verifying RDF triples using LLMs, with an\u0000emphasis on providing traceable arguments. Because the LLMs cannot currently\u0000reliably identify the origin of the information used to construct the response\u0000to the user query, our approach is to avoid using internal LLM factual\u0000knowledge altogether. Instead, verified RDF statements are compared to chunks\u0000of external documents retrieved through a web search or Wikipedia. To assess\u0000the possible application of this workflow on biosciences content, we evaluated\u00001,719 positive statements from the BioRED dataset and the same number of newly\u0000generated negative statements. The resulting precision is 88%, and recall is\u000044%. This indicates that the method requires human oversight. We demonstrate\u0000the method on Wikidata, where a SPARQL query is used to automatically retrieve\u0000statements needing verification. Overall, the results suggest that LLMs could\u0000be used for large-scale verification of statements in KGs, a task previously\u0000unfeasible due to human annotation costs.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"96 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Novel Mathematical Framework for Objective Evaluation of Ideas using a Conversational AI (CAI) System
B. Sankar, Dibakar Sen
arXiv:2409.07578 (2024-09-11)

The demand for innovation in product design necessitates a prolific ideation phase. Conversational AI (CAI) systems that use Large Language Models (LLMs) such as GPT (Generative Pre-trained Transformer) have been shown to be fruitful in augmenting human creativity, providing numerous novel and diverse ideas. Despite this success in ideation quantity, the qualitative assessment of the ideas remains challenging and traditionally reliant on expert human evaluation, which suffers from limitations such as human judgment errors, bias, and oversight. Addressing this gap, our study introduces a comprehensive mathematical framework for automated analysis that objectively evaluates the plethora of ideas generated by CAI systems and/or humans. The framework is particularly advantageous for novice designers who lack experience in selecting promising ideas. By converting ideas into higher-dimensional vectors and quantitatively measuring the diversity between them using tools such as UMAP, DBSCAN, and PCA, the proposed method provides a reliable and objective way of selecting the most promising ideas, thereby enhancing the efficiency of the ideation phase.
{"title":"A Novel Mathematical Framework for Objective Evaluation of Ideas using a Conversational AI (CAI) System","authors":"B. Sankar, Dibakar Sen","doi":"arxiv-2409.07578","DOIUrl":"https://doi.org/arxiv-2409.07578","url":null,"abstract":"The demand for innovation in product design necessitates a prolific ideation\u0000phase. Conversational AI (CAI) systems that use Large Language Models (LLMs)\u0000such as GPT (Generative Pre-trained Transformer) have been shown to be fruitful\u0000in augmenting human creativity, providing numerous novel and diverse ideas.\u0000Despite the success in ideation quantity, the qualitative assessment of these\u0000ideas remains challenging and traditionally reliant on expert human evaluation.\u0000This method suffers from limitations such as human judgment errors, bias, and\u0000oversight. Addressing this gap, our study introduces a comprehensive\u0000mathematical framework for automated analysis to objectively evaluate the\u0000plethora of ideas generated by CAI systems and/or humans. This framework is\u0000particularly advantageous for novice designers who lack experience in selecting\u0000promising ideas. By converting the ideas into higher dimensional vectors and\u0000quantitatively measuring the diversity between them using tools such as UMAP,\u0000DBSCAN and PCA, the proposed method provides a reliable and objective way of\u0000selecting the most promising ideas, thereby enhancing the efficiency of the\u0000ideation phase.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MAGDA: Multi-agent guideline-driven diagnostic assistance
David Bani-Harouni, Nassir Navab, Matthias Keicher
arXiv:2409.06351 (2024-09-10)
In emergency departments, rural hospitals, or clinics in less developed regions, clinicians often lack fast image analysis by trained radiologists, which can have a detrimental effect on patients' healthcare. Large Language Models (LLMs) have the potential to alleviate some of the pressure on these clinicians by providing insights that support their decision-making. While LLMs achieve high scores on medical exams, showcasing their broad theoretical medical knowledge, they tend not to follow medical guidelines. In this work, we introduce a new approach for zero-shot, guideline-driven decision support. We model a system of multiple LLM agents, augmented with a contrastive vision-language model, that collaborate to reach a patient diagnosis. After being provided with simple diagnostic guidelines, the agents synthesize prompts and screen the image for findings according to those guidelines. Finally, they provide understandable chain-of-thought reasoning for their diagnosis, which is then self-refined to account for interdependencies between diseases. Because our method is zero-shot, it is adaptable to settings with rare diseases, where training data is limited but expert-crafted disease descriptions are available. We evaluate the method on two chest X-ray datasets, CheXpert and ChestX-ray 14 Longtail, showing improvements over existing zero-shot methods and generalizability to rare diseases.
{"title":"MAGDA: Multi-agent guideline-driven diagnostic assistance","authors":"David Bani-Harouni, Nassir Navab, Matthias Keicher","doi":"arxiv-2409.06351","DOIUrl":"https://doi.org/arxiv-2409.06351","url":null,"abstract":"In emergency departments, rural hospitals, or clinics in less developed\u0000regions, clinicians often lack fast image analysis by trained radiologists,\u0000which can have a detrimental effect on patients' healthcare. Large Language\u0000Models (LLMs) have the potential to alleviate some pressure from these\u0000clinicians by providing insights that can help them in their decision-making.\u0000While these LLMs achieve high test results on medical exams showcasing their\u0000great theoretical medical knowledge, they tend not to follow medical\u0000guidelines. In this work, we introduce a new approach for zero-shot\u0000guideline-driven decision support. We model a system of multiple LLM agents\u0000augmented with a contrastive vision-language model that collaborate to reach a\u0000patient diagnosis. After providing the agents with simple diagnostic\u0000guidelines, they will synthesize prompts and screen the image for findings\u0000following these guidelines. Finally, they provide understandable\u0000chain-of-thought reasoning for their diagnosis, which is then self-refined to\u0000consider inter-dependencies between diseases. As our method is zero-shot, it is\u0000adaptable to settings with rare diseases, where training data is limited, but\u0000expert-crafted disease descriptions are available. We evaluate our method on\u0000two chest X-ray datasets, CheXpert and ChestX-ray 14 Longtail, showcasing\u0000performance improvement over existing zero-shot methods and generalizability to\u0000rare diseases.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"69 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shadowed AHP for multi-criteria supplier selection
Mohamed Abdel Hameed El-Hawy
arXiv:2409.09082 (2024-09-10)

Numerous multi-criteria decision-making (MCDM) techniques have been proposed in a variety of business domains; one of the best known is the Analytic Hierarchy Process (AHP). Various kinds of uncertain numbers are commonly used to represent preference values in AHP problems, and several methods have been proposed to address AHP problems involving multi-granularity linguistic information. This paper introduces a novel method for solving such problems using shadowed fuzzy numbers (SFNs), which approximate fuzzy numbers of different types while preserving their uncertainty properties. The proposed Shadowed AHP method handles preference values represented by multiple types of uncertain numbers: it converts multi-granular preference values into a unified shadowed-fuzzy-number model and exploits the properties of that model. A new ranking approach is also introduced to order the aggregated preferences. We apply the new approach to a supplier selection problem in which multi-granular information is used. The features of the new approach are significant for decision-making applications.
{"title":"Shadowed AHP for multi-criteria supplier selection","authors":"Mohamed Abdel Hameed El-Hawy","doi":"arxiv-2409.09082","DOIUrl":"https://doi.org/arxiv-2409.09082","url":null,"abstract":"Numerous techniques of multi-criteria decision-making (MCDM) have been\u0000proposed in a variety of business domains. One of the well-known methods is the\u0000Analytical Hierarchical Process (AHP). Various uncertain numbers are commonly\u0000used to represent preference values in AHP problems. In the case of\u0000multi-granularity linguistic information, several methods have been proposed to\u0000address this type of AHP problem. This paper introduces a novel method to solve\u0000this problem using shadowed fuzzy numbers (SFNs). These numbers are\u0000characterized by approximating different types of fuzzy numbers and preserving\u0000their uncertainty properties. The new Shadowed AHP method is proposed to handle\u0000preference values which are represented by multi-types of uncertain numbers.\u0000The new approach converts multi-granular preference values into unified model\u0000of shadowed fuzzy numbers and utilizes their properties. A new ranking approach\u0000is introduced to order the results of aggregation preferences. The new approach\u0000is applied to solve a supplier selection problem in which multi-granular\u0000information are used. The features of the new approach are significant for\u0000decision-making applications.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142252702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Case Study: Leveraging GenAI to Build AI-based Surrogates and Regressors for Modeling Radio Frequency Heating in Fusion Energy Science
E. Wes Bethel, Vianna Cramer, Alexander del Rio, Lothar Narins, Chris Pestano, Satvik Verma, Erick Arias, Nicola Bertelli, Talita Perciano, Syun'ichi Shiraiwa, Álvaro Sánchez Villar, Greg Wallace, John C. Wright
arXiv:2409.06122 (2024-09-10)
This work presents a detailed case study on using Generative AI (GenAI) to develop AI surrogates for simulation models in fusion energy research. The scope covers the methodology, implementation, and results of using GenAI to assist in model development and optimization, and compares these results with models developed manually in earlier work.
{"title":"Case Study: Leveraging GenAI to Build AI-based Surrogates and Regressors for Modeling Radio Frequency Heating in Fusion Energy Science","authors":"E. Wes Bethel, Vianna Cramer, Alexander del Rio, Lothar Narins, Chris Pestano, Satvik Verma, Erick Arias, Nicola Bertelli, Talita Perciano, Syun'ichi Shiraiwa, Álvaro Sánchez Villar, Greg Wallace, John C. Wright","doi":"arxiv-2409.06122","DOIUrl":"https://doi.org/arxiv-2409.06122","url":null,"abstract":"This work presents a detailed case study on using Generative AI (GenAI) to\u0000develop AI surrogates for simulation models in fusion energy research. The\u0000scope includes the methodology, implementation, and results of using GenAI to\u0000assist in model development and optimization, comparing these results with\u0000previous manually developed models.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Applying Attribution Explanations in Truth-Discovery Quantitative Bipolar Argumentation Frameworks
Xiang Yin, Nico Potyka, Francesca Toni
arXiv:2409.05831 (2024-09-09)

Explaining the strength of arguments under gradual semantics is receiving increasing attention. For example, various studies in the literature offer explanations by computing the attribution scores of arguments or edges in Quantitative Bipolar Argumentation Frameworks (QBAFs). These explanations, known as Argument Attribution Explanations (AAEs) and Relation Attribution Explanations (RAEs), commonly employ removal-based and Shapley-based techniques for computing the attribution scores. While AAEs and RAEs have proven useful in several applications with acyclic QBAFs, they remain largely unexplored for cyclic QBAFs. Furthermore, existing applications tend to focus solely on either AAEs or RAEs rather than comparing them directly. In this paper, we apply both AAEs and RAEs to Truth-Discovery QBAFs (TD-QBAFs), which assess the trustworthiness of sources (e.g., websites) and their claims (e.g., the severity of a virus) and feature complex cycles. We find that both AAEs and RAEs can provide interesting explanations and yield non-trivial, surprising insights.
{"title":"Applying Attribution Explanations in Truth-Discovery Quantitative Bipolar Argumentation Frameworks","authors":"Xiang Yin, Nico Potyka, Francesca Toni","doi":"arxiv-2409.05831","DOIUrl":"https://doi.org/arxiv-2409.05831","url":null,"abstract":"Explaining the strength of arguments under gradual semantics is receiving\u0000increasing attention. For example, various studies in the literature offer\u0000explanations by computing the attribution scores of arguments or edges in\u0000Quantitative Bipolar Argumentation Frameworks (QBAFs). These explanations,\u0000known as Argument Attribution Explanations (AAEs) and Relation Attribution\u0000Explanations (RAEs), commonly employ removal-based and Shapley-based techniques\u0000for computing the attribution scores. While AAEs and RAEs have proven useful in\u0000several applications with acyclic QBAFs, they remain largely unexplored for\u0000cyclic QBAFs. Furthermore, existing applications tend to focus solely on either\u0000AAEs or RAEs, but do not compare them directly. In this paper, we apply both\u0000AAEs and RAEs, to Truth Discovery QBAFs (TD-QBAFs), which assess the\u0000trustworthiness of sources (e.g., websites) and their claims (e.g., the\u0000severity of a virus), and feature complex cycles. We find that both AAEs and\u0000RAEs can provide interesting explanations and can give non-trivial and\u0000surprising insights.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"37 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MLLM-FL: Multimodal Large Language Model Assisted Federated Learning on Heterogeneous and Long-tailed Data
Jianyi Zhang, Hao Frank Yang, Ang Li, Xin Guo, Pu Wang, Haiming Wang, Yiran Chen, Hai Li
arXiv:2409.06067 (2024-09-09)
Previous studies on federated learning (FL) often encounter performance degradation due to data heterogeneity among clients. In light of recent advances in multimodal large language models (MLLMs) such as GPT-4V and LLaVA, which demonstrate exceptional proficiency in multimodal tasks such as image captioning and multimodal question answering, we introduce a novel federated learning framework, Multimodal Large Language Model Assisted Federated Learning (MLLM-FL), which employs powerful MLLMs at the server end to address the heterogeneity and long-tail challenges. Owing to the advanced cross-modality representation capabilities and extensive open-vocabulary prior knowledge of MLLMs, our framework is adept at harnessing the extensive, yet previously underexploited, open-source data accessible from websites, together with powerful server-side computational resources. Hence, MLLM-FL not only enhances performance but also avoids increasing the risk of privacy leakage and the computational burden on local devices, distinguishing it from prior methodologies. Our framework has three key stages. First, prior to local training on the clients' local datasets, we conduct global visual-text pretraining of the model, facilitated by the extensive open-source data available online and the assistance of MLLMs. The pretrained model is then distributed to the clients for local training. Finally, once the locally trained models are transmitted back to the server, a global alignment is carried out under the supervision of MLLMs to further enhance performance. Experimental evaluations on established benchmarks show that our framework delivers promising performance in typical FL scenarios with data heterogeneity and long-tailed distributions across clients.
{"title":"MLLM-FL: Multimodal Large Language Model Assisted Federated Learning on Heterogeneous and Long-tailed Data","authors":"Jianyi Zhang, Hao Frank Yang, Ang Li, Xin Guo, Pu Wang, Haiming Wang, Yiran Chen, Hai Li","doi":"arxiv-2409.06067","DOIUrl":"https://doi.org/arxiv-2409.06067","url":null,"abstract":"Previous studies on federated learning (FL) often encounter performance\u0000degradation due to data heterogeneity among different clients. In light of the\u0000recent advances in multimodal large language models (MLLMs), such as GPT-4v and\u0000LLaVA, which demonstrate their exceptional proficiency in multimodal tasks,\u0000such as image captioning and multimodal question answering. We introduce a\u0000novel federated learning framework, named Multimodal Large Language Model\u0000Assisted Federated Learning (MLLM-FL), which which employs powerful MLLMs at\u0000the server end to address the heterogeneous and long-tailed challenges. Owing\u0000to the advanced cross-modality representation capabilities and the extensive\u0000open-vocabulary prior knowledge of MLLMs, our framework is adept at harnessing\u0000the extensive, yet previously underexploited, open-source data accessible from\u0000websites and powerful server-side computational resources. Hence, the MLLM-FL\u0000not only enhances the performance but also avoids increasing the risk of\u0000privacy leakage and the computational burden on local devices, distinguishing\u0000it from prior methodologies. Our framework has three key stages. Initially,\u0000prior to local training on local datasets of clients, we conduct global\u0000visual-text pretraining of the model. This pretraining is facilitated by\u0000utilizing the extensive open-source data available online, with the assistance\u0000of multimodal large language models. Subsequently, the pretrained model is\u0000distributed among various clients for local training. Finally, once the locally\u0000trained models are transmitted back to the server, a global alignment is\u0000carried out under the supervision of MLLMs to further enhance the performance.\u0000Experimental evaluations on established benchmarks, show that our framework\u0000delivers promising performance in the typical scenarios with data heterogeneity\u0000and long-tail distribution across different clients in FL.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"156 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep Generative Model for Mechanical System Configuration Design
Yasaman Etesam, Hyunmin Cheong, Mohammadmehdi Ataei, Pradeep Kumar Jayaraman
arXiv:2409.06016 (2024-09-09)

Generative AI has made remarkable progress in addressing various design challenges. One prominent area where it could bring significant value is engineering design. In particular, selecting an optimal set of components and their interfaces to create a mechanical system that meets design requirements is one of the most challenging and time-consuming tasks for engineers. This configuration design task is inherently challenging due to its categorical nature, the multiple design requirements a solution must satisfy, and the reliance on physics simulations for evaluating potential solutions. These characteristics entail solving a combinatorial optimization problem with multiple constraints involving black-box functions. To address this challenge, we propose a deep generative model to predict the optimal combination of components and interfaces for a given design problem. To demonstrate our approach, we solve a gear-train synthesis problem by first creating a synthetic dataset using a grammar, a parts catalogue, and a physics simulator. We then train a Transformer, named GearFormer, on this dataset; it can not only generate quality solutions on its own but also augment search methods such as evolutionary algorithms and Monte Carlo tree search. We show that GearFormer outperforms such search methods on their own in terms of satisfying the specified design requirements, with orders-of-magnitude faster generation time. Additionally, we showcase the benefit of hybrid methods that combine GearFormer with search methods, further improving the quality of the solutions.
{"title":"Deep Generative Model for Mechanical System Configuration Design","authors":"Yasaman Etesam, Hyunmin Cheong, Mohammadmehdi Ataei, Pradeep Kumar Jayaraman","doi":"arxiv-2409.06016","DOIUrl":"https://doi.org/arxiv-2409.06016","url":null,"abstract":"Generative AI has made remarkable progress in addressing various design\u0000challenges. One prominent area where generative AI could bring significant\u0000value is in engineering design. In particular, selecting an optimal set of\u0000components and their interfaces to create a mechanical system that meets design\u0000requirements is one of the most challenging and time-consuming tasks for\u0000engineers. This configuration design task is inherently challenging due to its\u0000categorical nature, multiple design requirements a solution must satisfy, and\u0000the reliance on physics simulations for evaluating potential solutions. These\u0000characteristics entail solving a combinatorial optimization problem with\u0000multiple constraints involving black-box functions. To address this challenge,\u0000we propose a deep generative model to predict the optimal combination of\u0000components and interfaces for a given design problem. To demonstrate our\u0000approach, we solve a gear train synthesis problem by first creating a synthetic\u0000dataset using a grammar, a parts catalogue, and a physics simulator. We then\u0000train a Transformer using this dataset, named GearFormer, which can not only\u0000generate quality solutions on its own, but also augment search methods such as\u0000an evolutionary algorithm and Monte Carlo tree search. We show that GearFormer\u0000outperforms such search methods on their own in terms of satisfying the\u0000specified design requirements with orders of magnitude faster generation time.\u0000Additionally, we showcase the benefit of hybrid methods that leverage both\u0000GearFormer and search methods, which further improve the quality of the\u0000solutions.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"79 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semifactual Explanations for Reinforcement Learning
Jasmina Gajcin, Jovan Jeromela, Ivana Dusparic
arXiv:2409.05435 (2024-09-09)

Reinforcement Learning (RL) is a learning paradigm in which the agent learns from its environment through trial and error. Deep reinforcement learning (DRL) algorithms represent the agent's policies using neural networks, making their decisions difficult to interpret. Explaining the behaviour of DRL agents is necessary to advance user trust, increase engagement, and facilitate integration with real-life tasks. Semifactual explanations aim to explain an outcome by providing "even if" scenarios, such as "even if the car were moving twice as slowly, it would still have to swerve to avoid crashing". Semifactuals help users understand the effects of different factors on the outcome and support the optimisation of resources. While extensively studied in psychology and even utilised in supervised learning, semifactuals have not been used to explain the decisions of RL systems. In this work, we develop a first approach to generating semifactual explanations for RL agents. We start by defining five properties of desirable semifactual explanations in RL and then introduce SGRL-Rewind and SGRL-Advance, the first algorithms for generating semifactual explanations in RL. We evaluate the algorithms in two standard RL environments and find that they generate semifactuals that are easier to reach, represent the agent's policy better, and are more diverse than those of the baselines. Lastly, we conduct and analyse a user study to assess participants' perception of semifactual explanations of the agent's actions.
{"title":"Semifactual Explanations for Reinforcement Learning","authors":"Jasmina Gajcin, Jovan Jeromela, Ivana Dusparic","doi":"arxiv-2409.05435","DOIUrl":"https://doi.org/arxiv-2409.05435","url":null,"abstract":"Reinforcement Learning (RL) is a learning paradigm in which the agent learns\u0000from its environment through trial and error. Deep reinforcement learning (DRL)\u0000algorithms represent the agent's policies using neural networks, making their\u0000decisions difficult to interpret. Explaining the behaviour of DRL agents is\u0000necessary to advance user trust, increase engagement, and facilitate\u0000integration with real-life tasks. Semifactual explanations aim to explain an\u0000outcome by providing \"even if\" scenarios, such as \"even if the car were moving\u0000twice as slowly, it would still have to swerve to avoid crashing\". Semifactuals\u0000help users understand the effects of different factors on the outcome and\u0000support the optimisation of resources. While extensively studied in psychology\u0000and even utilised in supervised learning, semifactuals have not been used to\u0000explain the decisions of RL systems. In this work, we develop a first approach\u0000to generating semifactual explanations for RL agents. We start by defining five\u0000properties of desirable semifactual explanations in RL and then introducing\u0000SGRL-Rewind and SGRL-Advance, the first algorithms for generating semifactual\u0000explanations in RL. We evaluate the algorithms in two standard RL environments\u0000and find that they generate semifactuals that are easier to reach, represent\u0000the agent's policy better, and are more diverse compared to baselines. Lastly,\u0000we conduct and analyse a user study to assess the participant's perception of\u0000semifactual explanations of the agent's actions.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142193901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}