We introduce Cognitive Kernel, an open-source agent system towards the goal of generalist autopilots. Unlike copilot systems, which primarily rely on users to provide essential state information (e.g., task descriptions) and assist users by answering questions or auto-completing content, autopilot systems must complete tasks from start to finish independently, which requires the system to actively acquire state information from its environment. To achieve this, an autopilot system should be capable of understanding user intents, actively gathering necessary information from various real-world sources, and making sound decisions. Cognitive Kernel adopts a model-centric design. In our implementation, the central policy model (a fine-tuned LLM) initiates interactions with the environment using a combination of atomic actions, such as opening files, clicking buttons, saving intermediate results to memory, or calling the LLM itself. This differs from the widely used environment-centric design, in which a task-specific environment with predefined actions is fixed and the policy model is limited to selecting the correct action from a given set of options. Our design facilitates seamless information flow across various sources and provides greater flexibility. We evaluate our system in three use cases: real-time information management, private information management, and long-term memory management. The results demonstrate that Cognitive Kernel achieves performance better than or comparable to other closed-source systems in these scenarios. Cognitive Kernel is fully dockerized, so anyone can deploy it privately and securely. We open-source the system and the backbone model to encourage further research on LLM-driven autopilot systems.
{"title":"Cognitive Kernel: An Open-source Agent System towards Generalist Autopilots","authors":"Hongming Zhang, Xiaoman Pan, Hongwei Wang, Kaixin Ma, Wenhao Yu, Dong Yu","doi":"arxiv-2409.10277","DOIUrl":"https://doi.org/arxiv-2409.10277","url":null,"abstract":"We introduce Cognitive Kernel, an open-source agent system towards the goal\u0000of generalist autopilots. Unlike copilot systems, which primarily rely on users\u0000to provide essential state information (e.g., task descriptions) and assist\u0000users by answering questions or auto-completing contents, autopilot systems\u0000must complete tasks from start to finish independently, which requires the\u0000system to acquire the state information from the environments actively. To\u0000achieve this, an autopilot system should be capable of understanding user\u0000intents, actively gathering necessary information from various real-world\u0000sources, and making wise decisions. Cognitive Kernel adopts a model-centric\u0000design. In our implementation, the central policy model (a fine-tuned LLM)\u0000initiates interactions with the environment using a combination of atomic\u0000actions, such as opening files, clicking buttons, saving intermediate results\u0000to memory, or calling the LLM itself. This differs from the widely used\u0000environment-centric design, where a task-specific environment with predefined\u0000actions is fixed, and the policy model is limited to selecting the correct\u0000action from a given set of options. Our design facilitates seamless information\u0000flow across various sources and provides greater flexibility. We evaluate our\u0000system in three use cases: real-time information management, private\u0000information management, and long-term memory management. The results\u0000demonstrate that Cognitive Kernel achieves better or comparable performance to\u0000other closed-source systems in these scenarios. Cognitive Kernel is fully\u0000dockerized, ensuring everyone can deploy it privately and securely. We\u0000open-source the system and the backbone model to encourage further research on\u0000LLM-driven autopilot systems.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142252657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent developments in language models have created new opportunities in air traffic control studies. The current focus is primarily on text and language-based use cases. However, these language models may offer a higher potential impact in the air traffic control domain, thanks to their ability to interact with air traffic environments in an embodied agent form. They also provide a language-based reasoning capability to explain their decisions, the lack of which has been a significant roadblock for the implementation of automatic air traffic control. This paper investigates the application of a language model-based agent with function-calling and learning capabilities to resolve air traffic conflicts without human intervention. The main components of this research are foundational large language models, tools that allow the agent to interact with the simulator, and a new concept, the experience library. An innovative part of this research, the experience library, is a vector database that stores synthesized knowledge that agents have learned from interactions with the simulations and language models. To evaluate the performance of our language model-based agent, both open-source and closed-source models were tested. The results of our study reveal significant differences in performance across various configurations of the language model-based agents. The best-performing configuration was able to solve all but one of the 120 imminent conflict scenarios, including scenarios involving up to four aircraft at once. Most importantly, the agents are able to provide human-level text explanations of traffic situations and conflict resolution strategies.
{"title":"Automatic Control With Human-Like Reasoning: Exploring Language Model Embodied Air Traffic Agents","authors":"Justas Andriuškevičius, Junzi Sun","doi":"arxiv-2409.09717","DOIUrl":"https://doi.org/arxiv-2409.09717","url":null,"abstract":"Recent developments in language models have created new opportunities in air\u0000traffic control studies. The current focus is primarily on text and\u0000language-based use cases. However, these language models may offer a higher\u0000potential impact in the air traffic control domain, thanks to their ability to\u0000interact with air traffic environments in an embodied agent form. They also\u0000provide a language-like reasoning capability to explain their decisions, which\u0000has been a significant roadblock for the implementation of automatic air\u0000traffic control. This paper investigates the application of a language model-based agent with\u0000function-calling and learning capabilities to resolve air traffic conflicts\u0000without human intervention. The main components of this research are\u0000foundational large language models, tools that allow the agent to interact with\u0000the simulator, and a new concept, the experience library. An innovative part of\u0000this research, the experience library, is a vector database that stores\u0000synthesized knowledge that agents have learned from interactions with the\u0000simulations and language models. To evaluate the performance of our language model-based agent, both\u0000open-source and closed-source models were tested. The results of our study\u0000reveal significant differences in performance across various configurations of\u0000the language model-based agents. The best-performing configuration was able to\u0000solve almost all 120 but one imminent conflict scenarios, including up to four\u0000aircraft at the same time. Most importantly, the agents are able to provide\u0000human-level text explanations on traffic situations and conflict resolution\u0000strategies.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142252821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The goal of aligning language models to human preferences requires data that reveal these preferences. Ideally, time and money would be spent carefully collecting and tailoring bespoke preference data to each downstream application. However, in practice, a select few publicly available preference datasets are often used to train reward models for reinforcement learning from human feedback (RLHF). While new preference datasets are being introduced with increasing frequency, there are currently no efforts to measure and compare these datasets. In this paper, we systematically study preference datasets through three perspectives: scale, label noise, and information content. We propose specific metrics for each of these perspectives and uncover different axes of comparison for a better understanding of preference datasets. Our work is a first step towards a data-centric approach to alignment by providing perspectives that aid in training efficiency and iterative data collection for RLHF.
{"title":"Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison","authors":"Judy Hanwen Shen, Archit Sharma, Jun Qin","doi":"arxiv-2409.09603","DOIUrl":"https://doi.org/arxiv-2409.09603","url":null,"abstract":"The goal of aligning language models to human preferences requires data that\u0000reveal these preferences. Ideally, time and money can be spent carefully\u0000collecting and tailoring bespoke preference data to each downstream\u0000application. However, in practice, a select few publicly available preference\u0000datasets are often used to train reward models for reinforcement learning from\u0000human feedback (RLHF). While new preference datasets are being introduced with\u0000increasing frequency, there are currently no existing efforts to measure and\u0000compare these datasets. In this paper, we systematically study preference\u0000datasets through three perspectives: scale, label noise, and information\u0000content. We propose specific metrics for each of these perspectives and uncover\u0000different axes of comparison for a better understanding of preference datasets.\u0000Our work is a first step towards a data-centric approach to alignment by\u0000providing perspectives that aid in training efficiency and iterative data\u0000collection for RLHF.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142252658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Agents significantly enhance the capabilities of standalone Large Language Models (LLMs) by perceiving environments, making decisions, and executing actions. However, LLM agents still face challenges in tasks that require multiple decision-making steps. Estimating the value of actions in specific tasks is difficult when intermediate actions are neither appropriately rewarded nor penalized. In this paper, we propose leveraging a task-relevant Q-value model to guide action selection. Specifically, we first collect decision-making trajectories annotated with step-level Q values via Monte Carlo Tree Search (MCTS) and construct preference data. We then use another LLM to fit these preferences through step-level Direct Preference Optimization (DPO); the resulting model serves as the Q-value model. During inference, at each decision-making step, LLM agents select the action with the highest Q value before interacting with the environment. We apply our method to various open-source and API-based LLM agents, demonstrating that Q-value models significantly improve their performance. Notably, the performance of the agent built with Phi-3-mini-4k-instruct improved by 103% on WebShop and 75% on HotPotQA when enhanced with Q-value models, even surpassing GPT-4o-mini. Additionally, Q-value models offer several advantages, such as generalization to different LLM agents and seamless integration with existing prompting strategies.
{"title":"Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models","authors":"Yuanzhao Zhai, Tingkai Yang, Kele Xu, Feng Dawei, Cheng Yang, Bo Ding, Huaimin Wang","doi":"arxiv-2409.09345","DOIUrl":"https://doi.org/arxiv-2409.09345","url":null,"abstract":"Agents significantly enhance the capabilities of standalone Large Language\u0000Models (LLMs) by perceiving environments, making decisions, and executing\u0000actions. However, LLM agents still face challenges in tasks that require\u0000multiple decision-making steps. Estimating the value of actions in specific\u0000tasks is difficult when intermediate actions are neither appropriately rewarded\u0000nor penalized. In this paper, we propose leveraging a task-relevant Q-value\u0000model to guide action selection. Specifically, we first collect decision-making\u0000trajectories annotated with step-level Q values via Monte Carlo Tree Search\u0000(MCTS) and construct preference data. We then use another LLM to fit these\u0000preferences through step-level Direct Policy Optimization (DPO), which serves\u0000as the Q-value model. During inference, at each decision-making step, LLM\u0000agents select the action with the highest Q value before interacting with the\u0000environment. We apply our method to various open-source and API-based LLM\u0000agents, demonstrating that Q-value models significantly improve their\u0000performance. Notably, the performance of the agent built with\u0000Phi-3-mini-4k-instruct improved by 103% on WebShop and 75% on HotPotQA when\u0000enhanced with Q-value models, even surpassing GPT-4o-mini. Additionally,\u0000Q-value models offer several advantages, such as generalization to different\u0000LLM agents and seamless integration with existing prompting strategies.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142252699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reinforcement Learning has revolutionized decision-making processes in dynamic environments, yet it often struggles with autonomously detecting and achieving goals without clear feedback signals. For example, in a Source Term Estimation problem, the lack of precise environmental information makes it challenging to provide clear feedback signals and to define and evaluate how the source's location is determined. To address this challenge, the Autonomous Goal Detection and Cessation (AGDC) module was developed, enhancing various RL algorithms by incorporating a self-feedback mechanism for autonomous goal detection and cessation upon task completion. Our method effectively detects otherwise undefined goals, and ceases activity once they are achieved, by approximating the agent's belief, significantly enhancing the capabilities of RL algorithms in environments with limited feedback. To validate the effectiveness of our approach, we integrated AGDC with deep Q-Network, proximal policy optimization, and deep deterministic policy gradient algorithms, and evaluated its performance on the Source Term Estimation problem. The experimental results showed that AGDC-enhanced RL algorithms significantly outperformed traditional statistical methods such as infotaxis, entrotaxis, and dual control for exploitation and exploration, as well as a non-statistical random action selection method. These improvements were evident in terms of success rate, mean traveled distance, and search time, highlighting AGDC's effectiveness and efficiency in complex, real-world scenarios.
{"title":"Autonomous Goal Detection and Cessation in Reinforcement Learning: A Case Study on Source Term Estimation","authors":"Yiwei Shi, Muning Wen, Qi Zhang, Weinan Zhang, Cunjia Liu, Weiru Liu","doi":"arxiv-2409.09541","DOIUrl":"https://doi.org/arxiv-2409.09541","url":null,"abstract":"Reinforcement Learning has revolutionized decision-making processes in\u0000dynamic environments, yet it often struggles with autonomously detecting and\u0000achieving goals without clear feedback signals. For example, in a Source Term\u0000Estimation problem, the lack of precise environmental information makes it\u0000challenging to provide clear feedback signals and to define and evaluate how\u0000the source's location is determined. To address this challenge, the Autonomous\u0000Goal Detection and Cessation (AGDC) module was developed, enhancing various RL\u0000algorithms by incorporating a self-feedback mechanism for autonomous goal\u0000detection and cessation upon task completion. Our method effectively identifies\u0000and ceases undefined goals by approximating the agent's belief, significantly\u0000enhancing the capabilities of RL algorithms in environments with limited\u0000feedback. To validate effectiveness of our approach, we integrated AGDC with\u0000deep Q-Network, proximal policy optimization, and deep deterministic policy\u0000gradient algorithms, and evaluated its performance on the Source Term\u0000Estimation problem. The experimental results showed that AGDC-enhanced RL\u0000algorithms significantly outperformed traditional statistical methods such as\u0000infotaxis, entrotaxis, and dual control for exploitation and exploration, as\u0000well as a non-statistical random action selection method. These improvements\u0000were evident in terms of success rate, mean traveled distance, and search time,\u0000highlighting AGDC's effectiveness and efficiency in complex, real-world\u0000scenarios.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142268784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Job Shop Scheduling Problem (JSP) is central to operations research, and optimizing it for energy efficiency carries profound environmental and economic implications. Efficient scheduling enhances production metrics and mitigates energy consumption, thus effectively balancing productivity and sustainability objectives. Given the intricate and diverse nature of JSP instances, along with the array of algorithms developed to tackle these challenges, an intelligent algorithm selection tool becomes paramount. This paper introduces a framework designed to identify key problem features that characterize an instance's complexity and guide the selection of suitable algorithms. Leveraging machine learning techniques, particularly XGBoost, the framework recommends optimal solvers such as GUROBI, CPLEX, and GECODE for efficient JSP scheduling. GUROBI excels with smaller instances, while GECODE demonstrates robust scalability for complex scenarios. The proposed algorithm selector achieves an accuracy of 84.51% in recommending the best algorithm for solving new JSP instances, highlighting its efficacy in algorithm selection. By refining feature extraction methodologies, the framework aims to broaden its applicability across diverse JSP scenarios, thereby advancing efficiency and sustainability in manufacturing logistics.
{"title":"Developing an Algorithm Selector for Green Configuration in Scheduling Problems","authors":"Carlos March, Christian Perez, Miguel A. Salido","doi":"arxiv-2409.08641","DOIUrl":"https://doi.org/arxiv-2409.08641","url":null,"abstract":"The Job Shop Scheduling Problem (JSP) is central to operations research,\u0000primarily optimizing energy efficiency due to its profound environmental and\u0000economic implications. Efficient scheduling enhances production metrics and\u0000mitigates energy consumption, thus effectively balancing productivity and\u0000sustainability objectives. Given the intricate and diverse nature of JSP\u0000instances, along with the array of algorithms developed to tackle these\u0000challenges, an intelligent algorithm selection tool becomes paramount. This\u0000paper introduces a framework designed to identify key problem features that\u0000characterize its complexity and guide the selection of suitable algorithms.\u0000Leveraging machine learning techniques, particularly XGBoost, the framework\u0000recommends optimal solvers such as GUROBI, CPLEX, and GECODE for efficient JSP\u0000scheduling. GUROBI excels with smaller instances, while GECODE demonstrates\u0000robust scalability for complex scenarios. The proposed algorithm selector\u0000achieves an accuracy of 84.51% in recommending the best algorithm for solving\u0000new JSP instances, highlighting its efficacy in algorithm selection. By\u0000refining feature extraction methodologies, the framework aims to broaden its\u0000applicability across diverse JSP scenarios, thereby advancing efficiency and\u0000sustainability in manufacturing logistics.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142252707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Post-training large language models (LLMs) to develop reasoning capabilities has proven effective across diverse domains, such as mathematical reasoning and code generation. However, existing methods primarily focus on improving task-specific reasoning but have not adequately addressed the model's generalization capabilities across a broader range of reasoning tasks. To tackle this challenge, we introduce Critical Planning Step Learning (CPL), which leverages Monte Carlo Tree Search (MCTS) to explore diverse planning steps in multi-step reasoning tasks. Based on long-term outcomes, CPL learns step-level planning preferences to improve the model's planning capabilities and, consequently, its general reasoning capabilities. Furthermore, while effective in many scenarios for aligning LLMs, existing preference learning approaches like Direct Preference Optimization (DPO) struggle with complex multi-step reasoning tasks due to their inability to capture fine-grained supervision at each step. We propose Step-level Advantage Preference Optimization (Step-APO), which integrates an advantage estimate for step-level preference pairs obtained via MCTS into the DPO. This enables the model to more effectively learn critical intermediate planning steps, thereby further improving its generalization in reasoning tasks. Experimental results demonstrate that our method, trained exclusively on GSM8K and MATH, not only significantly improves performance on GSM8K (+10.5%) and MATH (+6.5%), but also enhances out-of-domain reasoning benchmarks, such as ARC-C (+4.0%), BBH (+1.8%), MMLU-STEM (+2.2%), and MMLU (+0.9%).
{"title":"CPL: Critical Planning Step Learning Boosts LLM Generalization in Reasoning Tasks","authors":"Tianlong Wang, Xueting Han, Jing Bai","doi":"arxiv-2409.08642","DOIUrl":"https://doi.org/arxiv-2409.08642","url":null,"abstract":"Post-training large language models (LLMs) to develop reasoning capabilities\u0000has proven effective across diverse domains, such as mathematical reasoning and\u0000code generation. However, existing methods primarily focus on improving\u0000task-specific reasoning but have not adequately addressed the model's\u0000generalization capabilities across a broader range of reasoning tasks. To\u0000tackle this challenge, we introduce Critical Planning Step Learning (CPL),\u0000which leverages Monte Carlo Tree Search (MCTS) to explore diverse planning\u0000steps in multi-step reasoning tasks. Based on long-term outcomes, CPL learns\u0000step-level planning preferences to improve the model's planning capabilities\u0000and, consequently, its general reasoning capabilities. Furthermore, while\u0000effective in many scenarios for aligning LLMs, existing preference learning\u0000approaches like Direct Preference Optimization (DPO) struggle with complex\u0000multi-step reasoning tasks due to their inability to capture fine-grained\u0000supervision at each step. We propose Step-level Advantage Preference\u0000Optimization (Step-APO), which integrates an advantage estimate for step-level\u0000preference pairs obtained via MCTS into the DPO. This enables the model to more\u0000effectively learn critical intermediate planning steps, thereby further\u0000improving its generalization in reasoning tasks. Experimental results\u0000demonstrate that our method, trained exclusively on GSM8K and MATH, not only\u0000significantly improves performance on GSM8K (+10.5%) and MATH (+6.5%), but also\u0000enhances out-of-domain reasoning benchmarks, such as ARC-C (+4.0%), BBH\u0000(+1.8%), MMLU-STEM (+2.2%), and MMLU (+0.9%).","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142252820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Competency question (CQ) formulation is central to several ontology development and evaluation methodologies. Traditionally, the task of crafting these competency questions relies heavily on the effort of domain experts and knowledge engineers, a process that is often time-consuming and labor-intensive. With the emergence of Large Language Models (LLMs), it becomes possible to automate and enhance this process. Unlike other similar works that use existing ontologies or knowledge graphs as input to LLMs, we present a retrieval-augmented generation (RAG) approach that uses LLMs for the automatic generation of CQs given a set of scientific papers considered to be a domain knowledge base. We investigate its performance and, specifically, study the impact of the number of papers supplied to the RAG and of different temperature settings of the LLM. We conduct experiments using GPT-4 on two domain ontology engineering tasks and compare results against ground-truth CQs constructed by domain experts. Empirical assessments of the results, using evaluation metrics (precision and consistency), reveal that compared to zero-shot prompting, adding relevant domain knowledge through the RAG improves the performance of LLMs on generating CQs for concrete ontology engineering tasks.
{"title":"A RAG Approach for Generating Competency Questions in Ontology Engineering","authors":"Xueli Pan, Jacco van Ossenbruggen, Victor de Boer, Zhisheng Huang","doi":"arxiv-2409.08820","DOIUrl":"https://doi.org/arxiv-2409.08820","url":null,"abstract":"Competency question (CQ) formulation is central to several ontology\u0000development and evaluation methodologies. Traditionally, the task of crafting\u0000these competency questions heavily relies on the effort of domain experts and\u0000knowledge engineers which is often time-consuming and labor-intensive. With the\u0000emergence of Large Language Models (LLMs), there arises the possibility to\u0000automate and enhance this process. Unlike other similar works which use\u0000existing ontologies or knowledge graphs as input to LLMs, we present a\u0000retrieval-augmented generation (RAG) approach that uses LLMs for the automatic\u0000generation of CQs given a set of scientific papers considered to be a domain\u0000knowledge base. We investigate its performance and specifically, we study the\u0000impact of different number of papers to the RAG and different temperature\u0000setting of the LLM. We conduct experiments using GPT-4 on two domain ontology\u0000engineering tasks and compare results against ground-truth CQs constructed by\u0000domain experts. Empirical assessments on the results, utilizing evaluation\u0000metrics (precision and consistency), reveal that compared to zero-shot\u0000prompting, adding relevant domain knowledge to the RAG improves the performance\u0000of LLMs on generating CQs for concrete ontology engineering tasks.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142252700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This study investigates scheduling strategies for the stochastic resource-constrained project scheduling problem with maximal time lags (SRCPSP/max). Recent advances in Constraint Programming (CP) and Temporal Networks have renewed interest in evaluating the advantages and drawbacks of various proactive and reactive scheduling methods. First, we present a new, CP-based fully proactive method. Second, we show how a reactive approach can be constructed using an online rescheduling procedure. A third contribution is based on partial order schedules and uses Simple Temporal Networks with Uncertainty (STNUs). Our statistical analysis shows that the STNU-based algorithm performs best in terms of solution quality, while also showing competitive offline and online computation times.
{"title":"Proactive and Reactive Constraint Programming for Stochastic Project Scheduling with Maximal Time-Lags","authors":"Kim van den Houten, Léon Planken, Esteban Freydell, David M. J. Tax, Mathijs de Weerdt","doi":"arxiv-2409.09107","DOIUrl":"https://doi.org/arxiv-2409.09107","url":null,"abstract":"This study investigates scheduling strategies for the stochastic\u0000resource-constrained project scheduling problem with maximal time lags\u0000(SRCPSP/max)). Recent advances in Constraint Programming (CP) and Temporal\u0000Networks have reinvoked interest in evaluating the advantages and drawbacks of\u0000various proactive and reactive scheduling methods. First, we present a new,\u0000CP-based fully proactive method. Second, we show how a reactive approach can be\u0000constructed using an online rescheduling procedure. A third contribution is\u0000based on partial order schedules and uses Simple Temporal Networks with\u0000Uncertainty (STNUs). Our statistical analysis shows that the STNU-based\u0000algorithm performs best in terms of solution quality, while also showing good\u0000relative offline and online computation time.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142252771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large language models (LLMs) show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge since: (i) most benchmarks are limited to specific modalities or domains (e.g. text-only, web navigation, Q&A, coding) and (ii) full benchmark evaluations are slow (on the order of days) given the multi-step sequential nature of tasks. To address these challenges, we introduce the Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS) where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized in Azure for a full benchmark evaluation in as little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we also introduce a new multi-modal agent, Navi. Our agent achieves a success rate of 19.5% in the Windows domain, compared to 74.5% for an unassisted human. Navi also demonstrates strong performance on another popular web-based benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis of Navi's performance, and provide insights into the opportunities for future research in agent development and data generation using Windows Agent Arena. Webpage: https://microsoft.github.io/WindowsAgentArena Code: https://github.com/microsoft/WindowsAgentArena
{"title":"Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale","authors":"Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui","doi":"arxiv-2409.08264","DOIUrl":"https://doi.org/arxiv-2409.08264","url":null,"abstract":"Large language models (LLMs) show remarkable potential to act as computer\u0000agents, enhancing human productivity and software accessibility in multi-modal\u0000tasks that require planning and reasoning. However, measuring agent performance\u0000in realistic environments remains a challenge since: (i) most benchmarks are\u0000limited to specific modalities or domains (e.g. text-only, web navigation, Q&A,\u0000coding) and (ii) full benchmark evaluations are slow (on order of magnitude of\u0000days) given the multi-step sequential nature of tasks. To address these\u0000challenges, we introduce the Windows Agent Arena: a reproducible, general\u0000environment focusing exclusively on the Windows operating system (OS) where\u0000agents can operate freely within a real Windows OS and use the same wide range\u0000of applications, tools, and web browsers available to human users when solving\u0000tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse\u0000Windows tasks across representative domains that require agent abilities in\u0000planning, screen understanding, and tool usage. Our benchmark is scalable and\u0000can be seamlessly parallelized in Azure for a full benchmark evaluation in as\u0000little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we\u0000also introduce a new multi-modal agent, Navi. Our agent achieves a success rate\u0000of 19.5% in the Windows domain, compared to 74.5% performance of an unassisted\u0000human. Navi also demonstrates strong performance on another popular web-based\u0000benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis\u0000of Navi's performance, and provide insights into the opportunities for future\u0000research in agent development and data generation using Windows Agent Arena. Webpage: https://microsoft.github.io/WindowsAgentArena Code: https://github.com/microsoft/WindowsAgentArena","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}