Approximating the Value of Collaborative Team Actions for Efficient Multiagent Navigation in Uncertain Graphs
Pub Date: 2023-07-01, DOI: 10.1609/icaps.v33i1.27250
M. Stadler, Jacopo Banfi, Nicholas A. Roy
For a team of collaborative agents navigating through an unknown environment, collaborative actions such as sensing the traversability of a route can have a large impact on aggregate team performance. However, planning over the full space of joint team actions is generally computationally intractable. Furthermore, typically only a small number of collaborative actions are useful for a given team task, but it is not obvious how to assess the usefulness of a given action. In this work, we model collaborative team policies on stochastic graphs using macro-actions, where each macro-action for a given agent can consist of a sequence of movements, sensing actions, and actions of waiting to receive information from other agents. To reduce the number of macro-actions considered during planning, we generate optimistic approximations of candidate future team states, then restrict the planning domain to a small policy class consisting only of macro-actions that are likely to lead to high-reward future team states. We optimize team plans over this small policy class and demonstrate, in toy graph and island road network domains, that the approach enables a team to find policies that actively balance reducing task-relevant environmental uncertainty against efficiently navigating to goals, yielding better plans than policies that do not act to reduce environmental uncertainty.
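A minimal sketch of the pruning idea described in this abstract: score each candidate macro-action by an optimistic estimate of the future team state it could lead to, and keep only the most promising ones for planning. The state representation, the optimistic_value callable, and the keep_k cutoff are illustrative assumptions, not the authors' implementation.

```python
def restrict_policy_class(candidate_macro_actions, team_state,
                          optimistic_value, keep_k=5):
    """Keep only the macro-actions whose optimistically approximated
    future team state has the highest value (illustrative sketch)."""
    scored = [(optimistic_value(team_state, ma), ma)
              for ma in candidate_macro_actions]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best estimates first
    return [ma for _, ma in scored[:keep_k]]
```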
{"title":"Approximating the Value of Collaborative Team Actions for Efficient Multiagent Navigation in Uncertain Graphs","authors":"M. Stadler, Jacopo Banfi, Nicholas A. Roy","doi":"10.1609/icaps.v33i1.27250","DOIUrl":"https://doi.org/10.1609/icaps.v33i1.27250","url":null,"abstract":"For a team of collaborative agents navigating through an unknown environment, collaborative actions such as sensing the traversability of a route can have a large impact on aggregate team performance. However, planning over the full space of joint team actions is generally computationally intractable. Furthermore, typically only a small number of collaborative actions is useful for a given team task, but it is not obvious how to assess the usefulness of a given action. In this work, we model collaborative team policies on stochastic graphs using macro-actions, where each macro-action for a given agent can consist of a sequence of movements, sensing actions, and actions of waiting to receive information from other agents. To reduce the number of macro-actions considered during planning, we generate optimistic approximations of candidate future team states, then restrict the planning domain to a small policy class which consists of only macro-actions which are likely to lead to high-reward future team states. We optimize team plans over the small policy class, and demonstrate that the approach enables a team to find policies which actively balance between reducing task-relevant environmental uncertainty and efficiently navigating to goals in toy graph and island road network domains, finding better plans than policies that do not act to reduce environmental uncertainty.","PeriodicalId":239898,"journal":{"name":"International Conference on Automated Planning and Scheduling","volume":"382 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133455968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Goal Recognition with Timing Information
Pub Date: 2023-07-01, DOI: 10.1609/icaps.v33i1.27224
Chenyuan Zhang, Charles Kemp, N. Lipovetzky
Goal recognition has been extensively studied by AI researchers, but most algorithms take only observed actions as input. Here we argue that the time taken to carry out these actions provides an additional signal that supports goal recognition. We present a behavioral experiment confirming that people use timing information in this way, and we develop and evaluate a goal recognition algorithm that is sensitive to both actions and timing information. Our results on both synthetic and human data suggest that existing goal recognition algorithms can be improved by incorporating a model of planning time, and that these improvements can be substantial in scenarios in which relatively few actions have been observed.
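As a hedged illustration of how timing can be folded into goal recognition alongside actions, the sketch below combines per-goal likelihoods of the observed actions and of their durations in a simple Bayesian posterior; the likelihood models and prior are hypothetical stand-ins, not the paper's model of planning time.

```python
import math

def posterior_over_goals(goals, observed_actions, observed_times,
                         action_likelihood, timing_likelihood, prior):
    """P(goal | actions, times) proportional to
    P(actions | goal) * P(times | actions, goal) * P(goal)."""
    log_scores = {}
    for g in goals:
        log_p = math.log(prior(g))
        for a, t in zip(observed_actions, observed_times):
            log_p += math.log(action_likelihood(a, g))   # action evidence
            log_p += math.log(timing_likelihood(t, a, g))  # timing evidence
        log_scores[g] = log_p
    # Normalise in a numerically stable way.
    m = max(log_scores.values())
    unnorm = {g: math.exp(lp - m) for g, lp in log_scores.items()}
    z = sum(unnorm.values())
    return {g: p / z for g, p in unnorm.items()}
```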
{"title":"Goal Recognition with Timing Information","authors":"Chenyuan Zhang, Charles Kemp, N. Lipovetzky","doi":"10.1609/icaps.v33i1.27224","DOIUrl":"https://doi.org/10.1609/icaps.v33i1.27224","url":null,"abstract":"Goal recognition has been extensively studied by AI researchers, but most algorithms take only observed actions as input. Here we argue that the time taken to carry out these actions provides an additional signal that supports goal recognition. We present a behavioral experiment confirming that people use timing information in this way, and develop and evaluate a goal recognition algorithm that is sensitive to both actions and timing information. Our results suggest that existing goal recognition algorithms can be improved by incorporating a model of planning time on both synthetic data and human data, and that these improvements can be substantial in scenarios in which relatively few actions have been observed.","PeriodicalId":239898,"journal":{"name":"International Conference on Automated Planning and Scheduling","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123156162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel Batch Processing for the Coating Problem
Pub Date: 2023-07-01, DOI: 10.1609/icaps.v33i1.27192
M. Horn, Emir Demirovic, N. Yorke-Smith
We solve a challenging scheduling problem with parallel batch processing and two-dimensional shelf strip packing constraints that arises in the tool coating field. Tools are assembled on so-called planetaries (batches) before they are loaded into coating machines to be coated. The assembly is not trivial and must fulfil specific constraints, which we refer to as shelf strip packing constraints. Further, each tool is associated with a starting time window such that tools can only be put on the same planetary if their time windows overlap. The objective is to minimise the makespan and the number of required planetaries. Since the problem naturally decomposes into scheduling and packing parts, we tackle it with a two-phase logic-based Benders decomposition approach. The master problem assigns items to batches. The first phase solves the packing subproblem, checking whether the assignment is feasible, while the second phase solves the scheduling subproblem. The approach is compared with a monolithic mixed integer linear programming approach as well as a monolithic constraint programming approach. Experimental evaluation shows that our proposed approach outperforms the state-of-the-art benchmarks by solving more instances to optimality in a shorter time.
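The decomposition loop can be pictured roughly as follows; the callables for the master problem, the packing check, the cut generation, and the scheduling subproblem stand in for the MILP/CP components and are assumptions about the interfaces, not the authors' code.

```python
def two_phase_benders(instance, solve_master, packing_feasible, packing_cut,
                      solve_scheduling, max_iters=1000):
    """Schematic logic-based Benders loop: assign items to batches, reject
    assignments whose batches cannot be packed, then schedule the batches."""
    cuts = []
    for _ in range(max_iters):
        assignment = solve_master(instance, cuts)   # items -> batches
        # Phase 1: every batch must admit a feasible shelf strip packing.
        bad = [batch for batch in assignment if not packing_feasible(batch)]
        if bad:
            cuts.extend(packing_cut(batch) for batch in bad)
            continue
        # Phase 2: schedule the packable batches on the coating machines.
        schedule, scheduling_cuts = solve_scheduling(instance, assignment)
        if not scheduling_cuts:
            return assignment, schedule             # no cuts: proven optimal
        cuts.extend(scheduling_cuts)
    return None
```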
{"title":"Parallel Batch Processing for the Coating Problem","authors":"M. Horn, Emir Demirovic, N. Yorke-Smith","doi":"10.1609/icaps.v33i1.27192","DOIUrl":"https://doi.org/10.1609/icaps.v33i1.27192","url":null,"abstract":"We solve a challenging scheduling problem with parallel batch processing and two-dimensional shelf strip packing constraints that arises in the tool coating field. Tools are assembled on so-called planetaries (batches) before they are loaded into coating machines to get coated. The assembling is not trivial and must fulfil specific constraints, which we refer to as shelf strip packing constraints. Further, each tool is associated with a starting time window s.t. tools can only be put on the same planetary if their time window overlap. The objective is to minimise the makespan and the number of required planetaries. Since the problem naturally decomposes into scheduling and packing parts, we tackle the problem with a two-phase logic-based Benders decomposition approach. The master problem assigns items to batches. The first phase solves as subproblem the packing problem by checking if the assignment is feasible, whereas the second phase solves the scheduling subproblem. The approach is compared with a monolithic mixed integer linear programming approach as well as a monolithic constraint programming approach. Experimental evaluation shows that our proposed approach outperforms the state-of-the-art benchmarks by solving more instances to optimality in a shorter time.","PeriodicalId":239898,"journal":{"name":"International Conference on Automated Planning and Scheduling","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133307709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Solving the Multi-Choice Two Dimensional Shelf Strip Packing Problem with Time Windows
Pub Date: 2023-07-01, DOI: 10.1609/icaps.v33i1.27229
M. Horn, Emir Demirovic, N. Yorke-Smith
In the tool coating field, scheduling of production lines requires solving an optimisation problem which we call the multi-choice two-dimensional shelf strip packing problem with time windows. A set of rectangular items needs to be packed in two stages: items are placed on shelves, which in turn are placed on one of several available strips. Crucially, an item's width depends on the chosen strip, and each item is associated with a time window such that items can only be placed on the same shelf if their time windows overlap. In collaboration with an industrial partner, we tackle this real-world optimisation problem with both exact and heuristic methods. The exact method is an arc-flow-based integer linear programming formulation, solved with the commercial solver CPLEX. Experimental evaluation shows that this approach can solve instances with up to 20 different item sizes to proven optimality. Larger, more realistic instances are solved heuristically by an adaptive large neighbourhood search, using first-fit and best-fit decreasing approaches as repair heuristics. In this way, we obtain high-quality solutions with a remaining optimality gap below 3.3% for instances with up to 2000 different item sizes. The work reported is due to be incorporated into an end-to-end decision support system with the industrial partner.
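As a small, self-contained illustration of one of the repair heuristics mentioned above, the sketch below packs items onto shelves with first-fit decreasing by height; strip choice, strip-dependent widths, and time windows from the actual problem are deliberately omitted, so this is a toy version rather than the authors' repair operator.

```python
def first_fit_decreasing_shelves(items, strip_width):
    """items: list of (width, height). Sort by decreasing height and place
    each item on the first shelf with enough remaining width; open a new
    shelf otherwise. Shelf height is set by its first (tallest) item."""
    shelves = []
    for w, h in sorted(items, key=lambda it: it[1], reverse=True):
        for shelf in shelves:
            if shelf["used"] + w <= strip_width:
                shelf["used"] += w
                shelf["items"].append((w, h))
                break
        else:  # no existing shelf fits: open a new one
            shelves.append({"height": h, "used": w, "items": [(w, h)]})
    return shelves

# Example: pack five items on shelves of width 10 (two shelves result).
print(first_fit_decreasing_shelves([(4, 3), (6, 5), (3, 2), (5, 5), (2, 1)], 10))
```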
{"title":"Solving the Multi-Choice Two Dimensional Shelf Strip Packing Problem with Time Windows","authors":"M. Horn, Emir Demirovic, N. Yorke-Smith","doi":"10.1609/icaps.v33i1.27229","DOIUrl":"https://doi.org/10.1609/icaps.v33i1.27229","url":null,"abstract":"In the tool coating field, scheduling of production lines requires solving an optimisation problem which we call the multi-choice two-dimensional shelf strip packing problem with time windows. A set of rectangular items needs to be packed in two stages: items are placed on shelves, which in turn are placed on one of several available strips. Crucially, the item's width depends on the chosen strip and each item is associated with a time window such that items can only be placed on the same shelf if their time windows overlap. In collaboration with an industrial partner, this real-world optimisation problem is tackled in this paper by both exact and heuristic methods. The exact method is an arc-flow-based integer linear programming formulation, solved with the commercial solver CPLEX. Experimental evaluation shows that this approach can solve instances to proven optimality with up to 20 different item sizes. Larger, more realistic instances are solved heuristically by an adaptive large neighbourhood search, using first fit and best fit decreasing approaches as repair heuristics. In this way, we obtain high-quality solutions with a remaining optimality gap below 3.3% for instances with up to 2000 different item sizes. The work reported is due to be incorporated into an end-to-end decision support system with the industrial partner.","PeriodicalId":239898,"journal":{"name":"International Conference on Automated Planning and Scheduling","volume":"12 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114093339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Finding Matrix Multiplication Algorithms with Classical Planning
Pub Date: 2023-07-01, DOI: 10.1609/icaps.v33i1.27220
David Speck, Paul Höft, Daniel Gnad, Jendrik Seipp
Matrix multiplication is a fundamental operation of linear algebra, with applications ranging from quantum physics to artificial intelligence. Given its importance, enormous resources have been invested in the search for faster matrix multiplication algorithms. Recently, this search has been cast as a single-player game. By learning how to play this game efficiently, the newly introduced AlphaTensor reinforcement learning agent is able to discover many new, faster algorithms. In this paper, we show that finding matrix multiplication algorithms can also be cast as a classical planning problem. Based on this observation, we introduce a challenging benchmark suite for classical planning and evaluate state-of-the-art planning techniques on it. We analyze the strengths and limitations of different planning approaches in this domain and show that we can use classical planning to find lower bounds and concrete algorithms for matrix multiplication.
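For context, the kind of object such a search ultimately targets is a scheme like Strassen's, which multiplies two 2x2 matrices with seven multiplications instead of eight. The snippet below merely verifies Strassen's classical scheme; it is not the planning encoding used in the paper.

```python
import numpy as np

def strassen_2x2(A, B):
    """Strassen's seven-multiplication scheme for a 2x2 matrix product."""
    a, b, c, d = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    e, f, g, h = B[0, 0], B[0, 1], B[1, 0], B[1, 1]
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4, m1 - m2 + m3 + m6]])

A, B = np.random.rand(2, 2), np.random.rand(2, 2)
assert np.allclose(strassen_2x2(A, B), A @ B)  # 7 multiplications, same result
```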
{"title":"Finding Matrix Multiplication Algorithms with Classical Planning","authors":"David Speck, Paul Höft, Daniel Gnad, Jendrik Seipp","doi":"10.1609/icaps.v33i1.27220","DOIUrl":"https://doi.org/10.1609/icaps.v33i1.27220","url":null,"abstract":"Matrix multiplication is a fundamental operation of linear algebra, with applications ranging from quantum physics to artificial intelligence. Given its importance, enormous resources have been invested in the search for faster matrix multiplication algorithms. Recently, this search has been cast as a single-player game. By learning how to play this game efficiently, the newly-introduced AlphaTensor reinforcement learning agent is able to discover many new faster algorithms. In this paper, we show that finding matrix multiplication algorithms can also be cast as a classical planning problem. Based on this observation, we introduce a challenging benchmark suite for classical planning and evaluate state-of-the-art planning techniques on it. We analyze the strengths and limitations of different planning approaches in this domain and show that we can use classical planning to find lower bounds and concrete algorithms for matrix multiplication.","PeriodicalId":239898,"journal":{"name":"International Conference on Automated Planning and Scheduling","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126296536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automaton-Guided Curriculum Generation for Reinforcement Learning Agents
Pub Date: 2023-04-11, DOI: 10.48550/arXiv.2304.05271
Yash Shukla, A. Kulkarni, R. Wright, Alvaro Velasquez, J. Sinapov
Despite advances in Reinforcement Learning, many sequential decision-making tasks remain prohibitively expensive and impractical to learn. Recently, approaches that automatically generate reward functions from logical task specifications have been proposed to mitigate this issue; however, they scale poorly on long-horizon tasks (i.e., tasks where the agent needs to perform a series of correct actions to reach the goal state, considering future transitions while choosing an action). Employing a curriculum (a sequence of increasingly complex tasks) further improves the learning speed of the agent by sequencing intermediate tasks suited to the learning capacity of the agent. However, generating curricula from the logical specification remains an open problem. To this end, we propose AGCL, Automaton-guided Curriculum Learning, a novel method for automatically generating curricula for the target task in the form of Directed Acyclic Graphs (DAGs). AGCL encodes the specification in the form of a deterministic finite automaton (DFA), and then uses the DFA along with the Object-Oriented MDP (OOMDP) representation to generate a curriculum as a DAG, where vertices correspond to tasks and edges correspond to the direction of knowledge transfer. Experiments in gridworld and physics-based simulated robotics domains show that the curricula produced by AGCL achieve improved time-to-threshold performance on a complex sequential decision-making problem relative to state-of-the-art curriculum learning (e.g., teacher-student, self-play) and automaton-guided reinforcement learning baselines (e.g., Q-Learning for Reward Machines). Further, we demonstrate that AGCL performs well even in the presence of noise in the task's OOMDP description, and also when distractor objects are present that are not modeled in the logical specification of the tasks' objectives.
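A minimal sketch of the DFA ingredient: a deterministic finite automaton for a toy specification "reach a, then b" and a check of whether a trace of event labels satisfies it. The dictionary encoding and the self-loop on irrelevant events are illustrative choices; the curriculum-DAG construction built on top of such a DFA is not shown here.

```python
# Toy DFA for the specification "observe event a, then event b".
DFA = {
    "start": "q0",
    "accepting": {"q2"},
    "delta": {
        ("q0", "a"): "q1",
        ("q1", "b"): "q2",
    },
}

def accepts(dfa, trace):
    """Run the DFA over a trace of event labels; unlisted events self-loop."""
    state = dfa["start"]
    for symbol in trace:
        state = dfa["delta"].get((state, symbol), state)
    return state in dfa["accepting"]

print(accepts(DFA, ["c", "a", "b"]))  # True: a happened, then b
print(accepts(DFA, ["b", "a"]))       # False: b came before a
```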
{"title":"Automaton-Guided Curriculum Generation for Reinforcement Learning Agents","authors":"Yash Shukla, A. Kulkarni, R. Wright, Alvaro Velasquez, J. Sinapov","doi":"10.48550/arXiv.2304.05271","DOIUrl":"https://doi.org/10.48550/arXiv.2304.05271","url":null,"abstract":"Despite advances in Reinforcement Learning, many sequential decision making tasks remain prohibitively expensive and impractical to learn. Recently, approaches that automatically generate reward functions from logical task specifications have been proposed to mitigate this issue; however, they scale poorly on long-horizon tasks (i.e., tasks where the agent needs to perform a series of correct actions to reach the goal state, considering future transitions while choosing an action). Employing a curriculum (a sequence of increasingly complex tasks) further improves the learning speed of the agent by sequencing intermediate tasks suited to the learning capacity of the agent. However, generating curricula from the logical specification still remains an unsolved problem. To this end, we propose AGCL, Automaton-guided Curriculum Learning, a novel method for automatically generating curricula for the target task in the form of Directed Acyclic Graphs (DAGs). AGCL encodes the specification in the form of a deterministic finite automaton (DFA), and then uses the DFA along with the Object-Oriented MDP (OOMDP) representation to generate a curriculum as a DAG, where the vertices correspond to tasks, and edges correspond to the direction of knowledge transfer. Experiments in gridworld and physics-based simulated robotics domains show that the curricula produced by AGCL achieve improved time-to-threshold performance on a complex sequential decision-making problem relative to state-of-the-art curriculum learning (e.g, teacher-student, self-play) and automaton-guided reinforcement learning baselines (e.g, Q-Learning for Reward Machines). Further, we demonstrate that AGCL performs well even in the presence of noise in the task's OOMDP description, and also when distractor objects are present that are not modeled in the logical specification of the tasks' objectives.","PeriodicalId":239898,"journal":{"name":"International Conference on Automated Planning and Scheduling","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122566494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Planning for Manipulation among Movable Objects: Deciding Which Objects Go Where, in What Order, and How
Pub Date: 2023-03-23, DOI: 10.48550/arXiv.2303.13385
D. Saxena, M. Likhachev
We are interested in pick-and-place style robot manipulation tasks in cluttered and confined 3D workspaces among movable objects that may be rearranged by the robot and may slide, tilt, lean or topple. A recently proposed algorithm, M4M, determines which objects need to be moved and where by solving a Multi-Agent Pathfinding (MAPF) abstraction of this problem. It then utilises a nonprehensile push planner to compute actions for how the robot might realise these rearrangements and a rigid body physics simulator to check whether the actions satisfy physics constraints encoded in the problem. However, M4M greedily commits to valid pushes found during planning and does not reason about orderings over pushes if multiple objects need to be rearranged. Furthermore, M4M does not reason about other possible MAPF solutions that lead to different rearrangements and pushes. In this paper, we extend M4M and present Enhanced-M4M (E-M4M), a systematic graph-search-based solver that searches over orderings of pushes for movable objects that need to be rearranged as well as over different possible rearrangements of the scene. We introduce several algorithmic optimisations to circumvent the increased computational complexity, discuss the space of problems solvable by E-M4M, and show experimentally, both on a real robot and in simulation, that it significantly outperforms the original M4M algorithm as well as other state-of-the-art alternatives when dealing with complex scenes.
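To make "orderings over pushes" concrete, the toy sketch below enumerates orderings of the objects that must be rearranged and accepts the first one whose pushes are all feasible. The feasible_push callable stands in for the push planner plus physics check, and E-M4M itself performs a systematic graph search rather than this brute-force enumeration.

```python
from itertools import permutations

def find_push_ordering(objects_to_move, feasible_push):
    """Return the first ordering in which every push is feasible given the
    objects already moved, or None if no ordering works."""
    for ordering in permutations(objects_to_move):
        moved = []
        ok = True
        for obj in ordering:
            if not feasible_push(obj, moved):  # push planner + physics check
                ok = False
                break
            moved.append(obj)
        if ok:
            return list(ordering)
    return None
```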
{"title":"Planning for Manipulation among Movable Objects: Deciding Which Objects Go Where, in What Order, and How","authors":"D. Saxena, M. Likhachev","doi":"10.48550/arXiv.2303.13385","DOIUrl":"https://doi.org/10.48550/arXiv.2303.13385","url":null,"abstract":"We are interested in pick-and-place style robot manipulation tasks in cluttered and confined 3D workspaces among movable objects that may be rearranged by the robot and may slide, tilt, lean or topple. A recently proposed algorithm, M4M, determines which objects need to be moved and where by solving a Multi-Agent Pathfinding (MAPF) abstraction of this problem. It then utilises a nonprehensile push planner to compute actions for how the robot might realise these rearrangements and a rigid body physics simulator to check whether the actions satisfy physics constraints encoded in the problem. However, M4M greedily commits to valid pushes found during planning, and does not reason about orderings over pushes if multiple objects need to be rearranged. Furthermore, M4M does not reason about other possible MAPF solutions that lead to different rearrangements and pushes. In this paper, we extend M4M and present Enhanced-M4M (E-M4M) -- a systematic graph search-based solver that searches over orderings of pushes for movable objects that need to be rearranged and different possible rearrangements of the scene. We introduce several algorithmic optimisations to circumvent the increased computational complexity, discuss the space of problems solvable by E-M4M and show that experimentally, both on the real robot and in simulation, it significantly outperforms the original M4M algorithm, as well as other state-of-the-art alternatives when dealing with complex scenes.","PeriodicalId":239898,"journal":{"name":"International Conference on Automated Planning and Scheduling","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116200852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reinforcement Learning for Omega-Regular Specifications on Continuous-Time MDP
Pub Date: 2023-03-16, DOI: 10.48550/arXiv.2303.09528
A. Falah, Shibashis Guha, Ashutosh Trivedi
Continuous-time Markov decision processes (CTMDPs) are canonical models for expressing sequential decision-making in dense-time, stochastic environments. When the stochastic evolution of the environment is only available via sampling, model-free reinforcement learning (RL) is the algorithm of choice for computing optimal decision sequences. RL, however, requires the learning objective to be encoded as scalar reward signals. Since doing such translations manually is both tedious and error-prone, a number of techniques have been proposed to translate high-level objectives (expressed in logic or automata formalisms) into scalar rewards for discrete-time Markov decision processes. Unfortunately, no automatic translation exists for CTMDPs. We consider CTMDP environments with learning objectives expressed as omega-regular languages. Omega-regular languages generalize regular languages to infinite-horizon specifications and can express properties given in the popular linear-time logic LTL. To accommodate the dense-time nature of CTMDPs, we consider two different semantics of omega-regular objectives: 1) satisfaction semantics, where the goal of the learner is to maximize the probability of spending positive time in the good states, and 2) expectation semantics, where the goal of the learner is to optimize the long-run expected average time spent in the "good states" of the automaton. We present an approach enabling correct translation to scalar reward signals that can be readily used by off-the-shelf RL algorithms for CTMDPs. We demonstrate the effectiveness of the proposed algorithms by evaluating them on some popular CTMDP benchmarks with omega-regular objectives.
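One common way to write a long-run average objective of the kind described by the expectation semantics (a hedged reading, not necessarily the paper's exact definition) is:

```latex
\max_{\pi}\; \liminf_{T \to \infty} \frac{1}{T}\,
\mathbb{E}^{\pi}\!\left[\int_{0}^{T} \mathbf{1}\{s_t \in \mathit{Good}\}\,\mathrm{d}t\right]
```

where the indicator is 1 whenever the state of the product of the CTMDP and the automaton lies in the good states at time t.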
{"title":"Reinforcement Learning for Omega-Regular Specifications on Continuous-Time MDP","authors":"A. Falah, Shibashis Guha, Ashutosh Trivedi","doi":"10.48550/arXiv.2303.09528","DOIUrl":"https://doi.org/10.48550/arXiv.2303.09528","url":null,"abstract":"Continuous-time Markov decision processes (CTMDPs) are canonical models to express sequential decision-making under dense-time and stochastic environments. When the stochastic evolution of the environment is only available via sampling, model-free reinforcement learning (RL) is the algorithm-of-choice to compute optimal decision sequence. RL, on the other hand, requires the learning objective to be encoded as scalar reward signals. Since doing such translations manually is both tedious and error-prone, a number of techniques have been proposed to translate high-level objectives (expressed in logic or automata formalism) to scalar rewards for discrete-time Markov decision processes. Unfortunately, no automatic translation exists for CTMDPs.\u0000 \u0000 We consider CTMDP environments against the learning objectives expressed as omega-regular languages. Omega-regular languages generalize regular languages to infinite-horizon specifications and can express properties given in popular linear-time logic LTL. To accommodate the dense-time nature of CTMDPs, we consider two different semantics of omega-regular objectives: 1) satisfaction semantics where the goal of the learner is to maximize the probability of spending positive time in the good states, and 2) expectation semantics where the goal of the learner is to optimize the long-run expected average time spent in the ''good states'' of the automaton. We present an approach enabling correct translation to scalar reward signals that can be readily used by off-the-shelf RL algorithms for CTMDPs. We demonstrate the effectiveness of the proposed algorithms by evaluating it on some popular CTMDP benchmarks with omega-regular objectives.","PeriodicalId":239898,"journal":{"name":"International Conference on Automated Planning and Scheduling","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122981566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Act-Then-Measure: Reinforcement Learning for Partially Observable Environments with Active Measuring
Pub Date: 2023-03-14, DOI: 10.48550/arXiv.2303.08271
Merlijn Krale, T. D. Simão, N. Jansen
We study Markov decision processes (MDPs) in which agents control when and how they gather information, as formalized by action-contingent noiselessly observable MDPs (ACNO-MDPs). In these models, actions have two components: a control action that influences how the environment changes and a measurement action that affects the agent's observation. To solve ACNO-MDPs, we introduce the act-then-measure (ATM) heuristic, which assumes that we can ignore future state uncertainty when choosing control actions. To decide whether or not to measure, we introduce the concept of measuring value. We show how following this heuristic may lead to shorter policy computation times and prove a bound on the performance loss it incurs. We develop a reinforcement learning algorithm based on the ATM heuristic, using a Dyna-Q variant adapted for partially observable domains, and showcase its superior performance compared to prior methods on a number of partially observable environments.
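A schematic sketch of an act-then-measure style decision rule: pick the control action that looks best under the current belief while ignoring future state uncertainty, then measure only if a value-of-information estimate exceeds the measurement cost. The names and the particular estimate below are illustrative assumptions, not the paper's exact definitions of the heuristic or of measuring value.

```python
def act_then_measure(belief, actions, q_values, measure_cost):
    """belief: dict state -> probability; actions: list of control actions;
    q_values(s, a): estimated return of taking a in state s."""
    # Control action: best expected Q-value under the current belief.
    def expected_q(a):
        return sum(p * q_values(s, a) for s, p in belief.items())
    control = max(actions, key=expected_q)

    # Rough value of measuring: how much better we could do per state if we
    # knew it exactly, compared to committing to the belief-optimal action.
    value_of_measuring = sum(
        p * (max(q_values(s, a) for a in actions) - q_values(s, control))
        for s, p in belief.items()
    )
    measure = value_of_measuring > measure_cost
    return control, measure
```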
{"title":"Act-Then-Measure: Reinforcement Learning for Partially Observable Environments with Active Measuring","authors":"Merlijn Krale, T. D. Simão, N. Jansen","doi":"10.48550/arXiv.2303.08271","DOIUrl":"https://doi.org/10.48550/arXiv.2303.08271","url":null,"abstract":"We study Markov decision processes (MDPs), where agents control when and how they gather information, as formalized by action-contingent noiselessly observable MDPs (ACNO-MPDs). In these models, actions have two components: a control action that influences how the environment changes and a measurement action that affects the agent's observation. To solve ACNO-MDPs, we introduce the act-then-measure (ATM) heuristic, which assumes that we can ignore future state uncertainty when choosing control actions. To decide whether or not to measure, we introduce the concept of measuring value. We show how following this heuristic may lead to shorter policy computation times and prove a bound on the performance loss it incurs. We develop a reinforcement learning algorithm based on the ATM heuristic, using a Dyna-Q variant adapted for partially observable domains, and showcase its superior performance compared to prior methods on a number of partially-observable environments.","PeriodicalId":239898,"journal":{"name":"International Conference on Automated Planning and Scheduling","volume":"160 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124642646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Explainable Goal Recognition: A Framework Based on Weight of Evidence
Pub Date: 2023-03-09, DOI: 10.48550/arXiv.2303.05622
Abeer Alshehri, Tim Miller, Mor Vered
We introduce and evaluate an eXplainable goal recognition (XGR) model that uses the Weight of Evidence (WoE) framework to explain goal recognition problems. Our model provides human-centered explanations that answer 'why?' and 'why not?' questions. We computationally evaluate the performance of our system over eight different goal recognition domains, showing that it does not significantly increase the underlying recognition run time. Using a behavioral study to obtain ground truth from human annotators, we further show that the XGR model can successfully generate human-like explanations. We then report on a study with 40 participants who observe agents playing a Sokoban game and then receive explanations of the goal recognition output. We investigated participants' understanding of the explanations through task prediction, explanation satisfaction, and trust.
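For reference, the weight of evidence that an observation e lends to a goal g is commonly defined as log[P(e | g) / P(e | not g)]; positive values speak for g and negative values against it. The numbers below are toy likelihoods used only to illustrate the computation, not data from the paper.

```python
import math

def weight_of_evidence(p_e_given_g, p_e_given_not_g):
    """WoE(g : e) = log [ P(e | g) / P(e | not g) ]."""
    return math.log(p_e_given_g / p_e_given_not_g)

# Observing the agent step toward the left door:
print(weight_of_evidence(0.8, 0.2))  # about +1.39: evidence for "goal = left door"
print(weight_of_evidence(0.1, 0.5))  # about -1.61: evidence against it
```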
{"title":"Explainable Goal Recognition: A Framework Based on Weight of Evidence","authors":"Abeer Alshehri, Tim Miller, Mor Vered","doi":"10.48550/arXiv.2303.05622","DOIUrl":"https://doi.org/10.48550/arXiv.2303.05622","url":null,"abstract":"We introduce and evaluate an eXplainable goal recognition (XGR) model that uses the Weight of Evidence (WoE) framework to explain goal recognition problems. Our model provides human-centered explanations that answer `why?' and `why not?' questions. We computationally evaluate the performance of our system over eight different goal recognition domains showing it does not significantly increase the underlying recognition run time. Using a human behavioral study to obtain the ground truth from human annotators, we further show that the XGR model can successfully generate human-like explanations. We then report on a study with 40 participants who observe agents playing a Sokoban game and then receive explanations of the goal recognition output. We investigated participants’ understanding obtained by explanations through task prediction, explanation satisfaction, and trust.","PeriodicalId":239898,"journal":{"name":"International Conference on Automated Planning and Scheduling","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124791246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}