Improving Time-Dependent Contraction Hierarchies
Pub Date: 2022-06-13 | DOI: 10.1609/icaps.v32i1.19818
Bojie Shen, M. A. Cheema, Daniel D. Harabor, P. J. Stuckey
Computing time-optimal shortest paths in road networks is one of the most popular applications of Artificial Intelligence. This problem is tricky to solve because road congestion affects travel times. The state of the art in this area is an algorithm called Time-dependent Contraction Hierarchies (TCH). Although fast and optimal, TCH still suffers from two main drawbacks: (1) the usual query process uses bi-directional Dijkstra search to find the shortest path, which can be time-consuming; and (2) the TCH is constructed w.r.t. the entire time domain T, which complicates the search process for queries q that start and finish in a smaller time period Tq ⊂ T. In this work, we improve TCH by making use of time-independent heuristics, which speed up optimal search, and by computing TCHs for different subsets of the time domain, which further reduces the size of the search space. We give a full description of these methods and discuss their optimality-preserving characteristics. We report significant query time improvements over a baseline implementation of TCH.
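To illustrate why time-independent heuristics help in this setting, the sketch below runs an A*-style search on a tiny time-dependent graph, using free-flow (minimum-over-the-day) travel times as a lower-bounding heuristic. This is only a conceptual illustration of how such heuristics can guide time-dependent search without losing optimality, not the paper's TCH query algorithm; the function name, toy graph, and heuristic values are made up for the example.

```python
import heapq

def td_astar(graph, source, target, t0, h):
    """A*-style search on a time-dependent graph.

    graph[u] -> list of (v, travel_time_fn), where travel_time_fn(t) returns
    the (FIFO) travel time of edge (u, v) when departing at time t.
    h[v] is a time-independent lower bound on the remaining travel time from
    v to target (e.g. free-flow travel time), so optimality is preserved.
    """
    open_list = [(t0 + h[source], t0, source)]
    best_arrival = {source: t0}
    while open_list:
        f, t, u = heapq.heappop(open_list)
        if u == target:
            return t - t0                      # total travel time
        if t > best_arrival.get(u, float("inf")):
            continue                           # stale queue entry
        for v, travel_time in graph[u]:
            t_arr = t + travel_time(t)
            if t_arr < best_arrival.get(v, float("inf")):
                best_arrival[v] = t_arr
                heapq.heappush(open_list, (t_arr + h[v], t_arr, v))
    return float("inf")

# Tiny example: edge (0, 1) becomes congested from t = 10 onward.
graph = {
    0: [(1, lambda t: 5 if t < 10 else 10), (2, lambda t: 4)],
    1: [(3, lambda t: 2)],
    2: [(3, lambda t: 6)],
    3: [],
}
h = {0: 7, 1: 2, 2: 6, 3: 0}                   # free-flow lower bounds to node 3
print(td_astar(graph, 0, 3, t0=12))            # -> 10 (via node 2)
```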
{"title":"Improving Time-Dependent Contraction Hierarchies","authors":"Bojie Shen, M. A. Cheema, Daniel D. Harabor, P. J. Stuckey","doi":"10.1609/icaps.v32i1.19818","DOIUrl":"https://doi.org/10.1609/icaps.v32i1.19818","url":null,"abstract":"Computing time-optimal shortest paths, in road networks, is one of the most popular applications of Artificial Intelligence. This problem is tricky to solve because road congestion affects travel times. The state-of-the-art in this area is an algorithm called Time-dependent Contraction Hierarchies (TCH). Although fast and optimal, TCH still suffers from two main drawbacks: (1) the usual query process uses bi-directional Dijkstra search to find the shortest path, which can be time-consuming; and (2) the TCH is constructed w.r.t. the entire time domain T, which complicates the search process for queries q that start and finish in a smaller time period Tq ⊂ T. In this work, we improve TCH by making use of time-independent heuristics, which speed up optimal search, and by computing TCHs for different subsets of the time domain, which further reduces the size of the search space. We give a full description of these methods and discuss their optimality-preserving characteristics. We report significant query time improvements against a baseline implementation of TCH.","PeriodicalId":239898,"journal":{"name":"International Conference on Automated Planning and Scheduling","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133002538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reinforcement Learning Approach to Solve Dynamic Bi-objective Police Patrol Dispatching and Rescheduling Problem
Pub Date: 2022-06-13 | DOI: 10.1609/icaps.v32i1.19831
Waldy Joe, H. Lau, Jonathan Pan
Police patrol aims to fulfill two main objectives: to project presence and to respond to incidents in a timely manner. Incidents happen dynamically and can disrupt the initially planned patrol schedules. The key decisions are which patrol agent to dispatch in response to an incident and, subsequently, how to adapt the patrol schedules to such dynamically occurring incidents while still fulfilling both objectives, which can sometimes conflict. In this paper, we define this real-world problem as a Dynamic Bi-Objective Police Patrol Dispatching and Rescheduling Problem and propose a solution approach that combines Deep Reinforcement Learning (specifically, neural-network-based Temporal-Difference learning with experience replay) to approximate the value function with a rescheduling heuristic based on ejection chains, learning both dispatching and rescheduling policies jointly. To address the dual objectives, we propose a reward function that implicitly tries to maximize the rate of successfully responding to an incident within a response time target while minimizing the reduction in patrol presence, without the need to explicitly set predetermined weights for each objective. The proposed approach is able to compute both dispatching and rescheduling decisions almost instantaneously. Ours is the first work in the literature that takes into account these dual patrol objectives and the real-world operational consideration that incident response may disrupt existing patrol schedules.
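The abstract does not spell out how a single dispatch decision is made at run time, so the following sketch only illustrates the general shape of such a step: score each candidate agent with a learned value function over post-decision states and adopt the repaired schedule of the best-scoring agent. Both value_fn and reschedule are hypothetical placeholders standing in for the paper's TD-trained value network and ejection-chain heuristic.

```python
import random

def dispatch_and_reschedule(incident, agents, schedules, value_fn, reschedule):
    """Choose the agent whose dispatch leads to the post-decision state with
    the highest estimated value, then adopt that agent's repaired schedule.
    value_fn and reschedule are placeholders for learned/heuristic components."""
    best_agent, best_value, best_schedule = None, float("-inf"), None
    for agent in agents:
        repaired = reschedule(schedules[agent], incident)
        v = value_fn((agent, incident, repaired))   # simplified post-decision state
        if v > best_value:
            best_agent, best_value, best_schedule = agent, v, repaired
    schedules[best_agent] = best_schedule
    return best_agent

# Toy usage with stub components.
agents = ["A1", "A2"]
schedules = {"A1": ["sector1", "sector2"], "A2": ["sector3"]}
value_fn = lambda state: -len(state[2]) + 0.1 * random.random()   # stub value estimate
reschedule = lambda sched, inc: [inc] + sched                     # stub schedule repair
print(dispatch_and_reschedule("incident7", agents, schedules, value_fn, reschedule))
```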
{"title":"Reinforcement Learning Approach to Solve Dynamic Bi-objective Police Patrol Dispatching and Rescheduling Problem","authors":"Waldy Joe, H. Lau, Jonathan Pan","doi":"10.1609/icaps.v32i1.19831","DOIUrl":"https://doi.org/10.1609/icaps.v32i1.19831","url":null,"abstract":"Police patrol aims to fulfill two main objectives namely to project presence and to respond to incidents in a timely manner. Incidents happen dynamically and can disrupt the initially-planned patrol schedules. The key decisions to be made will be which patrol agent to be dispatched to respond to an incident and subsequently how to adapt the patrol schedules in response to such dynamically-occurring incidents whilst still fulfilling both objectives; which sometimes can be conflicting. In this paper, we define this real-world problem as a Dynamic Bi-Objective Police Patrol Dispatching and Rescheduling Problem and propose a solution approach that combines Deep Reinforcement Learning (specifically neural networks-based Temporal-Difference learning with experience replay) to approximate the value function and a rescheduling heuristic based on ejection chains to learn both dispatching and rescheduling policies jointly. To address the dual objectives, we propose a reward function that implicitly tries to maximize the rate of successfully responding to an incident within a response time target while minimizing the reduction in patrol presence without the need to explicitly set predetermined weights for each objective. The proposed approach is able to compute both dispatching and rescheduling decisions almost instantaneously. Our work serves as the first work in the literature that takes into account these dual patrol objectives and real-world operational consideration where incident response may disrupt existing patrol schedules.","PeriodicalId":239898,"journal":{"name":"International Conference on Automated Planning and Scheduling","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134123566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Active Grammatical Inference for Non-Markovian Planning
Pub Date: 2022-06-13 | DOI: 10.1609/icaps.v32i1.19853
Noah Topper, George K. Atia, Ashutosh Trivedi, Alvaro Velasquez
Planning in finite stochastic environments is canonically posed as a Markov decision process where the transition and reward structures are explicitly known. Reinforcement learning (RL) lifts the explicitness assumption by working with sampling models instead. Further, with the advent of reward machines, we can relax the Markovian assumption on the reward. Angluin's active grammatical inference algorithm L* has found novel application in explicating reward machines for non-Markovian RL. We propose maintaining the assumption of explicit transition dynamics, but with an implicit non-Markovian reward signal, which must be inferred from experiments. We call this setting non-Markovian planning, as opposed to non-Markovian RL. The proposed approach leverages L* to explicate an automaton structure for the underlying planning objective. We exploit the environment model to learn an automaton faster and integrate it with value iteration to accelerate the planning. We compare against recent non-Markovian RL solutions which leverage grammatical inference, and establish complexity results that illustrate the difference in runtime between grammatical inference in planning and RL settings.
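Once an automaton for the non-Markovian objective has been explicated (e.g., via L*), planning can proceed by value iteration on the product of the explicit MDP and the automaton. The sketch below shows that product construction in a minimal form; the container layouts (P, labels, delta, reward) are assumptions made for this example, not the paper's implementation.

```python
def product_value_iteration(P, labels, delta, reward, gamma=0.95, eps=1e-6):
    """Value iteration on the product of an explicit MDP and a reward automaton.

    P[s][a]        -> list of (s2, prob) pairs (explicit transition model)
    labels[s2]     -> observation emitted when entering s2
    delta[(q, o)]  -> successor automaton state (assumed total)
    reward[(q, o)] -> scalar reward for firing that automaton edge
    """
    qs = {q for (q, _) in delta} | set(delta.values())
    V = {(s, q): 0.0 for s in P for q in qs}
    while True:
        diff = 0.0
        for (s, q) in V:
            if not P[s]:                       # no applicable actions: keep value
                continue
            best = float("-inf")
            for a in P[s]:
                val = 0.0
                for s2, prob in P[s][a]:
                    o = labels[s2]
                    val += prob * (reward[(q, o)] + gamma * V[(s2, delta[(q, o)])])
                best = max(best, val)
            diff = max(diff, abs(best - V[(s, q)]))
            V[(s, q)] = best
        if diff < eps:
            return V
```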
{"title":"Active Grammatical Inference for Non-Markovian Planning","authors":"Noah Topper, George K. Atia, Ashutosh Trivedi, Alvaro Velasquez","doi":"10.1609/icaps.v32i1.19853","DOIUrl":"https://doi.org/10.1609/icaps.v32i1.19853","url":null,"abstract":"Planning in finite stochastic environments is canonically posed as a Markov decision process where the transition and reward structures are explicitly known. Reinforcement learning (RL) lifts the explicitness assumption by working with sampling models instead. Further, with the advent of reward machines, we can relax the Markovian assumption on the reward. Angluin's active grammatical inference algorithm L* has found novel application in explicating reward machines for non-Markovian RL. We propose maintaining the assumption of explicit transition dynamics, but with an implicit non-Markovian reward signal, which must be inferred from experiments. We call this setting non-Markovian planning, as opposed to non-Markovian RL. The proposed approach leverages L* to explicate an automaton structure for the underlying planning objective. We exploit the environment model to learn an automaton faster and integrate it with value iteration to accelerate the planning. We compare against recent non-Markovian RL solutions which leverage grammatical inference, and establish complexity results that illustrate the difference in runtime between grammatical inference in planning and RL settings.","PeriodicalId":239898,"journal":{"name":"International Conference on Automated Planning and Scheduling","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133342300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Hybrid Genetic Algorithm for the Vehicle Routing Problem with Roaming Delivery Locations
Pub Date: 2022-06-13 | DOI: 10.1609/icaps.v32i1.19813
Quang Anh Pham, Minh Hoàng Hà, Duy Manh Vu, Huy-Hoang Nguyen
The Vehicle Routing Problem with Roaming Delivery Locations (VRPRDL) is a variant of the Vehicle Routing Problem (VRP) in which a customer can be present at multiple locations during a working day, and a time window is associated with each location. The objective is to find a set of routes such that (i) the total traveling cost is minimized, (ii) only one location of each customer is visited within its time window, and (iii) all capacity constraints are satisfied. To solve the problem, we introduce a hybrid genetic algorithm that relies on a problem-tailored solution representation, mutation and local search operators, as well as a set covering component that explores routes found during the search to find better solutions. We also propose a new split procedure based on dynamic programming to evaluate the fitness of chromosomes. Experiments conducted on the benchmark instances clearly show that our proposed algorithm outperforms existing approaches in terms of stability and solution quality. We also improve 49 best-known solutions from the literature.
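For context, a split procedure of the kind mentioned above cuts a chromosome (a giant tour over all customers) into vehicle routes by dynamic programming over an auxiliary shortest-path problem. The sketch below shows the classical capacity-only version of this idea; the paper's procedure additionally has to handle roaming delivery locations and time windows, which are omitted here, and all names are illustrative.

```python
def split(chromosome, demand, dist, depot, capacity):
    """Prins-style split: optimally cut a giant tour into capacity-feasible
    routes via a shortest path in an auxiliary DAG.  dist[(a, b)] is the
    travel cost between nodes a and b."""
    n = len(chromosome)
    best = [float("inf")] * (n + 1)        # best[i]: cost of serving first i customers
    pred = [-1] * (n + 1)
    best[0] = 0.0
    for i in range(n):                     # route starts with customer i
        load, cost = 0.0, 0.0
        for j in range(i, n):              # route ends with customer j
            load += demand[chromosome[j]]
            if load > capacity:
                break
            if j == i:
                cost = dist[(depot, chromosome[i])]
            else:
                cost += dist[(chromosome[j - 1], chromosome[j])]
            total = best[i] + cost + dist[(chromosome[j], depot)]
            if total < best[j + 1]:
                best[j + 1], pred[j + 1] = total, i
    routes, j = [], n                      # recover routes from predecessor labels
    while j > 0:
        i = pred[j]
        routes.append(chromosome[i:j])
        j = i
    return best[n], list(reversed(routes))

# Tiny example: 3 unit-demand customers on a line, vehicle capacity 2.
chrom = ["c1", "c2", "c3"]
demand = {"c1": 1, "c2": 1, "c3": 1}
pts = {"d": 0, "c1": 1, "c2": 2, "c3": 3}
dist = {(a, b): abs(pts[a] - pts[b]) for a in pts for b in pts}
print(split(chrom, demand, dist, "d", capacity=2))  # -> (8.0, [['c1'], ['c2', 'c3']])
```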
{"title":"A Hybrid Genetic Algorithm for the Vehicle Routing Problem with Roaming Delivery Locations","authors":"Quang Anh Pham, Minh Hoàng Hà, Duy Manh Vu, Huy-Hoang Nguyen","doi":"10.1609/icaps.v32i1.19813","DOIUrl":"https://doi.org/10.1609/icaps.v32i1.19813","url":null,"abstract":"The Vehicle Routing Problem with Roaming Delivery Locations (VRPRDL) is a variant of the Vehicle Routing Problem (VRP) in which a customer can be present at many locations during a working day and a time window is associated with each location. The objective is to find a set of routes such that (i) the total traveling cost is minimized, (ii) only one location of each customer is visited within its time window, and (iii) all capacity constraints are satisfied. To solve the problem, we introduce a hybrid genetic algorithm which relies on problem-tailored solution representation, mutation, local search operators, as well as a set covering component exploring routes found during the search to find better solutions. We also propose a new split procedure which based on dynamic programming to evaluate the fitness of chromosomes. Experiments conducted on the benchmark instances clearly show that our proposed algorithm outperforms existing approaches in terms of stability and solution quality. We also improve 49 best known solutions of the literature.","PeriodicalId":239898,"journal":{"name":"International Conference on Automated Planning and Scheduling","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116604066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep Reinforcement Learning for a Multi-Objective Online Order Batching Problem
Pub Date: 2022-06-13 | DOI: 10.1609/icaps.v32i1.19829
M. Beeks, Reza Refaei Afshar, Yingqian Zhang, R. Dijkman, Claudy van Dorst, S. D. Looijer
On-time delivery and low service costs are two important performance metrics in warehousing operations. This paper proposes a Deep Reinforcement Learning (DRL) based approach to solve the online Order Batching and Sequence Problem (OBSP) and optimize these two objectives. To learn how to balance the trade-off between the two objectives, we introduce a Bayesian optimization framework to shape the reward function of the DRL agent, so that the influence of each objective on learning can be adjusted to different environments. We compare our approach with several heuristics using problem instances of real-world size, where thousands of orders arrive dynamically per hour. We show that the Proximal Policy Optimization (PPO) algorithm with Bayesian optimization outperforms the heuristics in all tested scenarios on both objectives. In addition, it finds different weights for the components of the reward function in different scenarios, indicating its capability to learn how to set the importance of the two objectives under different environments. We also provide a policy analysis of the learned DRL agent, in which a decision tree is used to infer decision rules and enable the interpretability of the DRL approach.
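The outer loop of such an approach can be pictured as follows: candidate reward-component weights are proposed, a DRL agent is trained and evaluated with the shaped reward, and the best-performing weights are kept. The sketch uses plain random search and a stub in place of PPO training purely to stay dependency-free; the paper uses a Bayesian optimization framework for the proposal step, and all names here are illustrative.

```python
import random

def train_and_evaluate(weights):
    """Placeholder for the inner loop: train a PPO agent with a reward shaped
    as weights[0] * on_time_term + weights[1] * cost_term and return a scalar
    validation score (higher is better).  Stubbed so the sketch runs."""
    return -(weights[0] - 0.7) ** 2 - (weights[1] - 0.3) ** 2

def tune_reward_weights(n_trials=20, seed=0):
    """Outer search over reward-component weights; random search stands in
    for the Bayesian optimization used in the paper."""
    rng = random.Random(seed)
    best_w, best_score = None, float("-inf")
    for _ in range(n_trials):
        w = (rng.random(), rng.random())
        score = train_and_evaluate(w)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score

print(tune_reward_weights())
```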
{"title":"Deep Reinforcement Learning for a Multi-Objective Online Order Batching Problem","authors":"M. Beeks, Reza Refaei Afshar, Yingqian Zhang, R. Dijkman, Claudy van Dorst, S. D. Looijer","doi":"10.1609/icaps.v32i1.19829","DOIUrl":"https://doi.org/10.1609/icaps.v32i1.19829","url":null,"abstract":"On-time delivery and low service costs are two important performance metrics in warehousing operations. This paper proposes a Deep Reinforcement Learning (DRL) based approach to solve the online Order Batching and Sequence Problem (OBSP) to optimize these two objectives. \u0000To learn how to balance the trade-off between two objectives, we introduce a Bayesian optimization framework to shape the reward function of the DRL agent, such that the influences of learning to these objectives are adjusted to different environments. We compare our approach with several heuristics using problem instances of real-world size where thousands of orders arrive dynamically per hour. \u0000We show the Proximal Policy Optimization (PPO) algorithm with Bayesian optimization outperforms the heuristics in all tested scenarios on both objectives. In addition, it finds different weights for the components in the reward function in different scenarios, indicating its capability of learning how to set the importance of two objectives under different environments. We also provide policy analysis on the learned DRL agent, where a decision tree is used to infer decision rules to enable the interpretability of the DRL approach.","PeriodicalId":239898,"journal":{"name":"International Conference on Automated Planning and Scheduling","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123662464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Actor-Focused Interactive Visualization for AI Planning
Pub Date: 2022-06-13 | DOI: 10.1609/icaps.v32i1.19857
G. Cantareira, Gerard Canal, R. Borgo
As we grow more reliant on AI systems for an increasing variety of applications in our lives, the need to understand and interpret such systems also becomes more pronounced, be it for improvement, trust, or legal liability. AI Planning is one type of task that poses explanation challenges, particularly due to the increasing complexity of generated plans and the convoluted causal chains that connect actions and determine overall plan structure. While there are many recent techniques to support plan explanation, visual aids for navigating this data are quite limited. Furthermore, there is often a barrier between techniques focused on abstract planning concepts and domain-related explanations. In this paper, we present a visual analytics tool to support plan summarization and interaction, focusing on robotics domains and using an actor-based structure. We show how users can quickly grasp vital information about the actions involved in a plan and how they relate to each other. Finally, we present the framework used to design our tool, highlighting how general PDDL elements can be converted into visual representations, further connecting planning concepts to the domain.
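As a rough illustration of the actor-based structure, the snippet below groups a grounded plan into per-actor timelines, assuming (purely for the example) that the first argument of each action names the acting robot; a visual front end could then render one lane per actor. This is not the paper's data model, only a sketch of the idea.

```python
from collections import defaultdict

def actor_lanes(plan):
    """Group a grounded plan into per-actor timelines, assuming the first
    argument of each action names the acting robot."""
    lanes = defaultdict(list)
    for step, (action, args) in enumerate(plan):
        actor = args[0] if args else "global"
        lanes[actor].append((step, action, args[1:]))
    return dict(lanes)

plan = [
    ("navigate", ("robot1", "kitchen")),
    ("pick", ("robot1", "cup")),
    ("navigate", ("robot2", "table")),
    ("place", ("robot1", "cup", "table")),
]
print(actor_lanes(plan))
```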
{"title":"Actor-Focused Interactive Visualization for AI Planning","authors":"G. Cantareira, Gerard Canal, R. Borgo","doi":"10.1609/icaps.v32i1.19857","DOIUrl":"https://doi.org/10.1609/icaps.v32i1.19857","url":null,"abstract":"As we grow more reliant on AI systems for an increasing variety of applications in our lives, the need to understand and interpret such systems also becomes more pronounced, be it for improvement, trust, or legal liability. AI Planning is one type of task that provides explanation challenges, particularly due to the increasing complexity in generated plans and convoluted causal chains that connect actions and determine overall plan structure. While there are many recent techniques to support plan explanation, visual aids for navigating this data are quite limited. Furthermore, there is often a barrier between techniques focused on abstract planning concepts and domain-related explanations. In this paper, we present a visual analytics tool to support plan summarization and interaction, focusing in robotics domains using an actor-based structure. We show how users can quickly grasp vital information about actions involved in a plan and how they relate to each other. Finally, we present a framework used to design our tool, highlighting how general PDDL elements can be converted into visual representations and further connecting concept to domain.","PeriodicalId":239898,"journal":{"name":"International Conference on Automated Planning and Scheduling","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129629248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural Network Action Policy Verification via Predicate Abstraction
Pub Date: 2022-06-13 | DOI: 10.1609/icaps.v32i1.19822
Marcel Vinzent, Marcel Steinmetz, Jörg Hoffmann
Neural networks (NN) are an increasingly important representation of action policies. Verifying that such policies are safe is potentially very hard, as it compounds the state space explosion with the difficulty of analyzing even single NN decision episodes. Here we address that challenge through abstract reachability analysis. We show how to compute predicate abstractions of the policy state space subgraph induced by fixing an NN action policy. A key sub-problem here is the computation of abstract state transitions that may be taken by the policy, which, as we show, can be tackled by connecting to off-the-shelf SMT solvers. We devise a range of algorithmic enhancements, leveraging relaxed tests to avoid costly calls to SMT. We empirically evaluate the resulting machinery on a collection of benchmarks. The results show that our enhancements are required for practicality, and that our approach can outperform two competing approaches based on explicit enumeration and bounded-length verification.
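The abstract-transition sub-problem can be illustrated with an off-the-shelf SMT solver such as Z3: assert the concretization of the source abstract state, a constraint describing which action the policy selects, the action's effect, and the target predicate, then check satisfiability. The tiny example below uses a linear stand-in for the NN's decision condition (the actual approach encodes the network itself); the variables, predicates, and action are invented for illustration.

```python
from z3 import Real, Solver, And, sat

# Two state variables and two abstraction predicates p1: x >= 5, p2: y >= 3.
x, y = Real("x"), Real("y")
x2, y2 = Real("x_next"), Real("y_next")

# Stand-in for the policy: it picks action "inc_y" whenever x - y >= 1
# (a real instance would encode the NN's piecewise-linear behaviour instead).
policy_picks_inc_y = x - y >= 1

# Action "inc_y": y' = y + 2, x' = x.
effect = And(x2 == x, y2 == y + 2)

# Question: from abstract state {p1, not p2}, can the policy-induced action
# reach an abstract state where p2 holds?
source = And(x >= 5, y < 3)          # concretization of the abstract source
target = y2 >= 3                     # p2 in the successor

s = Solver()
s.add(source, policy_picks_inc_y, effect, target)
if s.check() == sat:
    print("abstract transition possible, witness:", s.model())
else:
    print("transition refuted")
```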
{"title":"Neural Network Action Policy Verification via Predicate Abstraction","authors":"Marcel Vinzent, Marcel Steinmetz, Jörg Hoffmann","doi":"10.1609/icaps.v32i1.19822","DOIUrl":"https://doi.org/10.1609/icaps.v32i1.19822","url":null,"abstract":"Neural networks (NN) are an increasingly important representation of action policies. Verifying that such policies are safe is potentially very hard as it compounds the state space explosion with the difficulty of analyzing even single NN decision episodes. Here we address that challenge through abstract reachability analysis. We show how to compute predicate abstractions of the policy state space subgraph induced by fixing an NN action policy. A key sub-problem here is the computation of abstract state transitions that may be taken by the policy, which as we show can be tackled by connecting to off-the-shelf SMT solvers. We devise a range of algorithmic enhancements, leveraging relaxed tests to avoid costly calls to SMT. We empirically evaluate the resulting machinery on a collection of benchmarks. The results show that our enhancements are required for practicality, and that our approach can outperform two competing approaches based on explicit enumeration and bounded-length verification.","PeriodicalId":239898,"journal":{"name":"International Conference on Automated Planning and Scheduling","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115244297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributed Fleet Management in Noisy Environments via Model-Predictive Control
Pub Date: 2022-06-13 | DOI: 10.1609/icaps.v32i1.19843
Simon Boegh, P. G. Jensen, Martin Kristjansen, K. Larsen, Ulrik Nyman
We consider dynamic route planning for a fleet of Autonomous Mobile Robots (AMRs) performing fetch-and-carry tasks on a shared factory floor. In this paper, we propose Stochastic Work Graphs (SWG) as a formalism for capturing the semantics of such distributed and uncertain planning problems. We encode SWGs in the form of a Euclidean Markov Decision Process (EMDP) in the tool Uppaal Stratego, which employs Q-learning to synthesize near-optimal plans. Furthermore, we deploy the tool in an online and distributed fashion to facilitate scalable, rapid replanning. While executing its current plan, each AMR generates a new plan incorporating updated information about the other AMRs' positions and plans. We propose a two-layer Model Predictive Control structure (waypoint and station planning), with each layer solved individually by the Q-learning-based solver. We demonstrate our approach using the ARGoS3 large-scale robot simulator, where we simulate the AMR movement and observe up to a 27.5% improvement in makespan over a greedy approach to planning. To do so, we have implemented the full software stack, translating observations into SWGs and solving them with our proposed method. In addition, we construct a benchmark platform for comparing planning techniques on a reasonably realistic physical simulation and provide it under the MIT open-source license.
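The online, distributed use of the planner follows a standard receding-horizon pattern: each AMR replans over a short horizon with the latest shared information and executes only the first step of its plan. The sketch below shows that loop for a single AMR with placeholder components; in the paper the inner solver is the Q-learning machinery of Uppaal Stratego, which is not reproduced here, and all names are illustrative.

```python
def mpc_loop(amr_state, horizon, plan_waypoints, execute_first_step,
             get_observations, steps=50):
    """Receding-horizon control for one AMR: plan over a short horizon with
    the latest shared information (other AMRs' positions and plans), execute
    only the first step, then replan.  plan_waypoints is a placeholder for
    the Q-learning-based solver."""
    for _ in range(steps):
        obs = get_observations()                 # other AMRs' positions/plans
        plan = plan_waypoints(amr_state, obs, horizon)
        if not plan:
            break                                # nothing left to do
        amr_state = execute_first_step(amr_state, plan[0])
    return amr_state

# Toy usage: an AMR at position 0 moving towards position 5 on a line.
goal = 5
plan_waypoints = lambda s, obs, h: list(range(s + 1, min(goal, s + h) + 1))
execute_first_step = lambda s, wp: wp
get_observations = lambda: {}                    # no other AMRs in this toy
print(mpc_loop(0, horizon=2, plan_waypoints=plan_waypoints,
               execute_first_step=execute_first_step,
               get_observations=get_observations))  # -> 5
```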
{"title":"Distributed Fleet Management in Noisy Environments via Model-Predictive Control","authors":"Simon Boegh, P. G. Jensen, Martin Kristjansen, K. Larsen, Ulrik Nyman","doi":"10.1609/icaps.v32i1.19843","DOIUrl":"https://doi.org/10.1609/icaps.v32i1.19843","url":null,"abstract":"We consider dynamic route planning for a fleet of Autonomous Mobile Robots (AMRs) doing fetch and carry tasks on a shared factory floor. In this paper, we propose Stochastic Work Graphs (SWG) as a formalism for capturing the semantics of such distributed and uncertain planning problems. We encode SWGs in the form of a Euclidean Markov Decision Process (EMDP) in the tool Uppaal Stratego, which employs Q-Learning to synthesize near-optimal plans. Furthermore, we deploy the tool in an online and distributed fashion to facilitate scalable, rapid replanning. While executing their current plan, each AMR generates a new plan incorporating updated information about the other AMRs positions and plans. We propose a two-layer Model Predictive Controller-structure (waypoint and station planning), each individually solved by the Q-learning-based solver. We demonstrate our approach using ARGoS3 large-scale robot simulation, where we simulate the AMR movement and observe an up to 27.5% improvement in makespan over a greedy approach to planning. To do so, we have implemented the full software stack, translating observations into SWGs and solving those with our proposed method. In addition, we construct a benchmark platform for comparing planning techniques on a reasonably realistic physical simulation and provide this under the MIT open-source license.","PeriodicalId":239898,"journal":{"name":"International Conference on Automated Planning and Scheduling","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115691364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Debugging a Policy: Automatic Action-Policy Testing in AI Planning
Pub Date: 2022-06-13 | DOI: 10.1609/icaps.v32i1.19820
Marcel Steinmetz, Daniel Fiser, Hasan Ferit Eniser, Patrick Ferber, Timo P. Gros, Philippe Heim, D. Höller, Xandra Schuler, Valentin Wüstholz, M. Christakis, Jörg Hoffmann
Testing is a promising way to gain trust in neural action policies π. Previous work on policy testing in sequential decision making targeted environment behavior leading to failure conditions. But if the failure is unavoidable given that behavior, then π is not actually to blame. For a situation to qualify as a "bug" in π, there must be an alternative policy π' that does better. We introduce a generic policy testing framework based on that intuition. This raises the bug confirmation problem: deciding whether or not a state is a bug. We analyze the use of optimistic and pessimistic bounds for the design of test oracles approximating that problem. We contribute an implementation of our framework in classical planning, experimenting with several test oracles and with random-walk methods that generate test states biased towards poor policy performance and/or state novelty. We evaluate these techniques on policies π learned with ASNets. We find that they are able to effectively identify bugs in these π, and that our random-walk biases improve over uninformed baselines.
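A test oracle built from such bounds can be pictured as follows: if an upper bound on the best achievable cost from a state is below a lower bound on the policy's own cost, the state is provably a bug; if a lower bound on the best achievable cost is at least the policy's upper bound, it provably is not. This reading is our own simplification of the optimistic/pessimistic-bound idea, and the argument names below are illustrative.

```python
def bug_oracle(opt_cost_ub, opt_cost_lb, policy_cost_lb, policy_cost_ub):
    """Classify a state using cost bounds only.

    opt_cost_ub / opt_cost_lb bound the best achievable cost from the state
    (e.g. the cost of any plan found / an admissible heuristic);
    policy_cost_lb / policy_cost_ub bound the cost the policy pi incurs from
    the state.  A state is a bug iff some alternative policy beats pi."""
    if opt_cost_ub < policy_cost_lb:
        return "bug confirmed"        # something provably beats the policy
    if opt_cost_lb >= policy_cost_ub:
        return "no bug"               # the policy is provably unbeatable here
    return "unknown"

print(bug_oracle(opt_cost_ub=12, opt_cost_lb=8,
                 policy_cost_lb=15, policy_cost_ub=20))  # -> bug confirmed
```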
{"title":"Debugging a Policy: Automatic Action-Policy Testing in AI Planning","authors":"Marcel Steinmetz, Daniel Fiser, Hasan Ferit Eniser, Patrick Ferber, Timo P. Gros, Philippe, Heim, D. Höller, Xandra Schuler, Valentin Wüstholz, M. Christakis, Jörg Hoffmann","doi":"10.1609/icaps.v32i1.19820","DOIUrl":"https://doi.org/10.1609/icaps.v32i1.19820","url":null,"abstract":"Testing is a promising way to gain trust in neural action policies π. Previous work on policy testing in sequential decision making targeted environment behavior leading to failure conditions. But if the failure is unavoidable given that behavior, then π is not actually to blame. For a situation to qualify as a \"bug\" in π, there must be an alternative policy π' that does better. We introduce a generic policy testing framework based on that intuition. This raises the bug confirmation problem, deciding whether or not a state is a bug. We analyze the use of optimistic and pessimistic bounds for the design of test oracles approximating that problem. We contribute an implementation of our framework in classical planning, experimenting with several test oracles and with random-walk methods generating test states biased to poor policy performance and/or state novelty. We evaluate these techniques on policies π learned with ASNets. We find that they are able to effectively identify bugs in these π, and that our random-walk biases improve over uninformed baselines.","PeriodicalId":239898,"journal":{"name":"International Conference on Automated Planning and Scheduling","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131380831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An End-to-End Automatic Cache Replacement Policy Using Deep Reinforcement Learning
Pub Date: 2022-06-13 | DOI: 10.1609/icaps.v32i1.19840
Yangqing Zhou, Fang Wang, Zhan Shi, D. Feng
In the past few decades, much research has been conducted on the design of cache replacement policies. Prior work frequently relies on manually engineered heuristics to capture the most common cache access patterns, or predicts the reuse distance and tries to identify blocks that are either cache-friendly or cache-averse. Researchers are now applying recent advances in machine learning to guide cache replacement policies, augmenting or replacing traditional heuristics and data structures. However, most existing approaches depend on a particular environment, which restricts their application; e.g., most of them only consider on-chip caches and rely on program counters (PCs). Moreover, the approaches with attractive hit rates are usually unable to deal with modern irregular workloads, due to the limited features used. In contrast, we propose a pervasive cache replacement framework that automatically learns the relationship between the probability distribution over different replacement policies and the workload distribution by using deep reinforcement learning. We train an end-to-end cache replacement policy only on past requested addresses, through two simple and stable cache replacement policies. Furthermore, the overall framework can easily be plugged into any scenario that requires a cache. Our simulation results on 8 production storage traces run against 3 different cache configurations confirm that the proposed cache replacement policy is effective and outperforms several state-of-the-art approaches.
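The phrase "through two simple and stable cache replacement policies" suggests an agent that arbitrates between two base policies using only the stream of past requested addresses. The sketch below shows such a cache with LRU and LFU as the base policies (an assumption; the abstract does not name them) and a hand-written heuristic standing in for the learned DRL agent.

```python
from collections import OrderedDict, Counter

class HybridCache:
    """On each eviction, an agent chooses between two simple base policies
    (LRU and LFU here, purely for illustration) using features built from
    recently requested addresses.  A stub heuristic stands in for the DRL
    policy described in the paper."""

    def __init__(self, capacity, history_len=64):
        self.capacity = capacity
        self.store = OrderedDict()            # address -> None, in recency order
        self.freq = Counter()
        self.history = []
        self.history_len = history_len

    def _agent_prefers_lru(self):
        # Placeholder for the learned policy: prefer LRU when the recent
        # request stream shows little reuse, LFU otherwise.
        reuse = len(self.history) - len(set(self.history))
        return reuse < self.history_len // 4

    def access(self, addr):
        self.history = (self.history + [addr])[-self.history_len:]
        self.freq[addr] += 1
        if addr in self.store:
            self.store.move_to_end(addr)
            return True                                        # hit
        if len(self.store) >= self.capacity:
            if self._agent_prefers_lru():
                self.store.popitem(last=False)                 # evict LRU block
            else:
                victim = min(self.store, key=self.freq.__getitem__)
                del self.store[victim]                         # evict LFU block
        self.store[addr] = None
        return False                                           # miss

cache = HybridCache(capacity=2)
hits = sum(cache.access(a) for a in [1, 2, 1, 3, 1, 2])
print(hits)   # number of hits on this toy trace
```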
{"title":"An End-to-End Automatic Cache Replacement Policy Using Deep Reinforcement Learning","authors":"Yangqing Zhou, Fang Wang, Zhan Shi, D. Feng","doi":"10.1609/icaps.v32i1.19840","DOIUrl":"https://doi.org/10.1609/icaps.v32i1.19840","url":null,"abstract":"In the past few decades, much research has been conducted on the design of cache replacement policies. Prior work frequently relies on manually-engineered heuristics to capture the most common cache access patterns, or predict the reuse distance and try to identify the blocks that are either cache-friendly or cache-averse. Researchers are now applying recent advances in machine learning to guide cache replacement policy, augmenting or replacing traditional heuristics and data structures. However, most existing approaches depend on the certain environment which restricted their application, e.g, most of the approaches only consider the on-chip cache consisting of program counters (PCs). Moreover, those approaches with attractive hit rates are usually unable to deal with modern irregular workloads, due to the limited feature used. In contrast, we propose a pervasive cache replacement framework to automatically learn the relationship between the probability distribution of different replacement policies and workload distribution by using deep reinforcement learning. We train an end-to-end cache replacement policy only on the past requested address through two simple and stable cache replacement policies. Furthermore, the overall framework can be easily plugged into any scenario that requires cache. Our simulation results on 8 production storage traces run against 3 different cache configurations confirm that the proposed cache replacement policy is effective and outperforms several state-of-the-art approaches.","PeriodicalId":239898,"journal":{"name":"International Conference on Automated Planning and Scheduling","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125537086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}