Learning in Structured MDPs with Convex Cost Functions: Improved Regret Bounds for Inventory Management

Shipra Agrawal, Randy Jia
{"title":"具有凸成本函数的结构化mdp学习:改进的库存管理后悔边界","authors":"Shipra Agrawal, Randy Jia","doi":"10.1145/3328526.3329565","DOIUrl":null,"url":null,"abstract":"We consider a stochastic inventory control problem under censored demands, lost sales, and positive lead times. This is a fundamental problem in inventory management, with significant literature establishing near-optimality of a simple class of policies called \"base-stock policies\" for the underlying Markov Decision Process (MDP), as well as convexity of long run average-cost under those policies. We consider the relatively less studied problem of designing a learning algorithm for this problem when the underlying demand distribution is unknown. The goal is to bound regret of the algorithm when compared to the best base-stock policy. We utilize the convexity properties and a newly derived bound on bias of base-stock policies to establish a connection to stochastic convex bandit optimization. Our main contribution is a learning algorithm with a regret bound of ~O (L√T+D) for the inventory control problem. Here L is the fixed and known lead time, and D is an unknown parameter of the demand distribution described roughly as the number of time steps needed to generate enough demand for depleting one unit of inventory. Notably, even though the state space of the underlying MDP is continuous and L-dimensional, our regret bounds depend linearly on L. Our results significantly improve the previously best known regret bounds for this problem where the dependence on L was exponential and many further assumptions on demand distribution were required. The techniques presented here may be of independent interest for other settings that involve large structured MDPs but with convex cost functions.","PeriodicalId":416173,"journal":{"name":"Proceedings of the 2019 ACM Conference on Economics and Computation","volume":"85 4","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"39","resultStr":"{\"title\":\"Learning in Structured MDPs with Convex Cost Functions: Improved Regret Bounds for Inventory Management\",\"authors\":\"Shipra Agrawal, Randy Jia\",\"doi\":\"10.1145/3328526.3329565\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We consider a stochastic inventory control problem under censored demands, lost sales, and positive lead times. This is a fundamental problem in inventory management, with significant literature establishing near-optimality of a simple class of policies called \\\"base-stock policies\\\" for the underlying Markov Decision Process (MDP), as well as convexity of long run average-cost under those policies. We consider the relatively less studied problem of designing a learning algorithm for this problem when the underlying demand distribution is unknown. The goal is to bound regret of the algorithm when compared to the best base-stock policy. We utilize the convexity properties and a newly derived bound on bias of base-stock policies to establish a connection to stochastic convex bandit optimization. Our main contribution is a learning algorithm with a regret bound of ~O (L√T+D) for the inventory control problem. Here L is the fixed and known lead time, and D is an unknown parameter of the demand distribution described roughly as the number of time steps needed to generate enough demand for depleting one unit of inventory. 
Notably, even though the state space of the underlying MDP is continuous and L-dimensional, our regret bounds depend linearly on L. Our results significantly improve the previously best known regret bounds for this problem where the dependence on L was exponential and many further assumptions on demand distribution were required. The techniques presented here may be of independent interest for other settings that involve large structured MDPs but with convex cost functions.\",\"PeriodicalId\":416173,\"journal\":{\"name\":\"Proceedings of the 2019 ACM Conference on Economics and Computation\",\"volume\":\"85 4\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-05-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"39\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2019 ACM Conference on Economics and Computation\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3328526.3329565\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2019 ACM Conference on Economics and Computation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3328526.3329565","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 39

Abstract

We consider a stochastic inventory control problem under censored demands, lost sales, and positive lead times. This is a fundamental problem in inventory management, with significant literature establishing near-optimality of a simple class of policies called "base-stock policies" for the underlying Markov Decision Process (MDP), as well as convexity of the long-run average cost under those policies. We consider the relatively less studied problem of designing a learning algorithm for this setting when the underlying demand distribution is unknown. The goal is to bound the regret of the algorithm relative to the best base-stock policy. We utilize the convexity properties and a newly derived bound on the bias of base-stock policies to establish a connection to stochastic convex bandit optimization. Our main contribution is a learning algorithm with a regret bound of Õ(L√T + D) for the inventory control problem. Here L is the fixed and known lead time, and D is an unknown parameter of the demand distribution, described roughly as the number of time steps needed to generate enough demand to deplete one unit of inventory. Notably, even though the state space of the underlying MDP is continuous and L-dimensional, our regret bound depends only linearly on L. Our results significantly improve the previously best known regret bounds for this problem, in which the dependence on L was exponential and many further assumptions on the demand distribution were required. The techniques presented here may be of independent interest for other settings that involve large structured MDPs with convex cost functions.
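To make the setup concrete, the sketch below simulates a base-stock policy in a lost-sales inventory system with a fixed lead time. It is purely illustrative and not the paper's construction: the Poisson demand, the cost parameters, and the name simulate_base_stock are assumptions for this example (the paper treats the demand distribution as unknown, and the learner only observes censored sales, not the lost demand).

```python
import numpy as np

def simulate_base_stock(S, T, lead_time=2, holding_cost=1.0,
                        lost_sale_penalty=4.0, demand_mean=5.0, rng=None):
    """Estimate the average per-step cost of a base-stock policy with
    level S in a lost-sales system with a fixed lead time.

    Illustrative assumptions: Poisson demand and these cost parameters
    are NOT from the paper, which leaves the demand distribution unknown.
    """
    rng = rng or np.random.default_rng(0)
    on_hand = S                     # units currently on the shelf
    pipeline = [0] * lead_time      # orders placed but not yet delivered
    total_cost = 0.0
    for _ in range(T):
        # Base-stock rule: order up to S on the inventory *position*,
        # i.e., on-hand stock plus everything still in transit.
        position = on_hand + sum(pipeline)
        pipeline.append(max(0, S - position))
        on_hand += pipeline.pop(0)  # oldest order arrives after L steps
        demand = rng.poisson(demand_mean)
        sales = min(on_hand, demand)  # censored observation: we see sales,
        lost = demand - sales         # never the demand lost beyond stock
        on_hand -= sales
        total_cost += holding_cost * on_hand + lost_sale_penalty * lost
    return total_cost / T
```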
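The convexity of the long-run average cost in the base-stock level S is what enables a bandit-style search. As a rough illustration only, the trisection search below (reusing simulate_base_stock from the sketch above as a noisy cost oracle) repeatedly discards the worse third of the search interval. The paper's actual algorithm is considerably more careful: it must control the bias of finite-horizon policy evaluations and the regret incurred while probing.

```python
def trisection_search(cost_oracle, lo=0.0, hi=50.0,
                      rounds=12, steps_per_eval=2000):
    """Illustrative trisection over the base-stock level, exploiting
    convexity of the long-run average cost in S. Each probe runs the
    policy long enough to average out noise; this stands in for, and is
    much cruder than, the paper's convex-bandit-based algorithm."""
    for _ in range(rounds):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        c1 = cost_oracle(m1, steps_per_eval)
        c2 = cost_oracle(m2, steps_per_eval)
        if c1 <= c2:
            hi = m2  # by convexity, a minimizer lies in [lo, m2]
        else:
            lo = m1  # otherwise it lies in [m1, hi]
    return (lo + hi) / 2

# Usage: find a near-optimal base-stock level for the simulated system.
best_S = trisection_search(lambda S, T: simulate_base_stock(S, T))
print(f"estimated best base-stock level: {best_S:.1f}")
```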