A New Framework: Short-Term and Long-Term Returns in Stochastic Multi-Armed Bandit
Authors: Abdalaziz Sawwan, Jie Wu
Venue: IEEE INFOCOM 2023 - IEEE Conference on Computer Communications
Publication date: 2023-05-17
DOI: https://doi.org/10.1109/INFOCOM53939.2023.10228899
The stochastic multi-armed bandit (MAB) problem has been studied widely in recent years because of its broad range of applications. The classic model assumes that the reward of a pulled arm is observed after a time delay sampled from a random distribution assigned to each arm. In this paper, we propose an extended framework in which a single pull of an arm yields both an instant (short-term) reward and a delayed (long-term) reward. The short-term and long-term reward distributions are linked by a relationship that is known in advance, and the delay distribution of an arm is independent of that arm's reward distributions. We devise three UCB-based algorithms for this new model, two of which achieve near-optimal regret, and we provide a regret analysis for each of them. The delay distributions are also allowed to yield an infinite delay, which corresponds to the case where an arm gives only a short-term reward. Finally, we evaluate our algorithms and compare this framework with previously known models on both a synthetic data set and a real data set that reflects one potential application of the model.
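To make the setting concrete, the following is a minimal simulation sketch of the framework described in the abstract. It is not one of the paper's three algorithms; it uses a plain UCB1-style index over whatever feedback has arrived. Everything specific here is an assumption chosen for illustration: Bernoulli rewards, a linear relationship `long_mean = c * short_mean`, exponentially distributed delays, a probability `p_never` that the long-term reward never arrives (the infinite-delay case), and the function name `simulate`.

```python
import heapq
import math
import random


def simulate(short_means, c=0.5, p_never=0.1, mean_delay=5.0,
             horizon=5000, seed=0):
    """Run a UCB1-style learner on a bandit with instant and delayed rewards.

    All modelling choices (Bernoulli rewards, long_mean = c * short_mean,
    exponential delays, p_never) are illustrative assumptions, not the
    paper's specification. Returns the number of pulls per arm.
    """
    rng = random.Random(seed)
    k = len(short_means)
    pulls = [0] * k          # times each arm was pulled
    short_sum = [0.0] * k    # sum of instant rewards per arm
    long_sum = [0.0] * k     # sum of delayed rewards that have arrived
    long_seen = [0] * k      # number of delayed rewards that have arrived
    pending = []             # min-heap of (arrival_round, arm, reward)

    def index(arm, t):
        # Empirical short-term mean plus the mean of the long-term rewards
        # observed so far (0 until the first one arrives), plus a UCB1 bonus.
        short_mean = short_sum[arm] / pulls[arm]
        long_mean = long_sum[arm] / long_seen[arm] if long_seen[arm] else 0.0
        bonus = math.sqrt(2.0 * math.log(t) / pulls[arm])
        return short_mean + long_mean + bonus

    for t in range(1, horizon + 1):
        # Deliver every delayed reward whose arrival round has been reached.
        while pending and pending[0][0] <= t:
            _, arm, reward = heapq.heappop(pending)
            long_sum[arm] += reward
            long_seen[arm] += 1

        # Pull each arm once, then follow the index.
        if t <= k:
            arm = t - 1
        else:
            arm = max(range(k), key=lambda a: index(a, t))

        pulls[arm] += 1
        # The short-term reward is observed immediately.
        short_sum[arm] += 1.0 if rng.random() < short_means[arm] else 0.0
        # The long-term reward is scheduled, unless the delay is infinite.
        if rng.random() >= p_never:
            delay = 1 + int(rng.expovariate(1.0 / mean_delay))
            reward = 1.0 if rng.random() < c * short_means[arm] else 0.0
            heapq.heappush(pending, (t + delay, arm, reward))

    return pulls


if __name__ == "__main__":
    print(simulate([0.2, 0.5, 0.8]))
```

Note that this sketch averages only the long-term rewards that actually arrive, so it does not account for the fraction lost to infinite delays, and it does not exploit the known relationship between the two reward distributions; both are simplifications of the model in the abstract.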