{"title":"A Sequential Experience-driven Contextual Bandit Policy for MIMO TWAF Online Relay Selection","authors":"Ankit Gupta, M. Sellathurai, T. Ratnarajah","doi":"10.1109/spawc51304.2022.9834018","DOIUrl":null,"url":null,"abstract":"In this work, we derive sequential experience-driven contextual bandit (CB)-based policies for online relay selection in multiple-input multiple-output (MIMO) two-way amplify-and-forward (TWAF) relay networks, where the relays are provided with quantized, imperfect channel gain information. The proposed CB-based policies acquire information about the optimal relay node by resolving the exploration-versus-exploitation dilemma. In particular, we propose a linear upper confidence bound (LinUCB)-based CB policy and an adaptive active greedy (AAG)-based CB policy that uses active learning heuristics. Simulation results show that the proposed CB-based policies can reduce the feedback overhead by a factor of eight and the time cost by 70% while outperforming the best conventional Gram-Schmidt (GS) algorithm.","PeriodicalId":423807,"journal":{"name":"2022 IEEE 23rd International Workshop on Signal Processing Advances in Wireless Communication (SPAWC)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 23rd International Workshop on Signal Processing Advances in Wireless Communication (SPAWC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/spawc51304.2022.9834018","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citation count: 0
Abstract
In this work, we derive sequential experience-driven contextual bandit (CB)-based policies for online relay selection in multiple-input multiple-output (MIMO) two-way amplify-and-forward (TWAF) relay networks, where the relays are provided with quantized, imperfect channel gain information. The proposed CB-based policies acquire information about the optimal relay node by resolving the exploration-versus-exploitation dilemma. In particular, we propose a linear upper confidence bound (LinUCB)-based CB policy and an adaptive active greedy (AAG)-based CB policy that uses active learning heuristics. Simulation results show that the proposed CB-based policies can reduce the feedback overhead by a factor of eight and the time cost by 70% while outperforming the best conventional Gram-Schmidt (GS) algorithm.
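
To illustrate how a LinUCB-style policy trades exploration against exploitation when picking a relay, the following is a minimal sketch of the standard disjoint LinUCB rule, not the paper's exact policy. Each relay is treated as one arm with its own ridge-regression estimate. The context vectors (stand-ins for quantized channel gain features), the linear reward model, the noise level, and all names (`LinUCBArm`, `select_relay`, `simulate`) are assumptions introduced for illustration only.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


class LinUCBArm:
    """Ridge-regression state for one arm (assumed here: one arm = one relay)."""

    def __init__(self, dim):
        # A = I initially, so A_inv = I; b is the reward-weighted context sum.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def _A_inv_times(self, x):
        return [dot(row, x) for row in self.A_inv]

    def ucb(self, x, alpha):
        """Estimated reward plus an exploration bonus of alpha * sqrt(x^T A^-1 x)."""
        theta = self._A_inv_times(self.b)
        width = math.sqrt(max(dot(x, self._A_inv_times(x)), 0.0))
        return dot(theta, x) + alpha * width

    def update(self, x, reward):
        """Apply A <- A + x x^T via the Sherman-Morrison identity, and b <- b + r x."""
        Ax = self._A_inv_times(x)  # A_inv is symmetric, so this is also (x^T A_inv)^T
        denom = 1.0 + dot(x, Ax)
        n = len(x)
        for i in range(n):
            for j in range(n):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
            self.b[i] += reward * x[i]


def select_relay(arms, contexts, alpha):
    """Return the index of the arm with the largest upper confidence bound."""
    return max(range(len(arms)), key=lambda k: arms[k].ucb(contexts[k], alpha))


def simulate(rounds=500, n_relays=4, dim=3, alpha=1.0, seed=0):
    """Toy run on a hypothetical linear reward model; returns the fraction of
    rounds in which the relay with the highest expected reward was chosen."""
    rng = random.Random(seed)
    true_theta = [[rng.uniform(0, 1) for _ in range(dim)] for _ in range(n_relays)]
    arms = [LinUCBArm(dim) for _ in range(n_relays)]
    hits = 0
    for _ in range(rounds):
        # Hypothetical per-relay context, e.g. quantized channel gain features.
        contexts = [[rng.uniform(0, 1) for _ in range(dim)] for _ in range(n_relays)]
        chosen = select_relay(arms, contexts, alpha)
        expected = [dot(t, x) for t, x in zip(true_theta, contexts)]
        reward = expected[chosen] + rng.gauss(0, 0.1)  # only the chosen arm is observed
        arms[chosen].update(contexts[chosen], reward)
        hits += chosen == max(range(n_relays), key=expected.__getitem__)
    return hits / rounds


if __name__ == "__main__":
    print("fraction of best-relay selections:", simulate())
```

In this sketch only the selected relay returns a reward in each round. The parameter `alpha` scales the confidence width: larger values favor relays whose estimates are still uncertain, smaller values favor the relay that currently looks best.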