Ning Rao, Hua Xu, Zisen Qi, Dan Wang, Yue Zhang, Xiang Peng, Lei Jiang
{"title":"The pupil outdoes the master: Imperfect demonstration-assisted trust region jamming policy optimization against frequency-hopping spread spectrum","authors":"Ning Rao, Hua Xu, Zisen Qi, Dan Wang, Yue Zhang, Xiang Peng, Lei Jiang","doi":"10.1016/j.comcom.2024.107993","DOIUrl":null,"url":null,"abstract":"<div><div>Jamming decision-making is a pivotal component of modern electromagnetic warfare, wherein recent years have witnessed the extensive application of deep reinforcement learning techniques to enhance the autonomy and intelligence of wireless communication jamming decisions. However, existing researches heavily rely on manually designed customized jamming reward functions, leading to significant consumption of human and computational resources. To this end, under the premise of obviating designing task-customized reward functions, we propose a jamming policy optimization method that learns from imperfect demonstrations to effectively address the complex and high-dimensional jamming resource allocation problem against frequency hopping spread spectrum (FHSS) communication systems. To achieve this, a policy network is meticulously architected to consecutively ascertain jamming schemes for each jamming node, facilitating the construction of the dynamic transition within the Markov decision process. Subsequently, anchored in the dual-trust region concept, we design policy improvement and policy adversarial imitation phases. During the policy improvement phase, the trust region policy optimization method is utilized to refine the policy, while the policy adversarial imitation phase employs adversarial training to guide policy exploration using information embedded in demonstrations. Extensive simulation results indicate that our proposed method can approximate the optimal jamming performance trained under customized reward functions, even with rough binary reward settings, and also significantly surpass demonstration performance.</div></div>","PeriodicalId":55224,"journal":{"name":"Computer Communications","volume":"229 ","pages":"Article 107993"},"PeriodicalIF":4.5000,"publicationDate":"2024-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Communications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0140366424003402","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Jamming decision-making is a pivotal component of modern electromagnetic warfare, wherein recent years have witnessed the extensive application of deep reinforcement learning techniques to enhance the autonomy and intelligence of wireless communication jamming decisions. However, existing researches heavily rely on manually designed customized jamming reward functions, leading to significant consumption of human and computational resources. To this end, under the premise of obviating designing task-customized reward functions, we propose a jamming policy optimization method that learns from imperfect demonstrations to effectively address the complex and high-dimensional jamming resource allocation problem against frequency hopping spread spectrum (FHSS) communication systems. To achieve this, a policy network is meticulously architected to consecutively ascertain jamming schemes for each jamming node, facilitating the construction of the dynamic transition within the Markov decision process. Subsequently, anchored in the dual-trust region concept, we design policy improvement and policy adversarial imitation phases. During the policy improvement phase, the trust region policy optimization method is utilized to refine the policy, while the policy adversarial imitation phase employs adversarial training to guide policy exploration using information embedded in demonstrations. Extensive simulation results indicate that our proposed method can approximate the optimal jamming performance trained under customized reward functions, even with rough binary reward settings, and also significantly surpass demonstration performance.
期刊介绍:
Computer and Communications networks are key infrastructures of the information society with high socio-economic value as they contribute to the correct operations of many critical services (from healthcare to finance and transportation). Internet is the core of today''s computer-communication infrastructures. This has transformed the Internet, from a robust network for data transfer between computers, to a global, content-rich, communication and information system where contents are increasingly generated by the users, and distributed according to human social relations. Next-generation network technologies, architectures and protocols are therefore required to overcome the limitations of the legacy Internet and add new capabilities and services. The future Internet should be ubiquitous, secure, resilient, and closer to human communication paradigms.
Computer Communications is a peer-reviewed international journal that publishes high-quality scientific articles (both theory and practice) and survey papers covering all aspects of future computer communication networks (on all layers, except the physical layer), with a special attention to the evolution of the Internet architecture, protocols, services, and applications.