Alec Wilson, William Holmes, Ryan Menzies, Kez Smithson Whitehead
{"title":"应用动作屏蔽和课程学习技术,利用强化学习提高操作技术网络安全的数据效率和整体性能","authors":"Alec Wilson, William Holmes, Ryan Menzies, Kez Smithson Whitehead","doi":"arxiv-2409.10563","DOIUrl":null,"url":null,"abstract":"In previous work, the IPMSRL environment (Integrated Platform Management\nSystem Reinforcement Learning environment) was developed with the aim of\ntraining defensive RL agents in a simulator representing a subset of an IPMS on\na maritime vessel under a cyber-attack. This paper extends the use of IPMSRL to\nenhance realism including the additional dynamics of false positive alerts and\nalert delay. Applying curriculum learning, in the most difficult environment\ntested, resulted in an episode reward mean increasing from a baseline result of\n-2.791 to -0.569. Applying action masking, in the most difficult environment\ntested, resulted in an episode reward mean increasing from a baseline result of\n-2.791 to -0.743. Importantly, this level of performance was reached in less\nthan 1 million timesteps, which was far more data efficient than vanilla PPO\nwhich reached a lower level of performance after 2.5 million timesteps. The\ntraining method which resulted in the highest level of performance observed in\nthis paper was a combination of the application of curriculum learning and\naction masking, with a mean episode reward of 0.137. This paper also introduces\na basic hardcoded defensive agent encoding a representation of cyber security\nbest practice, which provides context to the episode reward mean figures\nreached by the RL agents. The hardcoded agent managed an episode reward mean of\n-1.895. This paper therefore shows that applications of curriculum learning and\naction masking, both independently and in tandem, present a way to overcome the\ncomplex real-world dynamics that are present in operational technology cyber\nsecurity threat remediation.","PeriodicalId":501332,"journal":{"name":"arXiv - CS - Cryptography and Security","volume":"26 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Applying Action Masking and Curriculum Learning Techniques to Improve Data Efficiency and Overall Performance in Operational Technology Cyber Security using Reinforcement Learning\",\"authors\":\"Alec Wilson, William Holmes, Ryan Menzies, Kez Smithson Whitehead\",\"doi\":\"arxiv-2409.10563\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In previous work, the IPMSRL environment (Integrated Platform Management\\nSystem Reinforcement Learning environment) was developed with the aim of\\ntraining defensive RL agents in a simulator representing a subset of an IPMS on\\na maritime vessel under a cyber-attack. This paper extends the use of IPMSRL to\\nenhance realism including the additional dynamics of false positive alerts and\\nalert delay. Applying curriculum learning, in the most difficult environment\\ntested, resulted in an episode reward mean increasing from a baseline result of\\n-2.791 to -0.569. Applying action masking, in the most difficult environment\\ntested, resulted in an episode reward mean increasing from a baseline result of\\n-2.791 to -0.743. Importantly, this level of performance was reached in less\\nthan 1 million timesteps, which was far more data efficient than vanilla PPO\\nwhich reached a lower level of performance after 2.5 million timesteps. 
The\\ntraining method which resulted in the highest level of performance observed in\\nthis paper was a combination of the application of curriculum learning and\\naction masking, with a mean episode reward of 0.137. This paper also introduces\\na basic hardcoded defensive agent encoding a representation of cyber security\\nbest practice, which provides context to the episode reward mean figures\\nreached by the RL agents. The hardcoded agent managed an episode reward mean of\\n-1.895. This paper therefore shows that applications of curriculum learning and\\naction masking, both independently and in tandem, present a way to overcome the\\ncomplex real-world dynamics that are present in operational technology cyber\\nsecurity threat remediation.\",\"PeriodicalId\":501332,\"journal\":{\"name\":\"arXiv - CS - Cryptography and Security\",\"volume\":\"26 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Cryptography and Security\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.10563\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Cryptography and Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.10563","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Applying Action Masking and Curriculum Learning Techniques to Improve Data Efficiency and Overall Performance in Operational Technology Cyber Security using Reinforcement Learning
In previous work, the IPMSRL environment (Integrated Platform Management
System Reinforcement Learning environment) was developed with the aim of
training defensive RL agents in a simulator representing a subset of an IPMS on
a maritime vessel under a cyber-attack. This paper extends the use of IPMSRL to
enhance realism, adding the dynamics of false positive alerts and alert delays.
In the most difficult environment tested, applying curriculum learning increased
the mean episode reward from a baseline of -2.791 to -0.569. Applying action
masking in the same environment increased the mean episode reward from -2.791 to
-0.743. Importantly, this level of performance was reached in fewer than 1
million timesteps, making it far more data-efficient than vanilla PPO, which
reached a lower level of performance only after 2.5 million timesteps. The
best-performing training method observed in this paper combined curriculum
learning with action masking, achieving a mean episode reward of 0.137. This
paper also introduces a basic hardcoded defensive agent encoding a
representation of cyber security best practice, which provides context for the
mean episode reward figures reached by the RL agents. The hardcoded agent
achieved a mean episode reward of -1.895. This paper therefore shows that
curriculum learning and action masking, applied both independently and in
tandem, offer a way to overcome the complex real-world dynamics present in
operational technology cyber security threat remediation.
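
To make the action-masking idea concrete, the following is a minimal sketch of how invalid defensive actions can be excluded at sampling time by driving their logits to negative infinity before the softmax. The action-space size, the mask source, and the function name are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

NUM_ACTIONS = 8  # assumed size of the defensive action space (illustrative)

def masked_sample(logits: np.ndarray, action_mask: np.ndarray) -> int:
    """Sample an action, giving invalid actions zero probability by setting
    their logits to -inf before the softmax."""
    masked_logits = np.where(action_mask, logits, -np.inf)
    # Numerically stable softmax over the valid actions only
    masked_logits = masked_logits - masked_logits[action_mask].max()
    probs = np.exp(masked_logits)
    probs = probs / probs.sum()
    return int(np.random.choice(len(logits), p=probs))

# Example: suppose only actions 0, 2 and 5 are currently valid
logits = np.random.randn(NUM_ACTIONS)
mask = np.zeros(NUM_ACTIONS, dtype=bool)
mask[[0, 2, 5]] = True
print(masked_sample(logits, mask))
```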
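Curriculum learning, as used here, trains the agent on progressively harder environment configurations. The sketch below assumes a hypothetical IPMSRL-like difficulty schedule driven by false-positive rate and alert delay, with a promotion threshold on mean episode reward; all parameter values and names are illustrative, not the paper's actual configuration.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    false_positive_rate: float  # probability an alert is spurious
    alert_delay: int            # timesteps before an alert is raised

# Stages ordered from easiest to hardest (illustrative values)
CURRICULUM = [
    Stage(false_positive_rate=0.0, alert_delay=0),
    Stage(false_positive_rate=0.1, alert_delay=2),
    Stage(false_positive_rate=0.3, alert_delay=5),
]
PROMOTION_THRESHOLD = -0.5  # assumed mean episode reward needed to advance

def next_stage(current: int, mean_episode_reward: float) -> int:
    """Advance to the next curriculum stage once the agent performs well
    enough on the current one; otherwise stay at the current stage."""
    if mean_episode_reward >= PROMOTION_THRESHOLD and current < len(CURRICULUM) - 1:
        return current + 1
    return current

stage = 0
stage = next_stage(stage, mean_episode_reward=-0.2)  # would advance to stage 1
print(CURRICULUM[stage])
```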
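The hardcoded baseline agent mentioned in the abstract can be pictured as a simple rule-based policy that encodes a best-practice priority ordering. The sketch below is only an assumed illustration of that idea; the observation keys and action names are hypothetical and do not reflect the paper's actual agent.

```python
def hardcoded_policy(observation: dict) -> str:
    """Encode a simple best-practice priority: contain confirmed compromises
    first, investigate unconfirmed alerts next, otherwise do nothing."""
    if observation.get("compromised_nodes"):
        return "isolate_node"       # contain the confirmed threat first
    if observation.get("active_alerts"):
        return "investigate_alert"  # triage alerts before acting further
    return "no_op"                  # avoid unnecessary service disruption

# Example observation with one unresolved alert and no confirmed compromise
print(hardcoded_policy({"compromised_nodes": [], "active_alerts": ["sensor_3"]}))
```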