DTITD: An Intelligent Insider Threat Detection Framework Based on Digital Twin and Self-Attention Based Deep Learning Models

IF 3.4 | CAS Tier 3 (Computer Science) | JCR Q2 (Computer Science, Information Systems) | IEEE Access | Pub Date: 2023-10-13 | DOI: 10.1109/ACCESS.2023.3324371
Zhi Qiang Wang;Abdulmotaleb El Saddik
{"title":"DTITD:一种基于数字孪生和基于自注意的深度学习模型的智能内幕威胁检测框架","authors":"Zhi Qiang Wang;Abdulmotaleb El Saddik","doi":"10.1109/ACCESS.2023.3324371","DOIUrl":null,"url":null,"abstract":"Recent statistics and studies show that the loss generated by insider threats is much higher than that generated by external attacks. More and more organizations are investing in or purchasing insider threat detection systems to prevent insider risks. However, the accurate and timely detection of insider threats faces significant challenges. In this study, we proposed an intelligent insider threat detection framework based on Digital Twins and self-attentions based deep learning models. First, this paper introduces insider threats and the challenges in detecting them. Then this paper presents recent related works on solving insider threat detection problems and their limitations. Next, we propose our solutions to address these challenges: building an innovative intelligent insider threat detection framework based on Digital Twin (DT) and self-attention based deep learning models, performing insight analysis of users’ behavior and entities, adopting contextual word embedding techniques using Bidirectional Encoder Representations from Transformers (BERT) model and sentence embedding technique using Generative Pre-trained Transformer 2 (GPT-2) model to perform data augmentation to overcome significant data imbalance, and adopting temporal semantic representation of users’ behaviors to build user behavior time sequences. Subsequently, this study built self-attention-based deep learning models to quickly detect insider threats. This study proposes a simplified transformer model named DistilledTrans and applies the original transformer model, DistilledTrans, BERT + final layer, Robustly Optimized BERT Approach (RoBERTa) + final layer, and a hybrid method combining pre-trained (BERT, RoBERTa) with a Convolutional Neural Network (CNN) or Long Short-term Memory (LSTM) network model to detect insider threats. Finally, this paper presents experimental results on a dense dataset CERT r4.2 and augmented sporadic dataset CERT r6.2, evaluates their performance, and performs a comparison analysis with state-of-the-art models. Promising experimental results show that 1) contextual word embedding insert and substitution predicted by the BERT model, and context embedding sentences predicted by the GPT-2 model are effective data augmentation approaches to address high data imbalance; 2) DistilledTrans trained with sporadic dataset CERT r6.2 augmented by the contextual embedding sentence method predicted by GPT-2, outperforms the state-of-the-art models in terms of all evaluation metrics, including accuracy, precision, recall, F1-score, and Area Under the ROC Curve (AUC). Additionally, its structure is much simpler, and thus training time and computing cost are much less than those of recent models; 3) when trained with the dense dataset CERT r4.2, pre-trained models BERT plus a final layer or RoBERTa plus a final layer can achieve significantly higher performance than the current models with a very little sacrifice of precision. 
However, complex hybrid methods may not be required.","PeriodicalId":13079,"journal":{"name":"IEEE Access","volume":"11 ","pages":"114013-114030"},"PeriodicalIF":3.4000,"publicationDate":"2023-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/6287639/10005208/10285086.pdf","citationCount":"0","resultStr":"{\"title\":\"DTITD: An Intelligent Insider Threat Detection Framework Based on Digital Twin and Self-Attention Based Deep Learning Models\",\"authors\":\"Zhi Qiang Wang;Abdulmotaleb El Saddik\",\"doi\":\"10.1109/ACCESS.2023.3324371\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent statistics and studies show that the loss generated by insider threats is much higher than that generated by external attacks. More and more organizations are investing in or purchasing insider threat detection systems to prevent insider risks. However, the accurate and timely detection of insider threats faces significant challenges. In this study, we proposed an intelligent insider threat detection framework based on Digital Twins and self-attentions based deep learning models. First, this paper introduces insider threats and the challenges in detecting them. Then this paper presents recent related works on solving insider threat detection problems and their limitations. Next, we propose our solutions to address these challenges: building an innovative intelligent insider threat detection framework based on Digital Twin (DT) and self-attention based deep learning models, performing insight analysis of users’ behavior and entities, adopting contextual word embedding techniques using Bidirectional Encoder Representations from Transformers (BERT) model and sentence embedding technique using Generative Pre-trained Transformer 2 (GPT-2) model to perform data augmentation to overcome significant data imbalance, and adopting temporal semantic representation of users’ behaviors to build user behavior time sequences. Subsequently, this study built self-attention-based deep learning models to quickly detect insider threats. This study proposes a simplified transformer model named DistilledTrans and applies the original transformer model, DistilledTrans, BERT + final layer, Robustly Optimized BERT Approach (RoBERTa) + final layer, and a hybrid method combining pre-trained (BERT, RoBERTa) with a Convolutional Neural Network (CNN) or Long Short-term Memory (LSTM) network model to detect insider threats. Finally, this paper presents experimental results on a dense dataset CERT r4.2 and augmented sporadic dataset CERT r6.2, evaluates their performance, and performs a comparison analysis with state-of-the-art models. Promising experimental results show that 1) contextual word embedding insert and substitution predicted by the BERT model, and context embedding sentences predicted by the GPT-2 model are effective data augmentation approaches to address high data imbalance; 2) DistilledTrans trained with sporadic dataset CERT r6.2 augmented by the contextual embedding sentence method predicted by GPT-2, outperforms the state-of-the-art models in terms of all evaluation metrics, including accuracy, precision, recall, F1-score, and Area Under the ROC Curve (AUC). 
Additionally, its structure is much simpler, and thus training time and computing cost are much less than those of recent models; 3) when trained with the dense dataset CERT r4.2, pre-trained models BERT plus a final layer or RoBERTa plus a final layer can achieve significantly higher performance than the current models with a very little sacrifice of precision. However, complex hybrid methods may not be required.\",\"PeriodicalId\":13079,\"journal\":{\"name\":\"IEEE Access\",\"volume\":\"11 \",\"pages\":\"114013-114030\"},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2023-10-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ieeexplore.ieee.org/iel7/6287639/10005208/10285086.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Access\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10285086/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Access","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10285086/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Recent statistics and studies show that the losses generated by insider threats are much higher than those generated by external attacks. More and more organizations are investing in or purchasing insider threat detection systems to mitigate insider risks. However, accurate and timely detection of insider threats faces significant challenges. In this study, we propose an intelligent insider threat detection framework based on Digital Twin (DT) and self-attention-based deep learning models. First, this paper introduces insider threats and the challenges in detecting them. It then reviews recent related work on insider threat detection and its limitations. Next, we propose our solutions to these challenges: building an innovative intelligent insider threat detection framework based on DT and self-attention-based deep learning models; performing insight analysis of user behavior and entities; adopting a contextual word embedding technique using the Bidirectional Encoder Representations from Transformers (BERT) model and a sentence embedding technique using the Generative Pre-trained Transformer 2 (GPT-2) model for data augmentation to overcome significant data imbalance; and adopting a temporal semantic representation of user behavior to build user behavior time sequences. Subsequently, this study builds self-attention-based deep learning models to quickly detect insider threats. It proposes a simplified transformer model named DistilledTrans and applies the original transformer model, DistilledTrans, BERT + final layer, Robustly Optimized BERT Approach (RoBERTa) + final layer, and a hybrid method combining a pre-trained model (BERT, RoBERTa) with a Convolutional Neural Network (CNN) or Long Short-Term Memory (LSTM) network to detect insider threats. Finally, this paper presents experimental results on the dense dataset CERT r4.2 and the augmented sporadic dataset CERT r6.2, evaluates performance, and compares it with state-of-the-art models. Promising experimental results show that 1) contextual word insertion and substitution predicted by the BERT model, and context-embedding sentences predicted by the GPT-2 model, are effective data augmentation approaches for addressing high data imbalance; 2) DistilledTrans, trained on the sporadic dataset CERT r6.2 augmented by the GPT-2 context-embedding sentence method, outperforms state-of-the-art models on all evaluation metrics, including accuracy, precision, recall, F1-score, and Area Under the ROC Curve (AUC), while its structure is much simpler, so its training time and computing cost are much lower than those of recent models; 3) when trained on the dense dataset CERT r4.2, the pre-trained models BERT plus a final layer or RoBERTa plus a final layer achieve significantly higher performance than current models with very little sacrifice of precision, so complex hybrid methods may not be required.
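The abstract's first finding credits contextual word insertion and substitution predicted by BERT as an effective data-augmentation approach. The paper's exact pipeline is not reproduced on this page, so the following is a minimal illustrative sketch of the substitution variant, assuming the Hugging Face transformers library, a generic bert-base-uncased checkpoint, and an invented user-behavior sentence:

```python
# Illustrative sketch (not the authors' code): contextual word
# substitution with BERT's masked-language-model head. One token of a
# behavior "sentence" is masked, and BERT proposes in-context
# replacements, yielding extra minority-class training samples.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def substitute_token(tokens, position, top_k=3):
    """Return up to top_k augmented sentences with the token at
    `position` replaced by BERT's contextual predictions."""
    masked = list(tokens)
    masked[position] = fill_mask.tokenizer.mask_token
    candidates = fill_mask(" ".join(masked), top_k=top_k)
    return [c["sequence"] for c in candidates]

# Hypothetical behavior sentence; vary the action word ("copied").
print(substitute_token(["employee", "copied", "payroll", "files", "after", "hours"], 1))
```

An insertion variant works the same way, except the mask token is spliced between two existing tokens rather than replacing one.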
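The second augmentation method, context-embedding sentences predicted by GPT-2, can be sketched in the same spirit: sample continuations of a rare-class seed sentence to synthesize additional minority-class examples. The checkpoint, decoding settings, and seed text below are assumptions, not the authors' settings:

```python
# Illustrative sketch (not the authors' code): GPT-2 sentence-level
# augmentation by sampling context-consistent continuations.
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="gpt2")
set_seed(42)  # make the sampled augmentations reproducible

def generate_variants(seed_sentence, n=3, max_new_tokens=24):
    """Sample n continuations of a minority-class behavior sentence."""
    outputs = generator(seed_sentence, num_return_sequences=n,
                        max_new_tokens=max_new_tokens, do_sample=True)
    return [o["generated_text"] for o in outputs]

# Hypothetical seed describing anomalous after-hours activity.
print(generate_variants("After logging in at 2 a.m., the user"))
```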
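The "pre-trained model plus a final layer" detectors the abstract reports (BERT + final layer, RoBERTa + final layer) amount to a sequence classifier over temporal behavior "sentences". A hedged sketch with BERT and a binary head follows; the label scheme and maximum length are assumptions, and the classification head would need fine-tuning on labeled CERT sequences before its scores are meaningful:

```python
# Illustrative sketch (not the authors' code): BERT with a binary
# classification head scoring one user-day behavior sequence.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # assumed: 0 = benign, 1 = threat
model.eval()  # head is randomly initialized until fine-tuned

def threat_probability(behavior_sentence):
    """Return P(insider threat) for one behavior 'sentence'."""
    inputs = tokenizer(behavior_sentence, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()
```

Swapping "bert-base-uncased" for a RoBERTa checkpoint such as "roberta-base" gives the RoBERTa + final layer variant.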
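The five evaluation metrics the abstract reports can be computed with scikit-learn as below; y_true and y_score are placeholders for held-out labels and model scores:

```python
# Accuracy, precision, recall, F1-score, and AUC, as named in the
# abstract, computed with scikit-learn on placeholder inputs.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

def evaluate(y_true, y_score, threshold=0.5):
    """Threshold scores into labels and report the five metrics."""
    y_pred = [int(s >= threshold) for s in y_score]
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
        "auc":       roc_auc_score(y_true, y_score),  # uses raw scores
    }
```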
Source journal
IEEE Access
Categories: Computer Science, Information Systems; Engineering, Electrical & Electronic
CiteScore: 9.80
Self-citation rate: 7.70%
Articles per year: 6673
Review time: 6 weeks
Journal introduction: IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest. IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on: multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals; practical articles discussing new experiments or measurement techniques, or interesting solutions to engineering problems; development of new or improved fabrication or manufacturing techniques; and reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.