Sensitive data identification for multi-category and multi-scenario data
Yuning Cui, Yonghui Huang, Yongbing Bai, Yuchen Wang, Chao Wang
Transactions on Emerging Telecommunications Technologies, vol. 35, no. 5, published 2024-04-25
DOI: 10.1002/ett.4983
https://onlinelibrary.wiley.com/doi/10.1002/ett.4983
Journal impact factor: 2.5 (JCR Q3, Telecommunications)
Citation count: 0
Abstract
Sensitive data identification is a prerequisite for protecting critical user and business data. Traditional methods usually target only one type of application scenario or one type of data, which makes it difficult for them to meet the needs of enterprise-level data protection. This paper introduces the end-to-end sensitive data identification system of Beike Inc. The system consists of a data identification and annotation platform, a dataset management platform, and a sensitive data identification model, and it applies different governance methods to batch data and streaming data. Specifically, we propose a sliding window-based identification method for long text to improve the identification of streaming data. Evaluation results show that this method improves the identification of sensitive data in long text without losing performance on short text. On the open-source test dataset, the reported value reaches up to 94.15, which suggests that the method is applicable in diverse scenarios.
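The abstract does not describe how the sliding window is implemented, so the following is only a minimal sketch of the general idea, not the authors' model. It assumes a short-text identifier that returns character spans, and it applies that identifier to overlapping windows of a long text. The regex patterns, the function names `identify_short` and `identify_long`, and the `window` and `stride` values are all hypothetical choices made for illustration.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), offsets into the full text

# Hypothetical stand-in for a short-text identification model:
# two toy regex patterns, not the model described in the paper.
PATTERNS = {
    "PHONE": re.compile(r"\b\d{11}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}


def identify_short(text: str) -> List[Span]:
    """Return sensitive spans found in a short piece of text."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return spans


def identify_long(
    text: str,
    window: int = 64,
    stride: int = 32,
    identify: Callable[[str], List[Span]] = identify_short,
) -> List[Span]:
    """Run `identify` on overlapping windows and merge the results."""
    if not 0 < stride <= window:
        raise ValueError("stride must be in the range 1..window")
    found = set()  # a set removes duplicates reported by overlapping windows
    start = 0
    while True:
        chunk = text[start:start + window]
        for s, e, label in identify(chunk):
            # Shift window-local offsets back to full-text offsets.
            found.add((s + start, e + start, label))
        if start + window >= len(text):
            break
        start += stride
    spans = sorted(found)
    # Drop fragments cut off at a window edge when another window
    # reported a larger span with the same label that contains them.
    return [
        s for s in spans
        if not any(
            o != s and o[2] == s[2] and o[0] <= s[0] and s[1] <= o[1]
            for o in spans
        )
    ]
```

Text shorter than `window` is handled in a single pass, so behaviour on short text is unchanged in this sketch. The overlap (`window - stride`) bounds the longest item that is guaranteed to appear whole in at least one window, so it would need to be tuned to the data.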
About the journal:
Transactions on Emerging Telecommunications Technologies (ETT), formerly known as European Transactions on Telecommunications (ETT), has the following aims:
- to attract cutting-edge publications from leading researchers and research groups around the world
- to become a highly cited source of timely research findings in emerging fields of telecommunications
- to limit revision and publication cycles to a few months and thus significantly increase attractiveness to publish
- to become the leading journal for publishing the latest developments in telecommunications