Generating and Evaluating Synthetic UK Primary Care Data: Preserving Data Utility & Patient Privacy
Authors: Zhenchen Wang, P. Myles, A. Tucker
Published in: 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS), 2019-06-01
DOI: https://doi.org/10.1109/CBMS.2019.00036
Citation count: 25
Abstract
There is increasing interest in the potential of synthetic data to validate and benchmark machine learning algorithms, as well as to reveal biases in the real-world data used for algorithm development. This paper discusses the key requirements of synthetic data for such purposes and proposes a framework for generating and evaluating synthetic data that meets them, with the aim of preserving the complexities of the ground truth data whilst also ensuring privacy. As a case study, we include a proof-of-concept synthetic dataset modelled on UK primary care data to demonstrate the application of this framework.