Manping Guo, Yiming Wang, Qiaoning Yang, Rui Li, Yang Zhao, Chenfei Li, Mingbo Zhu, Yao Cui, Xin Jiang, Song Sheng, Qingna Li, Rui Gao
{"title":"面向真实世界数据的正常工作流程和数据清理的关键策略:观点。","authors":"Manping Guo, Yiming Wang, Qiaoning Yang, Rui Li, Yang Zhao, Chenfei Li, Mingbo Zhu, Yao Cui, Xin Jiang, Song Sheng, Qingna Li, Rui Gao","doi":"10.2196/44310","DOIUrl":null,"url":null,"abstract":"<p><p>With the rapid development of science, technology, and engineering, large amounts of data have been generated in many fields in the past 20 years. In the process of medical research, data are constantly generated, and large amounts of real-world data form a \"data disaster.\" Effective data analysis and mining are based on data availability and high data quality. The premise of high data quality is the need to clean the data. Data cleaning is the process of detecting and correcting \"dirty data,\" which is the basis of data analysis and management. Moreover, data cleaning is a common technology for improving data quality. However, the current literature on real-world research provides little guidance on how to efficiently and ethically set up and perform data cleaning. To address this issue, we proposed a data cleaning framework for real-world research, focusing on the 3 most common types of dirty data (duplicate, missing, and outlier data), and a normal workflow for data cleaning to serve as a reference for the application of such technologies in future studies. 
We also provided relevant suggestions for common problems in data cleaning.</p>","PeriodicalId":51757,"journal":{"name":"Interactive Journal of Medical Research","volume":"12 ","pages":"e44310"},"PeriodicalIF":1.9000,"publicationDate":"2023-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10557005/pdf/","citationCount":"0","resultStr":"{\"title\":\"Normal Workflow and Key Strategies for Data Cleaning Toward Real-World Data: Viewpoint.\",\"authors\":\"Manping Guo, Yiming Wang, Qiaoning Yang, Rui Li, Yang Zhao, Chenfei Li, Mingbo Zhu, Yao Cui, Xin Jiang, Song Sheng, Qingna Li, Rui Gao\",\"doi\":\"10.2196/44310\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>With the rapid development of science, technology, and engineering, large amounts of data have been generated in many fields in the past 20 years. In the process of medical research, data are constantly generated, and large amounts of real-world data form a \\\"data disaster.\\\" Effective data analysis and mining are based on data availability and high data quality. The premise of high data quality is the need to clean the data. Data cleaning is the process of detecting and correcting \\\"dirty data,\\\" which is the basis of data analysis and management. Moreover, data cleaning is a common technology for improving data quality. However, the current literature on real-world research provides little guidance on how to efficiently and ethically set up and perform data cleaning. To address this issue, we proposed a data cleaning framework for real-world research, focusing on the 3 most common types of dirty data (duplicate, missing, and outlier data), and a normal workflow for data cleaning to serve as a reference for the application of such technologies in future studies. 
We also provided relevant suggestions for common problems in data cleaning.</p>\",\"PeriodicalId\":51757,\"journal\":{\"name\":\"Interactive Journal of Medical Research\",\"volume\":\"12 \",\"pages\":\"e44310\"},\"PeriodicalIF\":1.9000,\"publicationDate\":\"2023-09-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10557005/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Interactive Journal of Medical Research\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.2196/44310\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"MEDICINE, RESEARCH & EXPERIMENTAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Interactive Journal of Medical Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/44310","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"MEDICINE, RESEARCH & EXPERIMENTAL","Score":null,"Total":0}
Normal Workflow and Key Strategies for Data Cleaning Toward Real-World Data: Viewpoint.
With the rapid development of science, technology, and engineering, large amounts of data have been generated across many fields over the past 20 years. Medical research continuously produces data, and the resulting volume of real-world data amounts to a "data disaster." Effective data analysis and mining depend on data availability and high data quality, and high data quality in turn requires that the data first be cleaned. Data cleaning, the process of detecting and correcting "dirty data," is the foundation of data analysis and management and a common technique for improving data quality. However, the current literature on real-world research provides little guidance on how to efficiently and ethically set up and perform data cleaning. To address this gap, we proposed a data cleaning framework for real-world research that focuses on the 3 most common types of dirty data (duplicate, missing, and outlier data), along with a normal workflow for data cleaning to serve as a reference for applying these techniques in future studies. We also provided suggestions for common problems encountered during data cleaning.
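To make the three dirty-data types concrete, the sketch below shows one common handling strategy for each on a hypothetical patient-visit table. This is an illustrative pandas example, not the authors' framework; the column names, the median imputation, and the 1.5 × IQR outlier rule are all assumptions chosen for the sketch.

```python
import pandas as pd
import numpy as np

# Hypothetical patient-visit records exhibiting all three dirty-data types:
# a duplicated row, a missing value (NaN), and an implausible outlier (400 mmHg).
df = pd.DataFrame({
    "patient_id": [1, 2, 2, 3, 4, 5],
    "systolic_bp": [120, 135, 135, np.nan, 118, 400],
})

# 1. Duplicate data: detect and drop exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# 2. Missing data: impute with the column median (one common strategy;
#    the right choice depends on the missingness mechanism).
df["systolic_bp"] = df["systolic_bp"].fillna(df["systolic_bp"].median())

# 3. Outlier data: flag values outside 1.5 * IQR of the quartiles and
#    set them to NaN so they can be reviewed rather than silently kept.
q1, q3 = df["systolic_bp"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["systolic_bp"] < q1 - 1.5 * iqr) | (df["systolic_bp"] > q3 + 1.5 * iqr)
df.loc[mask, "systolic_bp"] = np.nan

print(df)
```

Note that the outlier step deliberately reintroduces NaN instead of deleting rows: in real-world data, flagged values are typically queried back to the source before any correction is made.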