Y. Vasilev, T. Bobrovskaya, K. Arzamasov, S. Chetverikov, A. Vladzymyrskyy, O. Omelyanskaya, A. Andreychenko, N. Pavlov, L. N. Anishchenko
Journal: Manager Zdravookhranenia. Published: 2023-06-07. DOI: 10.21045/1811-0185-2023-4-28-41 (https://doi.org/10.21045/1811-0185-2023-4-28-41)
Medical datasets for machine learning: fundamental principles of standardization and systematization
Background: The active implementation of artificial intelligence technologies in healthcare in recent years has increased the amount of medical data available for developing machine learning models, including radiology and instrumental diagnostics data. To solve various problems of digital medical technologies, new datasets are being created through machine learning algorithms; as a result, the systematization, standardization, storage, access, and rational and safe use of these datasets have become pressing problems.
Aim: To develop an approach to systematizing and standardizing information about datasets in order to represent, store, and apply datasets, optimize their use, and ensure the safety and transparency of the development and testing of medical devices that use artificial intelligence.
Materials and methods: Analysis of the authors' own and international experience in creating and using medical datasets; search and analysis of medical reference books; development and justification of the registry structure; and a search for scientific publications indexed in the RSCI, Scopus, and Web of Science databases using the keywords “datasets” and “registry of medical data”.
Results: The structure of a registry of medical instrumental diagnostics datasets was developed in accordance with the stages of the dataset lifecycle: 7 parameters at the initiation stage, 8 at the planning stage, 70 in the dataset card, 1 for version change, and 14 at the use stage, for a total of 100 parameters. We propose a classification of datasets according to the purpose of their creation, a classification of data verification methods, and principles for forming dataset names that support standardization and clear presentation. In addition, the main aspects of organizing the maintenance of this registry are highlighted: management, data quality, confidentiality, and security.
Conclusions: For the first time, an original technology for structuring and systematizing medical datasets for instrumental diagnostics is proposed. It is based on the developed terminology and principles of information classification. This makes it possible to standardize the structure of information about datasets for machine learning and to centralize its storage. It also provides quick access to all information about a dataset and ensures the transparency, reliability, and reproducibility of artificial intelligence developments. Creating a registry makes it possible to quickly form visual data libraries, which allows a wide range of researchers, developers, and companies to choose datasets for their tasks. This approach promotes the widespread use of datasets, optimizes resources, and contributes to the rapid development and implementation of artificial intelligence.
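The lifecycle-based registry structure reported in the Results can be illustrated with a minimal Python sketch. Only the five stage names and their parameter counts come from the abstract; the `RegistryEntry` class, its fields, and the `completeness` helper are hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass, field

# Parameter counts per lifecycle stage, as stated in the abstract.
STAGE_PARAMETER_COUNTS = {
    "initiation": 7,
    "planning": 8,
    "dataset_card": 70,
    "version_change": 1,
    "use": 14,
}


@dataclass
class RegistryEntry:
    """Hypothetical registry record: one dict of parameters per lifecycle stage."""

    name: str
    stages: dict = field(
        default_factory=lambda: {stage: {} for stage in STAGE_PARAMETER_COUNTS}
    )

    def completeness(self) -> float:
        """Fraction of the expected parameters that have been filled in."""
        filled = sum(len(params) for params in self.stages.values())
        return filled / sum(STAGE_PARAMETER_COUNTS.values())


# The stage counts add up to the total reported in the abstract.
total_parameters = sum(STAGE_PARAMETER_COUNTS.values())
```

Keeping the stage counts in one mapping makes it straightforward to check that they sum to the reported 100 parameters and to track how much of an entry has been completed.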