Establishing of big data clinical dataset in brain vessel aneurysm research

Sibirskii nauchnyi meditsinskii zhurnal (Q4, Biochemistry, Genetics and Molecular Biology). Pub Date: 2023-06-23. DOI: 10.18699/ssmj20230311

Ju. V. Kivelev, I. Saarenpää, A. Krivoshapkin



Abstract

The variability and heterogeneity of digital medical data require modern algorithms that provide appropriate data processing. The aim of the study was to delineate the main steps in forming a clinical dataset of patients with brain aneurysms, from producing the primary mining specifications to forming the final version.

Material and methods. Data collection, crosschecking of cases, and analysis of the dataset were carried out at Turku University Hospital. Over the last two decades, the available medical data at our hospital have been stored in a digital data lake, which allows automated data mining. In our study, data mining was performed by a data scientist using R software. Inclusion criteria were based on a set of diagnoses coded in medical charts according to the International Classification of Diseases (ICD-10).

Results and Discussion. Primary data mining identified 3850 patients with brain aneurysms treated at our hospital from January 2000 to May 2018. After independent manual crosschecking of these patients' medical charts, we found 1218 (32 %) cases that had no aneurysm (false positives). Data from the remaining true aneurysm cases were divided into clinical and intensive care unit subsets, in which every event linked to a particular date of treatment was defined as an info-unit. All data in both subsets were structured into separate Excel files and presented in chronological order for each patient. Altogether, the dataset included 70 000 000 rows of info-units from 2632 patients.

Conclusions. Data mining allowed the establishment of a detailed clinical dataset of patients with brain aneurysms. The mining algorithm was limited by false-positive cases (32 % of patients). We therefore recommend manual crosschecking of automatically collected datasets before statistical analysis.
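The workflow described in the abstract (code-based mining, manual crosschecking, then chronological structuring of info-units) can be illustrated with a minimal sketch. It is written in Python for illustration only; the study itself used R, and this is not the authors' code. The ICD-10 prefixes, the `InfoUnit` fields, and all function names are assumptions, since the abstract does not list the actual codes or data layout.

```python
from dataclasses import dataclass
from datetime import date

# Assumed prefixes for illustration only; the abstract does not say which
# ICD-10 codes formed the inclusion criteria.
ANEURYSM_ICD10_PREFIXES = ("I60", "I67.1")


@dataclass(frozen=True)
class InfoUnit:
    """One dated event in a patient's record (hypothetical layout)."""
    patient_id: str
    event_date: date
    subset: str  # "clinical" or "icu"
    description: str


def mine_candidates(diagnoses):
    """Return ids of patients with at least one matching ICD-10 code."""
    return {
        pid
        for pid, codes in diagnoses.items()
        if any(code.startswith(ANEURYSM_ICD10_PREFIXES) for code in codes)
    }


def false_positive_share(n_mined, n_false_positive):
    """Fraction of mined patients rejected by manual chart review."""
    return n_false_positive / n_mined


def build_timelines(units, confirmed_ids):
    """Group info-units by (patient, subset) and sort each group by date."""
    timelines = {}
    for unit in units:
        if unit.patient_id in confirmed_ids:
            timelines.setdefault((unit.patient_id, unit.subset), []).append(unit)
    for events in timelines.values():
        events.sort(key=lambda u: u.event_date)
    return timelines


# Toy data: p2 matches a code but is rejected on manual review.
diagnoses = {"p1": ["I67.1"], "p2": ["I60.2", "I10"], "p3": ["I10"]}
candidates = mine_candidates(diagnoses)
rejected_on_review = {"p2"}
confirmed = candidates - rejected_on_review

units = [
    InfoUnit("p1", date(2010, 5, 3), "icu", "ventilation started"),
    InfoUnit("p1", date(2010, 5, 1), "clinical", "admission"),
    InfoUnit("p1", date(2010, 5, 2), "icu", "ICU transfer"),
    InfoUnit("p2", date(2011, 1, 1), "clinical", "admission"),
]
timelines = build_timelines(units, confirmed)
```

Applied to the figures reported above, `false_positive_share(3850, 1218)` gives about 0.32, and 3850 - 1218 leaves the 2632 confirmed patients. In this sketch the manual review step is represented only by the `rejected_on_review` set, because the abstract describes that step as a human check of medical charts, not as an automated rule.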
