Multi-cancer risk stratification based on national health data: a retrospective modelling and validation study

Lancet Digital Health (IF 23.8; Region 1, Medicine; JCR Q1, Medical Informatics) Pub Date: 2024-05-22 DOI: 10.1016/S2589-7500(24)00062-1

Alexander W Jung PhD, Peter C Holm MSc, Kumar Gaurav PhD, Jessica Xin Hjaltelin PhD, Davide Placido PhD, Prof Laust Hvas Mortensen PhD, Prof Ewan Birney PhD, Prof Søren Brunak PhD, Prof Moritz Gerstung PhD

Journal Article. Lancet Digital Health, volume 6, issue 6, pages e396-e406. Full text: https://www.sciencedirect.com/science/article/pii/S2589750024000621

Citations: 0

Abstract



Multi-cancer risk stratification based on national health data: a retrospective modelling and validation study

Background

Health care is experiencing a drive towards digitisation, and many countries are implementing national health data resources. Although a range of cancer risk models exists, the utility on a population level for risk stratification across cancer types has not been fully explored. We aimed to close this gap by evaluating pan-cancer risk models built on electronic health records across the Danish population with validation in the UK Biobank.

Methods

In this retrospective modelling and validation study, data for model development and internal validation were derived from the following Danish health registries: the Central Person Registry, the Danish National Patient Registry, the death registry, the cancer registry, and full-text medical records from secondary care records in the capital region. The development data included adults aged 16–86 years without previous malignant cancers in the time period from Jan 1, 1995, to Dec 31, 2014. The internal validation period was from Jan 1, 2015, to April 10, 2018, and the data included all adults without a previous indication of cancer aged 16–75 years on Dec 31, 2014. The external validation cohort from the UK Biobank included all adults without a previous indication of cancer aged 50–75 years. We used time-dependent Bayesian Cox hazard models built on the combined medical history of Danish individuals. A set of 1392 covariates from available clinical disease trajectories, text-mined basic health factors, and family histories were used to train predictive models of 20 major cancer types. The models were validated on cancer incidence between 2015 and 2018 across Denmark and on individuals in the UK Biobank. The primary outcomes were discrimination and calibration performance.
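As a rough orientation only, a time-dependent Cox hazard model of the kind named above is commonly written as follows. The notation is an assumption for illustration: $c$ indexes the 20 cancer types, $x_i(t)$ is the covariate vector of individual $i$ at time $t$ (1392 entries here), $h_{0,c}$ is a baseline hazard, and $\beta_c$ are the coefficients. The abstract does not state the priors placed on $\beta_c$ or how the baseline hazard is handled, so those are left unspecified.

```latex
% Generic time-dependent Cox hazard for cancer type c (illustrative notation, not the authors' exact specification)
h_c\bigl(t \mid x_i(t)\bigr) \;=\; h_{0,c}(t)\,\exp\!\bigl(\beta_c^{\top} x_i(t)\bigr),
\qquad c = 1, \dots, 20, \quad x_i(t) \in \mathbb{R}^{1392}
% In a Bayesian treatment, a prior p(\beta_c) is placed on the coefficients (form not given in the abstract).
```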

Findings

From the Danish registries, we included 6 732 553 individuals covering 60 million hospital visits, 90 million diagnoses, and a total of 193 million life-years between Jan 1, 1978, and April 10, 2018. Danish registry data covering the period from Jan 1, 2015, to April 10, 2018, were used to internally validate risk models, containing a total of 4 248 491 individuals who remained at risk of a primary malignant cancer diagnosis and 67 401 cancer cases recorded. For the external validation, we evaluated the same time period in the UK Biobank covering 377 004 individuals with 11 486 cancer cases. The predictive performance of the models on Danish data showed good discrimination (concordance index 0·81 [SD 0·08], ranging from 0·66 [95% CI 0·65–0·67] for cervix uteri cancer to 0·91 [0·90–0·92] for liver cancer). Performance was similar on the UK Biobank in a direct transfer when controlling for shifts in the age distribution (concordance index 0·66 [SD 0·08], ranging from 0·55 [95% CI 0·44–0·66] for cervix uteri cancer to 0·78 [0·77–0·79] for lung cancer). Cancer risks were associated, in addition to heritable components, with a broad range of preceding diagnoses and health factors. The best overall performance was seen for cancers of the digestive system (oesophageal, stomach, colorectal, liver, and pancreatic) but also thyroid, kidney, and uterine cancers.
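The discrimination figures above are concordance indices. A minimal sketch of one common estimator (Harrell's C for right-censored data) is given below for readers unfamiliar with the metric. The function name and the toy data are hypothetical, tied observation times are simply skipped, and the abstract does not say which concordance estimator the authors used, so this should not be read as their method.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by risk.

    times       -- observed follow-up time per individual
    events      -- 1/True if the event (diagnosis) was observed, 0/False if censored
    risk_scores -- predicted risk; higher should mean an earlier event
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let a be the individual with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Simplification: skip tied times; a pair is only comparable
        # when the shorter time ended in an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # earlier event had the higher predicted risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    times = [2, 4, 6, 8]
    events = [1, 1, 0, 1]
    risks = [0.9, 0.7, 0.2, 0.4]
    print(concordance_index(times, events, risks))
```

A value of 0·5 corresponds to random ranking and 1·0 to perfect ranking, which is the scale on which the reported 0·81 (Denmark) and 0·66 (UK Biobank) averages sit.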

Interpretation

Data available in national electronic health databases can be used to approximate cancer risk factors and enable risk predictions in most cancer types. Model predictions generalise between the Danish and UK health-care systems. With the emergence of multi-cancer early detection tests, electronic health record-based risk models could supplement screening efforts.

Funding

Novo Nordisk Foundation and the Danish Innovation Foundation.


Source journal

Lancet Digital Health Multiple-

CiteScore: 41.20

Self-citation rate: 1.60%

Articles published: 232

Review time: 13 weeks

About the journal: The Lancet Digital Health publishes important, innovative, and practice-changing research on any topic connected with digital technology in clinical medicine, public health, and global health. The journal’s open access content crosses subject boundaries, building bridges between health professionals and researchers. By bringing together the most important advances in this multidisciplinary field, The Lancet Digital Health is the most prominent publishing venue in digital health. We publish a range of content types including Articles, Review, Comment, and Correspondence, contributing to promoting digital technologies in health practice worldwide.