Estimating Rare Disease Incidences With Large-scale Internet Search Data: Development and Evaluation of a Two-step Machine Learning Method

JMIR infodemiology | Impact factor: 3.5 | JCR Q1 (Health Care Sciences & Services) | Pub Date: 2023-04-28 | DOI: 10.2196/42721

Jiayu Li, Zhiyu He, M. Zhang, Weizhi Ma, Ye Jin, Lei Zhang, Shu-you Zhang, Yiqun Liu, Shaoping Ma

Citations: 1

Abstract

Background: As rare diseases (RDs) receive increasing attention, obtaining accurate RD incidence estimates has become an essential concern in public health. Because RDs are difficult to diagnose, include diverse types, and have scarce cases, building RD registries with traditional epidemiological methods is costly. With the development of the internet, users have become accustomed to searching for disease-related information through search engines before seeking medical treatment. Online search data therefore provide a new source for estimating RD incidences.

Objective: The aim of this study was to estimate the incidences of multiple RDs in distinct regions of China with online search data.

Methods: The study covered 15 RDs in China from 2016 to 2019. The online search data were obtained from Sogou, one of the top 3 commercial search engines in China. By matching multilevel keywords related to the 15 RDs over the 4 years, we retrieved keyword-matched RD-related queries. The queries issued before and after the keyword-matched queries formed the basis of the RD-related search sessions. A two-step method was developed to estimate RD incidences from the users’ intents conveyed by the sessions. In the first step, a combination of long short-term memory and multilayer perceptron algorithms was used to predict whether the intent of each search session was RD-concerned, news-concerned, or other. The second step used a linear regression (LR) model to estimate the incidences of multiple RDs in distinct regions based on the numbers of RD-concerned and news-concerned sessions. For evaluation, the estimated incidences were compared with RD incidences collected from China’s national multicenter clinical database of RDs. The root mean square error (RMSE) and relative error rate (RER) were used as the evaluation metrics.

Results: The RD-related online data included 2,749,257 queries and 1,769,986 sessions from 1,380,186 users from 2016 to 2019. The best LR model with sessions as the input estimated the RD incidences with an RMSE of 0.017 (95% CI 0.016-0.017) and an RER of 0.365 (95% CI 0.341-0.388). The best LR model with queries as the input had an RMSE of 0.023 (95% CI 0.017-0.029) and an RER of 0.511 (95% CI 0.377-0.645). Compared with queries, using session intents reduced the RER by 28.57% (P=.01). Analysis of different RDs and regions showed that session input was more suitable for estimating the incidences of most diseases (14 of 15 RDs). Moreover, examples focusing on two RDs showed that news-concerned session intents reflected news of an outbreak and helped correct the overestimation of incidences. Experiments on RD types further indicated that type had no significant influence on the RD estimation task.

Conclusions: This work sheds light on a novel method for rapid estimation of RD incidences in the internet era and demonstrates that search session intents were especially helpful for the estimation. The proposed two-step estimation method could be a valuable supplement to the traditional registry for understanding RDs, planning policies, and allocating medical resources. The use of search sessions in disease detection and estimation could be transferred to infoveillance of large-scale epidemics or chronic diseases.
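
As an illustration of the second step and the evaluation described in the abstract, the following is a minimal sketch and not the authors' implementation. It fits an ordinary least squares regression on two per-region features (counts of RD-concerned and news-concerned sessions) and scores the estimates with RMSE and RER. The data are synthetic, all function names are hypothetical, and the RER formula (mean absolute error relative to the observed incidence) is an assumption, since the abstract does not define it.

```python
import numpy as np


def fit_linear_regression(features, incidences):
    """Fit ordinary least squares with an intercept; return [intercept, w1, w2, ...]."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, incidences, rcond=None)
    return coef


def predict(coef, features):
    """Apply fitted coefficients to a feature matrix."""
    design = np.column_stack([np.ones(len(features)), features])
    return design @ coef


def rmse(estimated, observed):
    """Root mean square error between estimated and observed incidences."""
    return float(np.sqrt(np.mean((estimated - observed) ** 2)))


def relative_error_rate(estimated, observed):
    """Assumed RER definition: mean of |estimated - observed| / observed."""
    return float(np.mean(np.abs(estimated - observed) / observed))


def relative_decrease(baseline, improved):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - improved) / baseline


# Synthetic stand-in data (NOT from the study): one row per disease-region pair.
rng = np.random.default_rng(0)
rd_sessions = rng.uniform(10, 200, size=60)    # RD-concerned session counts
news_sessions = rng.uniform(0, 50, size=60)    # news-concerned session counts
features = np.column_stack([rd_sessions, news_sessions])
observed = (0.05 + 0.001 * rd_sessions - 0.0005 * news_sessions
            + rng.normal(0, 0.002, size=60))

coef = fit_linear_regression(features, observed)
estimated = predict(coef, features)
print("RMSE:", rmse(estimated, observed))
print("RER:", relative_error_rate(estimated, observed))

# The reported RER reduction follows from the two RER values in the Results.
print("RER decrease (%):", round(relative_decrease(0.511, 0.365) * 100, 2))
```

In this synthetic setup the news-concerned count carries a negative weight, mirroring the abstract's observation that news-driven search spikes can otherwise cause incidence to be overestimated. The sign and size of that weight in the actual study are not stated in the abstract.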


