在缺乏领域知识的情况下利用大数据会影响公共卫生决策

IF 9.4 1区 综合性期刊 Q1 MULTIDISCIPLINARY SCIENCES Proceedings of the National Academy of Sciences of the United States of America Pub Date : 2024-09-17 DOI:10.1073/pnas.2402387121
Miao Zhang, Salman Rahman, Vishwali Mhasawade, Rumi Chunara
{"title":"在缺乏领域知识的情况下利用大数据会影响公共卫生决策","authors":"Miao Zhang, Salman Rahman, Vishwali Mhasawade, Rumi Chunara","doi":"10.1073/pnas.2402387121","DOIUrl":null,"url":null,"abstract":"New data sources and AI methods for extracting information are increasingly abundant and relevant to decision-making across societal applications. A notable example is street view imagery, available in over 100 countries, and purported to inform built environment interventions (e.g., adding sidewalks) for community health outcomes. However, biases can arise when decision-making does not account for data robustness or relies on spurious correlations. To investigate this risk, we analyzed 2.02 million Google Street View (GSV) images alongside health, demographic, and socioeconomic data from New York City. Findings demonstrate robustness challenges; built environment characteristics inferred from GSV labels at the intracity level often do not align with ground truth. Moreover, as average individual-level behavior of physical inactivity significantly mediates the impact of built environment features by census tract, intervention on features measured by GSV would be misestimated without proper model specification and consideration of this mediation mechanism. Using a causal framework accounting for these mediators, we determined that intervening by improving 10% of samples in the two lowest tertiles of physical inactivity would lead to a 4.17 (95% CI 3.84–4.55) or 17.2 (95% CI 14.4–21.3) times greater decrease in the prevalence of obesity or diabetes, respectively, compared to the same proportional intervention on the number of crosswalks by census tract. This study highlights critical issues of robustness and model specification in using emergent data sources, showing the data may not measure what is intended, and ignoring mediators can result in biased intervention effect estimates.","PeriodicalId":20548,"journal":{"name":"Proceedings of the National Academy of Sciences of the United States of America","volume":null,"pages":null},"PeriodicalIF":9.4000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Utilizing big data without domain knowledge impacts public health decision-making\",\"authors\":\"Miao Zhang, Salman Rahman, Vishwali Mhasawade, Rumi Chunara\",\"doi\":\"10.1073/pnas.2402387121\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"New data sources and AI methods for extracting information are increasingly abundant and relevant to decision-making across societal applications. A notable example is street view imagery, available in over 100 countries, and purported to inform built environment interventions (e.g., adding sidewalks) for community health outcomes. However, biases can arise when decision-making does not account for data robustness or relies on spurious correlations. To investigate this risk, we analyzed 2.02 million Google Street View (GSV) images alongside health, demographic, and socioeconomic data from New York City. Findings demonstrate robustness challenges; built environment characteristics inferred from GSV labels at the intracity level often do not align with ground truth. Moreover, as average individual-level behavior of physical inactivity significantly mediates the impact of built environment features by census tract, intervention on features measured by GSV would be misestimated without proper model specification and consideration of this mediation mechanism. Using a causal framework accounting for these mediators, we determined that intervening by improving 10% of samples in the two lowest tertiles of physical inactivity would lead to a 4.17 (95% CI 3.84–4.55) or 17.2 (95% CI 14.4–21.3) times greater decrease in the prevalence of obesity or diabetes, respectively, compared to the same proportional intervention on the number of crosswalks by census tract. This study highlights critical issues of robustness and model specification in using emergent data sources, showing the data may not measure what is intended, and ignoring mediators can result in biased intervention effect estimates.\",\"PeriodicalId\":20548,\"journal\":{\"name\":\"Proceedings of the National Academy of Sciences of the United States of America\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":9.4000,\"publicationDate\":\"2024-09-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the National Academy of Sciences of the United States of America\",\"FirstCategoryId\":\"103\",\"ListUrlMain\":\"https://doi.org/10.1073/pnas.2402387121\",\"RegionNum\":1,\"RegionCategory\":\"综合性期刊\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"MULTIDISCIPLINARY SCIENCES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the National Academy of Sciences of the United States of America","FirstCategoryId":"103","ListUrlMain":"https://doi.org/10.1073/pnas.2402387121","RegionNum":1,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
引用次数: 0

摘要

用于提取信息的新数据源和人工智能方法越来越丰富,与社会应用领域的决策也越来越相关。一个显著的例子是街景图像,在 100 多个国家都有提供,据称可为建筑环境干预措施(如增加人行道)提供信息,以改善社区健康状况。然而,如果决策没有考虑到数据的稳健性或依赖于虚假的相关性,就会产生偏差。为了研究这一风险,我们分析了 202 万张谷歌街景(GSV)图片以及纽约市的健康、人口和社会经济数据。研究结果表明了稳健性方面的挑战;从城市内部的 GSV 标签推断出的建筑环境特征往往与地面实况不符。此外,由于个人层面的平均身体不活动行为对各人口普查区的建筑环境特征的影响有显著的中介作用,如果没有适当的模型规范和对这种中介机制的考虑,对 GSV 测量特征的干预将被错误估计。利用考虑到这些中介因素的因果框架,我们确定,与按人口普查区对人行横道数量进行相同比例的干预相比,通过改善 10%的身体不活跃最低两分位数样本进行干预,将导致肥胖或糖尿病患病率分别下降 4.17 (95% CI 3.84-4.55) 或 17.2 (95% CI 14.4-21.3)倍。这项研究强调了使用新兴数据源时的稳健性和模型规范等关键问题,表明数据可能无法衡量预期效果,忽略中介因素可能导致干预效果估计值出现偏差。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Utilizing big data without domain knowledge impacts public health decision-making
New data sources and AI methods for extracting information are increasingly abundant and relevant to decision-making across societal applications. A notable example is street view imagery, available in over 100 countries, and purported to inform built environment interventions (e.g., adding sidewalks) for community health outcomes. However, biases can arise when decision-making does not account for data robustness or relies on spurious correlations. To investigate this risk, we analyzed 2.02 million Google Street View (GSV) images alongside health, demographic, and socioeconomic data from New York City. Findings demonstrate robustness challenges; built environment characteristics inferred from GSV labels at the intracity level often do not align with ground truth. Moreover, as average individual-level behavior of physical inactivity significantly mediates the impact of built environment features by census tract, intervention on features measured by GSV would be misestimated without proper model specification and consideration of this mediation mechanism. Using a causal framework accounting for these mediators, we determined that intervening by improving 10% of samples in the two lowest tertiles of physical inactivity would lead to a 4.17 (95% CI 3.84–4.55) or 17.2 (95% CI 14.4–21.3) times greater decrease in the prevalence of obesity or diabetes, respectively, compared to the same proportional intervention on the number of crosswalks by census tract. This study highlights critical issues of robustness and model specification in using emergent data sources, showing the data may not measure what is intended, and ignoring mediators can result in biased intervention effect estimates.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
CiteScore
19.00
自引率
0.90%
发文量
3575
审稿时长
2.5 months
期刊介绍: The Proceedings of the National Academy of Sciences (PNAS), a peer-reviewed journal of the National Academy of Sciences (NAS), serves as an authoritative source for high-impact, original research across the biological, physical, and social sciences. With a global scope, the journal welcomes submissions from researchers worldwide, making it an inclusive platform for advancing scientific knowledge.
期刊最新文献
William B. Wood: Pioneering scientist and educator (1938 to 2024). Collaboration can preserve the integrity of gold standard carbon data from forest inventories. Mechanical confinement prevents ectopic platelet release. Toward a more credible assessment of the credibility of science by many-analyst studies. Wildlife trade data capture: National policy is foundational to science.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1