Pub Date: 2026-03-25; DOI: 10.1146/annurev-biodatasci-092724-035434
Geng Wang, Lavinia Paternoster, Nicole M Warrington
Genetic influences on how human traits change over time remain underexplored and may play an important role in disease processes. In this review, we explore emerging statistical approaches for incorporating longitudinal data on trait trajectories into genetic epidemiology studies, including longitudinal genome-wide association studies, polygenic scores, and Mendelian randomization. We discuss the caution required when analyzing longitudinal data focused on disease progression, where analyses are conducted within a group of patients rather than the general population. Finally, we outline the large longitudinal data resources that are available and discuss future directions in trajectory-based genetic epidemiological studies. Embracing time as a critical dimension of human traits offers deeper insight into disease pathways and intervention opportunities.
Title: Statistical Methods for Understanding Trajectories in Genetic Epidemiology. Journal: Annual Review of Biomedical Data Science.
Pub Date: 2026-03-24; DOI: 10.1146/annurev-biodatasci-092724-031932
Bradley A Malin, Chao Yan, Luca Bonomi
Health data are increasingly generated, shared, and analyzed across an ever-growing collection of settings. While these developments enable new forms of biomedical discovery and clinical decision support, they also introduce evolving privacy, security, and trust challenges that extend beyond traditional regulatory and technical frameworks. In this review, we characterize the various risks and protections throughout the health data life cycle, from data generation and primary use in healthcare to secondary use in research and artificial intelligence (AI) model development. We discuss how regulation, organizational practices, and technological choices shape data protection requirements, and we contextualize emerging threats, such as incidental disclosures through AI tools. We further review technical approaches for mitigating these risks, including access control and auditing, reidentification risk assessment and statistical mechanisms for risk mitigation (e.g., differential privacy), and synthetic data generation. We also consider how collaboration across disparate organizations may be achieved through federated learning mechanisms and cryptographic technologies, such as secure multiparty computation. Throughout, we highlight trade-offs between privacy protection and data utility, and we articulate practical challenges in deploying these methods at scale. We conclude by identifying open issues for the field, including the need for standardized metrics and greater transparency to support trust in data-driven healthcare and research.
Title: Privacy and Security Throughout the Health Data Life Cycle: From Primary Care to Research Networks. Journal: Annual Review of Biomedical Data Science.
Pub Date: 2026-03-09; DOI: 10.1146/annurev-biodatasci-092724-030452
Manuel Corpas, Oyesola Ojewunmi, Heinner Guio, Segun Fatumo
Electronic health record (EHR)-linked biobanks are transforming biomedical research, enabling population-scale studies that integrate genomic, clinical, and phenotypic data. Yet as these resources proliferate, it remains unclear how their research outputs reflect global health priorities. This article presents a comprehensive review of five globally established EHR-linked biobanks: UK Biobank, the Million Veteran Program, FinnGen, the All of Us Research Program, and the Estonian Biobank. Drawing on 14,142 peer-reviewed publications from 2000 to 2024, we show how each biobank displays a distinct thematic profile, shaped by institutional mandates, population focus, and methodological design. We further evaluate alignment with global disease burden by mapping biobank-linked publications to 25 high-priority disease areas using World Health Organization disability-adjusted life years data. Our burden-adjusted gap scores and opportunity indices reveal striking underrepresentation of conditions such as malaria, tuberculosis, and diarrheal diseases when comparing biobank research output against high-priority diseases.
Title: Electronic Health Record-Linked Biobank Expansion Reveals Global Health Inequities. Journal: Annual Review of Biomedical Data Science.
Pub Date: 2026-03-06; DOI: 10.1146/annurev-biodatasci-092724-043507
Kivilcim Ozturk, Adam Klie, Hannah Carter
Precision medicine aims to tailor treatment to the individual, improving medical outcomes and quality of life. Realizing this vision requires understanding how disease mechanisms and drug responses vary across patients. Advances in molecular profiling have enabled detailed measurement of genetic, epigenetic, spatial, and imaging features at multiple biological scales, from single cells to tissues. These rich and complementary data promise insight into the drivers of human disease where individual data layers have often provided an incomplete picture. Increasingly, studies span multiple measurement modalities and scales, presenting both opportunities and challenges. Central among these is how to combine data types to uncover actionable biology. This review surveys computational strategies for analyzing multimodal and multiscale datasets, distinguishing between approaches that treat each modality independently and those that perform true integrative modeling. We highlight emerging methods, with a focus on oncology, where these tools are helping to reveal mechanisms and guide therapeutic decisions.
Title: Precision Oncology: Multimodal and Multiscale Methods to Promote Mechanistic Understanding. Journal: Annual Review of Biomedical Data Science.
Pub Date: 2025-08-01; Epub Date: 2025-04-08; DOI: 10.1146/annurev-biodatasci-020722-114525
Grace D Ramey, Hannah Takasuka, John A Capra
The growth of electronic health record (EHR) databases in size and availability has created an unprecedented opportunity to better understand human health and disease. However, conducting robust EHR studies requires careful filtering criteria and study design, as EHRs pose several challenges that can confound analyses and lead to inaccurate results. Here we review these challenges and make suggestions about how to avoid or adjust for major confounders and biases in common EHR study designs. We further highlight qualities of EHR data that make different diseases more or less feasible for study. These recommendations for conducting research using EHRs will help inform database selection, improve reproducibility of results across the field, and enhance the validity of study results.
Title: Strategies for Creating Robust Patient Groups to Study Diverse Conditions with Electronic Health Records. Journal: Annual Review of Biomedical Data Science. Pages: 317-340.
Pub Date: 2025-08-01; Epub Date: 2025-05-01; DOI: 10.1146/annurev-biodatasci-101424-121439
Lucy Ham, Taylor E Woodward, Megan A Coomer, Michael P H Stumpf
Many cellular processes involve information processing and decision-making. We can probe these processes at increasing molecular detail. The analysis of heterogeneous data remains a challenge that requires new ways of thinking about cells in quantitative, predictive, and mechanistic ways. We discuss the role of mathematical models in the context of cell-fate decision-making systems across the tree of life. Complex multicellular organisms have been a particular focus, but single-celled organisms also have to sense and respond to their environment. We center our discussion around the idea of design principles that we can learn from observations and modeling and exploit in order to (re)-design or guide cellular behavior.
Title: Mapping, Modeling, and Reprogramming Cell-Fate Decision-Making Systems. Journal: Annual Review of Biomedical Data Science. Pages: 537-562.
Pub Date: 2025-08-01; Epub Date: 2025-04-16; DOI: 10.1146/annurev-biodatasci-103123-094804
Ryan Baker, Josep Bassaganya-Riera, Nuria Tubau-Juni, Andrew J Leber, Raquel Hontecillas
The TITAN-X Precision Medicine Platform was engineered to rapidly, fully, and efficiently utilize large-scale immunology datasets, including public data, in drug discovery and development. TITAN-X integrates big data with artificial intelligence (AI), bioinformatics, and advanced computational modeling to seamlessly transition from early target discovery to clinical testing of new therapeutics, developing biomarker-driven precision medicines tailored to specific patient populations. We illustrate the capabilities of TITAN-X through four case studies, demonstrating its use in computationally driven target discovery; characterization of novel immunometabolic mechanisms in infectious, inflammatory, and autoimmune diseases; and identification of biomarker signatures for patient stratification in clinical trials designed to maximize therapeutic efficacy and safety. Data-driven and AI-powered approaches like TITAN-X are enhancing the pace of drug development, reducing costs, tailoring treatments, and increasing the probability of success in clinical trials.
Title: The TITAN-X Platform Integrates Big Data, Artificial Intelligence, Bioinformatics, and Advanced Computational Modeling to Understand Immune Responses and Develop the Next Wave of Precision Medicines. Journal: Annual Review of Biomedical Data Science. Pages: 447-469.
Pub Date: 2025-08-01; DOI: 10.1146/annurev-biodatasci-103123-094441
Yiwen Lu, Bingyu Zhang, Jiayi Tong, Yong Chen
Distributed research networks have transformed modern clinical research by enabling large-scale, multi-institutional collaborations while maintaining patient privacy. Two prominent methodologies within these frameworks, meta-analysis and federated learning, address the challenges of synthesizing evidence from decentralized data. Meta-analysis aggregates study-level results to provide robust, interpretable estimates, making it a cornerstone of evidence synthesis for association studies. Federated learning complements this by enabling complex downstream tasks, such as predictive modeling and counterfactual inference, while protecting data privacy through privacy-preserving distributed algorithms. Federated learning facilitates communication-efficient computation and adapts seamlessly to heterogeneous datasets across diverse institutions. This review emphasizes the complementary strengths of federated learning's scalability, flexibility, and readiness for implementation alongside meta-analysis's robust frameworks for evidence synthesis and aggregation in clinical research. Integrations of synthetic data, artificial intelligence (AI)-enhanced harmonization, and hybrid human-AI frameworks are proposed as future directions, promising to further advance both methodologies and enhance their combined impact on privacy-conscious, data-driven healthcare research.
Title: Meta-Analysis and Federated Learning over Decentralized Distributed Research Networks. Journal: Annual Review of Biomedical Data Science. Volume: 8(1). Pages: 405-421.
Pub Date: 2025-08-01; Epub Date: 2025-01-29; DOI: 10.1146/annurev-biodatasci-103123-095633
Kevin K Tsang, Sophia Kivelson, Jose M Acitores Cortina, Aditi Kuchi, Jacob S Berkowitz, Hongyu Liu, Apoorva Srinivasan, Nadine A Friedrich, Yasaman Fatapour, Nicholas P Tatonetti
Cancer remains a leading cause of death globally. The complexity and diversity of cancer-related datasets across different specialties pose challenges in refining precision medicine for oncology. Foundation models offer a promising solution. Trained on vast amounts of data, these models develop a broad understanding across a wide range of tasks. We examine the role of foundation models in domains relevant to cancer research, including natural language processing, computer vision, molecular biology, and cheminformatics. Through a review of state-of-the-art methods, we explore how these models have already advanced translational cancer research goals such as precision tumor classification and artificial intelligence-assisted surgery. We also discuss prospective advances in areas like early tumor detection, personalized cancer treatment, and drug discovery. This review provides researchers with a curated set of resources and methodologies, offers practitioners a deeper understanding of how these models enhance cancer care, and points to opportunities for future applications of foundation models in cancer research.
Title: Foundation Models for Translational Cancer Biology. Journal: Annual Review of Biomedical Data Science. Pages: 51-80.
Pub Date: 2025-08-01; Epub Date: 2025-03-18; DOI: 10.1146/annurev-biodatasci-103123-095202
Nicolas Hiebel, Olivier Ferret, Karën Fort, Aurélie Névéol
Generative artificial intelligence (AI), operationalized as large language models, is increasingly used in the biomedical field to assist with a range of text processing tasks including text classification, information extraction, and decision support. In this article, we focus on the primary purpose of generative language models, namely the production of unstructured text. We review past and current methods used to generate text as well as methods for evaluating open text generation, i.e., in contexts where no reference text is available for comparison. We discuss clinical applications that can benefit from high quality, ethically designed text generation, such as clinical note generation and synthetic text generation in support of secondary use of health data. We also raise awareness of the risks involved with generative AI such as overconfidence in outputs due to anthropomorphism and the risk of representational and allocation harms due to biases.
Title: Clinical Text Generation: Are We There Yet? Journal: Annual Review of Biomedical Data Science. Pages: 173-198.