{"title":"第一,不伤害:解决人工智能在医学中不分布数据的挑战。","authors":"Chu Weng, Wesley Lin, Sherry Dong, Qi Liu, Hanrui Zhang","doi":"10.1111/cts.70132","DOIUrl":null,"url":null,"abstract":"<p>The advent of AI has brought transformative changes across many fields, particularly in biomedical field, where AI is now being used to facilitate drug discovery and development, enhance diagnostic and prognostic accuracy, and support clinical decision-making. For example, since 2021, there has been a notable increase in AI-related submissions to the US Food and Drug Administration (FDA) Center for Drug Evaluation and Research (CDER), reflecting the rapid expansion of AI applications in drug development [<span>1</span>]. In addition, the rapid growth in AI health applications is reflected by the exponential increase in the number of such studies found on PubMed [<span>2</span>]. However, the translation of AI models from development to real-world deployment remains challenging. This is due to various factors, including data drift, where the characteristics of data in the deployment phase differ from those used in model training. Consequently, ensuring the performance of medical AI models in the deployment phase has become a critical area of focus, as AI models that excel in controlled environments may still struggle with real-world variability, leading to poor predictions for patients whose characteristics differ significantly from the training set. Such cases, often referred to as OOD samples, present a major challenge for AI-driven decision-making, such as making diagnosis or selecting treatments for a patient. The failure to recognize these OOD samples can result in suboptimal or even harmful decisions.</p><p>To address this, we propose a prescreening procedure for medical AI model deployment (especially when the AI model risk is high), aimed at avoiding or flagging the predictions by AI models on OOD samples (Figure 1a). This procedure, we believe, can be beneficial for ensuring the trustworthiness of AI in medicine.</p><p>OOD scenarios are a common challenge in medical AI applications. For instance, a model trained predominantly on data from a specific demographic group may underperform when applied to patients from different demographic groups, resulting in inaccurate predictions. OOD cases can also arise when AI models encounter data that differ from the training data due to factors like variations in medical practices and treatment landscapes of the clinical trials. These issues can potentially lead to harm to patients (e.g., misdiagnosis, inappropriate treatment recommendations), and a loss of trust in AI systems.</p><p>The importance of detecting OOD samples to define the scope of use for AI models has been highlighted in multiple research and clinical studies. A well-known example is the Medical Out-of-Distribution-Analysis (MOOD) Challenge [<span>3</span>], which benchmarked OOD detection algorithms across several supervised and unsupervised models, including autoencoder neural networks, U-Net, vector-quantized variational autoencoders, principle component analysis (PCA), and linear Gaussian process regression. These algorithms were used to identify brain magnetic resonance imaging (MRI) and abdominal computed tomography (CT) scan images that deviated from the training data, thereby reducing the risk of overconfident predictions from machine learning models. 
Similarly, methods such as the Gram matrix algorithm and linear/outlier synthesis have been employed to detect OOD samples in skin lesion images [<span>4</span>].</p><p>Beyond medical imaging, OOD detection has also been recommended for other healthcare data types, such as electronic health records (EHRs), to enhance model reliability [<span>5</span>]. In addition to diagnostic applications, OOD detection can enrich clinical trial cohorts by identifying patients with canonical symptoms. For example, Hopkins et al. used anomaly scores to determine whether patients with bipolar depression should be included in clinical trials for non-racemic amisulpride (SEP-4199). The patients identified as anomalies exhibited distinct responses to the treatment compared to canonical patients [<span>6</span>].</p><p>To demonstrate how OOD detection techniques can be integrated into existing medical AI pipelines, we extend a previously published antimalarial prediction model by incorporating a machine-learning-based OOD detector (Figure 1b). After adding OOD detection, the system exhibits more robust performance when evaluated on transcriptomes from a previously unseen geographic region.</p><p>We had originally trained a tree-based gradient boosting algorithm, LightGBM, using transcriptomes from <i>Plasmodium falciparum</i> isolates obtained from patients, to predict resistance to artemisinin, an antimalarial drug [<span>7</span>]. Briefly, the training data consisted of transcriptomes from isolates in Southeast Asia, alongside the clearance rates of these isolates [<span>8</span>]. Isolates with slow clearance rates were labeled as resistant, while others were classified as non-resistant [<span>7</span>].</p><p>To enhance the model, we incorporated an OOD detection approach that discriminates between in-distribution (ID) and OOD samples. This was done based on the distance between each sample in the latent space of a deep neural network, trained using contrastive learning, and its <i>k</i>-th nearest neighbor in the training set [<span>9</span>]. If the distance exceeded a defined threshold, set such that 5% of the training observations are classified as OOD, the sample was classified as OOD. This approach has demonstrated strong performance in other domains, such as image detection and time series modeling [<span>9, 10</span>].</p><p>To simulate applying a pretrained model in a new setting, we tested artemisinin resistance in a geographically distinct region. We trained our model on 786 transcriptomes from Southeast Asian countries in the Mok et al. dataset, excluding Myanmar. We then validated the pretrained model on samples from Myanmar, using the deep nearest neighbor approach [<span>9</span>] to identify which samples were OOD relative to the training data. We first evaluated the pretrained model on the entire validation set, then removed the OOD samples and reassessed performance to examine the impact.</p><p>During OOD detection, 5 of the 82 samples from Myanmar were identified as OOD. We evaluated the model's predictive performance using AUROC, achieving 0.5973 [0.5508, 0.6605] AUROC on the full Myanmar dataset. After removing the OOD samples, performance improved to 0.6934 [0.6356, 0.7593] AUROC. In contrast, for the five OOD samples, the model's performance was significantly lower, at 0.3310 [0.2153, 0.4421] AUROC—well below random chance (AUROC = 0.5). 
These results demonstrate that OOD detection effectively identifies samples where the pretrained model's predictions are unreliable and improved the performances by removing unreliable samples in the validation settings.</p><p>AI models need a clearly defined scope of use in terms of input data. This is especially true when moving the AI model from the development stage into a real-world setting. For example, many AI models are developed using data from clinical trials. In general, clinical trial data are generated from highly controlled environments, with carefully selected, relatively homogeneous patient populations. These trials have predetermined durations and predefined endpoints. In contrast, real-world data (RWD) are collected under routine clinical conditions, often in an uncontrolled setting, from patients with diverse demographics, varying disease presentations, and different treatment approaches. The heterogeneity in RWD can present challenges, as AI models may struggle to generalize to broader, more diverse populations. Unlike clinical trials, RWD might not have predefined endpoints for data collection, and it may introduce more variability and potential biases. These discrepancies between clinical trial data and real-world data can lead to suboptimal performance when AI models are applied beyond the scope of their original training.</p><p>Given the high dimensionality of AI model input data, defining the scope of use can become complex. OOD detection can play a valuable role in addressing this challenge. By incorporating OOD detection as a pre-screening step, clinicians can better evaluate whether AI models are suitable for a specific patient, enhancing the safety and effectiveness of AI in medicine. This approach is crucial for reducing the risk of incorrect diagnoses or treatments and ensuring AI is used responsibly in healthcare settings.</p><p>However, OOD detection can be challenging, and current methods still have limitations. In most datasets, there are no explicit labels indicating whether data is OOD. As a result, OOD models are often trained to detect when input data follows a distribution different from the training data, introducing an element of randomness into the detection process. Combining multiple OOD detection models with different random seeds or using models that rely on various mechanisms—such as K-nearest neighbors, isolation forest, Bayesian neural networks, variational autoencoders (VAEs), or contrastive learning—can improve the specificity of OOD detection. More research is needed to further improve the OOD detection methods.</p><p>It is important to note that the OOD approach could incentivize greater inclusivity and diversity in clinical trials and training data. Generally speaking, the more inclusive and diverse the clinical trials and training data are, the broader the scope of use will be for the AI model, as fewer patients will be classified as OOD during the model deployment phase. In addition, the OOD approach can help identify gaps in data inclusivity and offer valuable insights for improving future data collection and enhancing the diversity in clinical trial datasets.</p><p>Looking ahead, the integration of OOD detection into medical AI systems can be an important step toward responsible AI deployment. 
By explicitly addressing the limitations of our training data and our model capabilities, we can build more trustworthy AI systems that align with the rigorous standards of medical practice and the fundamental principle of “first, do no harm.”</p><p>The authors declare no conflicts of interest.</p><p>This article reflects the views of the author and should not be construed to represent FDA’s views or policies. As an Associate Editor for Clinical and Translational Science, Qi Liu was not involved in the review or decision process for this paper.</p>","PeriodicalId":50610,"journal":{"name":"Cts-Clinical and Translational Science","volume":"18 1","pages":""},"PeriodicalIF":3.1000,"publicationDate":"2025-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11739455/pdf/","citationCount":"0","resultStr":"{\"title\":\"First, Do No Harm: Addressing AI's Challenges With Out-of-Distribution Data in Medicine\",\"authors\":\"Chu Weng, Wesley Lin, Sherry Dong, Qi Liu, Hanrui Zhang\",\"doi\":\"10.1111/cts.70132\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>The advent of AI has brought transformative changes across many fields, particularly in biomedical field, where AI is now being used to facilitate drug discovery and development, enhance diagnostic and prognostic accuracy, and support clinical decision-making. For example, since 2021, there has been a notable increase in AI-related submissions to the US Food and Drug Administration (FDA) Center for Drug Evaluation and Research (CDER), reflecting the rapid expansion of AI applications in drug development [<span>1</span>]. In addition, the rapid growth in AI health applications is reflected by the exponential increase in the number of such studies found on PubMed [<span>2</span>]. However, the translation of AI models from development to real-world deployment remains challenging. This is due to various factors, including data drift, where the characteristics of data in the deployment phase differ from those used in model training. Consequently, ensuring the performance of medical AI models in the deployment phase has become a critical area of focus, as AI models that excel in controlled environments may still struggle with real-world variability, leading to poor predictions for patients whose characteristics differ significantly from the training set. Such cases, often referred to as OOD samples, present a major challenge for AI-driven decision-making, such as making diagnosis or selecting treatments for a patient. The failure to recognize these OOD samples can result in suboptimal or even harmful decisions.</p><p>To address this, we propose a prescreening procedure for medical AI model deployment (especially when the AI model risk is high), aimed at avoiding or flagging the predictions by AI models on OOD samples (Figure 1a). This procedure, we believe, can be beneficial for ensuring the trustworthiness of AI in medicine.</p><p>OOD scenarios are a common challenge in medical AI applications. For instance, a model trained predominantly on data from a specific demographic group may underperform when applied to patients from different demographic groups, resulting in inaccurate predictions. OOD cases can also arise when AI models encounter data that differ from the training data due to factors like variations in medical practices and treatment landscapes of the clinical trials. 
These issues can potentially lead to harm to patients (e.g., misdiagnosis, inappropriate treatment recommendations), and a loss of trust in AI systems.</p><p>The importance of detecting OOD samples to define the scope of use for AI models has been highlighted in multiple research and clinical studies. A well-known example is the Medical Out-of-Distribution-Analysis (MOOD) Challenge [<span>3</span>], which benchmarked OOD detection algorithms across several supervised and unsupervised models, including autoencoder neural networks, U-Net, vector-quantized variational autoencoders, principle component analysis (PCA), and linear Gaussian process regression. These algorithms were used to identify brain magnetic resonance imaging (MRI) and abdominal computed tomography (CT) scan images that deviated from the training data, thereby reducing the risk of overconfident predictions from machine learning models. Similarly, methods such as the Gram matrix algorithm and linear/outlier synthesis have been employed to detect OOD samples in skin lesion images [<span>4</span>].</p><p>Beyond medical imaging, OOD detection has also been recommended for other healthcare data types, such as electronic health records (EHRs), to enhance model reliability [<span>5</span>]. In addition to diagnostic applications, OOD detection can enrich clinical trial cohorts by identifying patients with canonical symptoms. For example, Hopkins et al. used anomaly scores to determine whether patients with bipolar depression should be included in clinical trials for non-racemic amisulpride (SEP-4199). The patients identified as anomalies exhibited distinct responses to the treatment compared to canonical patients [<span>6</span>].</p><p>To demonstrate how OOD detection techniques can be integrated into existing medical AI pipelines, we extend a previously published antimalarial prediction model by incorporating a machine-learning-based OOD detector (Figure 1b). After adding OOD detection, the system exhibits more robust performance when evaluated on transcriptomes from a previously unseen geographic region.</p><p>We had originally trained a tree-based gradient boosting algorithm, LightGBM, using transcriptomes from <i>Plasmodium falciparum</i> isolates obtained from patients, to predict resistance to artemisinin, an antimalarial drug [<span>7</span>]. Briefly, the training data consisted of transcriptomes from isolates in Southeast Asia, alongside the clearance rates of these isolates [<span>8</span>]. Isolates with slow clearance rates were labeled as resistant, while others were classified as non-resistant [<span>7</span>].</p><p>To enhance the model, we incorporated an OOD detection approach that discriminates between in-distribution (ID) and OOD samples. This was done based on the distance between each sample in the latent space of a deep neural network, trained using contrastive learning, and its <i>k</i>-th nearest neighbor in the training set [<span>9</span>]. If the distance exceeded a defined threshold, set such that 5% of the training observations are classified as OOD, the sample was classified as OOD. This approach has demonstrated strong performance in other domains, such as image detection and time series modeling [<span>9, 10</span>].</p><p>To simulate applying a pretrained model in a new setting, we tested artemisinin resistance in a geographically distinct region. We trained our model on 786 transcriptomes from Southeast Asian countries in the Mok et al. dataset, excluding Myanmar. 
We then validated the pretrained model on samples from Myanmar, using the deep nearest neighbor approach [<span>9</span>] to identify which samples were OOD relative to the training data. We first evaluated the pretrained model on the entire validation set, then removed the OOD samples and reassessed performance to examine the impact.</p><p>During OOD detection, 5 of the 82 samples from Myanmar were identified as OOD. We evaluated the model's predictive performance using AUROC, achieving 0.5973 [0.5508, 0.6605] AUROC on the full Myanmar dataset. After removing the OOD samples, performance improved to 0.6934 [0.6356, 0.7593] AUROC. In contrast, for the five OOD samples, the model's performance was significantly lower, at 0.3310 [0.2153, 0.4421] AUROC—well below random chance (AUROC = 0.5). These results demonstrate that OOD detection effectively identifies samples where the pretrained model's predictions are unreliable and improved the performances by removing unreliable samples in the validation settings.</p><p>AI models need a clearly defined scope of use in terms of input data. This is especially true when moving the AI model from the development stage into a real-world setting. For example, many AI models are developed using data from clinical trials. In general, clinical trial data are generated from highly controlled environments, with carefully selected, relatively homogeneous patient populations. These trials have predetermined durations and predefined endpoints. In contrast, real-world data (RWD) are collected under routine clinical conditions, often in an uncontrolled setting, from patients with diverse demographics, varying disease presentations, and different treatment approaches. The heterogeneity in RWD can present challenges, as AI models may struggle to generalize to broader, more diverse populations. Unlike clinical trials, RWD might not have predefined endpoints for data collection, and it may introduce more variability and potential biases. These discrepancies between clinical trial data and real-world data can lead to suboptimal performance when AI models are applied beyond the scope of their original training.</p><p>Given the high dimensionality of AI model input data, defining the scope of use can become complex. OOD detection can play a valuable role in addressing this challenge. By incorporating OOD detection as a pre-screening step, clinicians can better evaluate whether AI models are suitable for a specific patient, enhancing the safety and effectiveness of AI in medicine. This approach is crucial for reducing the risk of incorrect diagnoses or treatments and ensuring AI is used responsibly in healthcare settings.</p><p>However, OOD detection can be challenging, and current methods still have limitations. In most datasets, there are no explicit labels indicating whether data is OOD. As a result, OOD models are often trained to detect when input data follows a distribution different from the training data, introducing an element of randomness into the detection process. Combining multiple OOD detection models with different random seeds or using models that rely on various mechanisms—such as K-nearest neighbors, isolation forest, Bayesian neural networks, variational autoencoders (VAEs), or contrastive learning—can improve the specificity of OOD detection. More research is needed to further improve the OOD detection methods.</p><p>It is important to note that the OOD approach could incentivize greater inclusivity and diversity in clinical trials and training data. 
Generally speaking, the more inclusive and diverse the clinical trials and training data are, the broader the scope of use will be for the AI model, as fewer patients will be classified as OOD during the model deployment phase. In addition, the OOD approach can help identify gaps in data inclusivity and offer valuable insights for improving future data collection and enhancing the diversity in clinical trial datasets.</p><p>Looking ahead, the integration of OOD detection into medical AI systems can be an important step toward responsible AI deployment. By explicitly addressing the limitations of our training data and our model capabilities, we can build more trustworthy AI systems that align with the rigorous standards of medical practice and the fundamental principle of “first, do no harm.”</p><p>The authors declare no conflicts of interest.</p><p>This article reflects the views of the author and should not be construed to represent FDA’s views or policies. As an Associate Editor for Clinical and Translational Science, Qi Liu was not involved in the review or decision process for this paper.</p>\",\"PeriodicalId\":50610,\"journal\":{\"name\":\"Cts-Clinical and Translational Science\",\"volume\":\"18 1\",\"pages\":\"\"},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2025-01-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11739455/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Cts-Clinical and Translational Science\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1111/cts.70132\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"MEDICINE, RESEARCH & EXPERIMENTAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Cts-Clinical and Translational Science","FirstCategoryId":"3","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/cts.70132","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MEDICINE, RESEARCH & EXPERIMENTAL","Score":null,"Total":0}
First, Do No Harm: Addressing AI's Challenges With Out-of-Distribution Data in Medicine

Chu Weng, Wesley Lin, Sherry Dong, Qi Liu, Hanrui Zhang

Clinical and Translational Science, 2025;18(1). DOI: 10.1111/cts.70132
The advent of AI has brought transformative changes across many fields, particularly in the biomedical field, where AI is now being used to facilitate drug discovery and development, enhance diagnostic and prognostic accuracy, and support clinical decision-making. For example, since 2021, there has been a notable increase in AI-related submissions to the US Food and Drug Administration (FDA) Center for Drug Evaluation and Research (CDER), reflecting the rapid expansion of AI applications in drug development [1]. In addition, the rapid growth in AI health applications is reflected by the exponential increase in the number of such studies found on PubMed [2]. However, translating AI models from development to real-world deployment remains challenging. This is due to various factors, including data drift, where the characteristics of data in the deployment phase differ from those used in model training. Consequently, ensuring the performance of medical AI models in the deployment phase has become a critical area of focus, as AI models that excel in controlled environments may still struggle with real-world variability, leading to poor predictions for patients whose characteristics differ significantly from the training set. Such cases, often referred to as out-of-distribution (OOD) samples, present a major challenge for AI-driven decision-making, such as making a diagnosis or selecting a treatment for a patient. Failure to recognize these OOD samples can result in suboptimal or even harmful decisions.
To address this, we propose a prescreening procedure for medical AI model deployment (especially when the AI model risk is high), aimed at avoiding or flagging AI model predictions on OOD samples (Figure 1a). We believe this procedure can help ensure the trustworthiness of AI in medicine.
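To make this workflow concrete, below is a minimal sketch of how such a prescreening gate might wrap an existing predictor. The detector interface, its score convention (higher score meaning farther from the training data), and the threshold are placeholders for illustration, not the specific components used later in this article.

```python
import numpy as np

def prescreen_and_predict(model, ood_score_fn, threshold, X):
    """Flag likely-OOD inputs and abstain on them before prediction.

    `model` is any classifier exposing a scikit-learn-style predict_proba;
    `ood_score_fn` maps samples to an OOD score (assumed: higher = more OOD).
    Both are hypothetical placeholders for whatever a deployment actually uses.
    """
    scores = ood_score_fn(X)
    is_ood = scores > threshold
    preds = np.full(len(X), np.nan)               # NaN marks abstained (flagged) samples
    if (~is_ood).any():
        preds[~is_ood] = model.predict_proba(X[~is_ood])[:, 1]
    return preds, is_ood
```

Flagged samples can then be routed to a clinician or an alternative decision pathway rather than being scored automatically.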
OOD scenarios are a common challenge in medical AI applications. For instance, a model trained predominantly on data from a specific demographic group may underperform when applied to patients from different demographic groups, resulting in inaccurate predictions. OOD cases can also arise when AI models encounter data that differ from the training data due to factors such as variations in medical practice and in the treatment landscapes of clinical trials. These issues can lead to patient harm (e.g., misdiagnosis, inappropriate treatment recommendations) and a loss of trust in AI systems.
The importance of detecting OOD samples to define the scope of use for AI models has been highlighted in multiple research and clinical studies. A well-known example is the Medical Out-of-Distribution Analysis (MOOD) Challenge [3], which benchmarked OOD detection algorithms across several supervised and unsupervised models, including autoencoder neural networks, U-Net, vector-quantized variational autoencoders, principal component analysis (PCA), and linear Gaussian process regression. These algorithms were used to identify brain magnetic resonance imaging (MRI) and abdominal computed tomography (CT) scan images that deviated from the training data, thereby reducing the risk of overconfident predictions from machine learning models. Similarly, methods such as the Gram matrix algorithm and linear/outlier synthesis have been employed to detect OOD samples in skin lesion images [4].
Beyond medical imaging, OOD detection has also been recommended for other healthcare data types, such as electronic health records (EHRs), to enhance model reliability [5]. In addition to diagnostic applications, OOD detection can enrich clinical trial cohorts by identifying patients with canonical symptoms. For example, Hopkins et al. used anomaly scores to determine whether patients with bipolar depression should be included in clinical trials for non-racemic amisulpride (SEP-4199). The patients identified as anomalies exhibited distinct responses to the treatment compared to canonical patients [6].
To demonstrate how OOD detection techniques can be integrated into existing medical AI pipelines, we extend a previously published antimalarial prediction model by incorporating a machine-learning-based OOD detector (Figure 1b). After adding OOD detection, the system exhibits more robust performance when evaluated on transcriptomes from a previously unseen geographic region.
We had originally trained a tree-based gradient boosting algorithm, LightGBM, using transcriptomes from Plasmodium falciparum isolates obtained from patients, to predict resistance to artemisinin, an antimalarial drug [7]. Briefly, the training data consisted of transcriptomes from isolates in Southeast Asia, alongside the clearance rates of these isolates [8]. Isolates with slow clearance rates were labeled as resistant, while others were classified as non-resistant [7].
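For orientation, the following is a minimal sketch of this kind of training step using the LightGBM Python API, with synthetic data standing in for the transcriptomes and illustrative hyperparameters rather than those of the published model.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X_train = rng.normal(size=(786, 5000))   # stand-in for per-isolate transcript expression levels
y_train = rng.integers(0, 2, size=786)   # 1 = resistant (slow clearance), 0 = non-resistant

# Gradient-boosted decision trees; hyperparameters shown are illustrative only.
clf = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05, num_leaves=31)
clf.fit(X_train, y_train)

resistance_prob = clf.predict_proba(X_train)[:, 1]   # predicted probability of resistance
```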
To enhance the model, we incorporated an OOD detection approach that discriminates between in-distribution (ID) and OOD samples. The decision was based on the distance, in the latent space of a deep neural network trained with contrastive learning, between each sample and its k-th nearest neighbor in the training set [9]. If this distance exceeded a threshold, chosen so that 5% of the training observations would be classified as OOD, the sample was flagged as OOD. This approach has demonstrated strong performance in other domains, such as image detection and time series modeling [9, 10].
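A minimal sketch of this distance rule follows, assuming an `embed` function (a placeholder for the contrastively trained encoder) and an illustrative choice of k; the 5% training-quantile threshold follows the text.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def fit_knn_ood_detector(train_embeddings, k=10, train_ood_fraction=0.05):
    """Threshold on the distance to the k-th nearest training neighbor."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(train_embeddings)
    # For training points, column 0 is the zero distance to the point itself,
    # so the k-th neighbor distance is the last of k + 1 columns.
    dists, _ = nn.kneighbors(train_embeddings, n_neighbors=k + 1)
    kth_train_dist = dists[:, -1]
    threshold = np.quantile(kth_train_dist, 1.0 - train_ood_fraction)
    return nn, threshold

def flag_ood(nn, threshold, test_embeddings, k=10):
    dists, _ = nn.kneighbors(test_embeddings, n_neighbors=k)
    return dists[:, -1] > threshold              # True = treated as OOD

# Usage sketch: embed() is a hypothetical encoder trained with contrastive learning.
# nn, thr = fit_knn_ood_detector(embed(X_train))
# ood_mask = flag_ood(nn, thr, embed(X_test))
```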
To simulate applying a pretrained model in a new setting, we tested artemisinin resistance prediction in a geographically distinct region. We trained our model on 786 transcriptomes from Southeast Asian countries in the Mok et al. dataset, excluding Myanmar. We then validated the pretrained model on samples from Myanmar, using the deep nearest neighbor approach [9] to identify which samples were OOD relative to the training data. We first evaluated the pretrained model on the entire validation set, then removed the OOD samples and reassessed performance to examine the impact.
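A minimal sketch of this comparison is shown below, assuming the pretrained model's predicted scores, true resistance labels, and the detector's boolean OOD mask are available for the held-out region; the bootstrap interval is one illustrative way to obtain uncertainty bounds like those reported next, not necessarily the method used here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def compare_auroc(y_true, y_score, ood_mask):
    """AUROC on all held-out samples, on ID samples only, and on flagged OOD samples."""
    return {
        "full": roc_auc_score(y_true, y_score),
        "id_only": roc_auc_score(y_true[~ood_mask], y_score[~ood_mask]),
        "ood_only": roc_auc_score(y_true[ood_mask], y_score[ood_mask]),
    }

def bootstrap_auroc_ci(y_true, y_score, n_boot=2000, seed=0):
    """Illustrative percentile-bootstrap interval for AUROC."""
    rng = np.random.default_rng(seed)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, len(y_true), len(y_true))
        if np.unique(y_true[idx]).size == 2:     # AUROC needs both classes present
            aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.percentile(aucs, [2.5, 97.5])
```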
During OOD detection, 5 of the 82 samples from Myanmar were identified as OOD. We evaluated the model's predictive performance using AUROC, achieving 0.5973 [0.5508, 0.6605] on the full Myanmar dataset. After removing the OOD samples, performance improved to 0.6934 [0.6356, 0.7593]. In contrast, for the five OOD samples alone, performance was markedly lower, at 0.3310 [0.2153, 0.4421], well below random chance (AUROC = 0.5). These results demonstrate that OOD detection effectively identifies samples for which the pretrained model's predictions are unreliable, and that removing these samples improves performance in the validation setting.
AI models need a clearly defined scope of use in terms of input data. This is especially true when moving the AI model from the development stage into a real-world setting. For example, many AI models are developed using data from clinical trials. In general, clinical trial data are generated from highly controlled environments, with carefully selected, relatively homogeneous patient populations. These trials have predetermined durations and predefined endpoints. In contrast, real-world data (RWD) are collected under routine clinical conditions, often in an uncontrolled setting, from patients with diverse demographics, varying disease presentations, and different treatment approaches. The heterogeneity in RWD can present challenges, as AI models may struggle to generalize to broader, more diverse populations. Unlike clinical trials, RWD might not have predefined endpoints for data collection, and it may introduce more variability and potential biases. These discrepancies between clinical trial data and real-world data can lead to suboptimal performance when AI models are applied beyond the scope of their original training.
Given the high dimensionality of AI model input data, defining the scope of use can become complex. OOD detection can play a valuable role in addressing this challenge. By incorporating OOD detection as a pre-screening step, clinicians can better evaluate whether AI models are suitable for a specific patient, enhancing the safety and effectiveness of AI in medicine. This approach is crucial for reducing the risk of incorrect diagnoses or treatments and ensuring AI is used responsibly in healthcare settings.
However, OOD detection can be challenging, and current methods still have limitations. In most datasets, there are no explicit labels indicating whether data is OOD. As a result, OOD models are often trained to detect when input data follows a distribution different from the training data, introducing an element of randomness into the detection process. Combining multiple OOD detection models with different random seeds or using models that rely on various mechanisms—such as K-nearest neighbors, isolation forest, Bayesian neural networks, variational autoencoders (VAEs), or contrastive learning—can improve the specificity of OOD detection. More research is needed to further improve the OOD detection methods.
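As one illustration of combining mechanisms, the sketch below votes across isolation forests initialized with different random seeds plus a kNN-density-based local outlier factor detector; the particular components and the majority-vote rule are illustrative choices rather than recommendations from the text.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

def ensemble_ood_flags(X_train, X_test, seeds=(0, 1, 2)):
    """Flag a sample as OOD when a majority of heterogeneous detectors agree."""
    votes = []
    for seed in seeds:
        iso = IsolationForest(random_state=seed).fit(X_train)
        votes.append(iso.predict(X_test) == -1)           # -1 marks predicted outliers
    lof = LocalOutlierFactor(novelty=True).fit(X_train)   # kNN-density-based detector
    votes.append(lof.predict(X_test) == -1)
    return np.mean(votes, axis=0) >= 0.5                  # majority vote across detectors
```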
It is important to note that the OOD approach could incentivize greater inclusivity and diversity in clinical trials and training data. Generally speaking, the more inclusive and diverse the clinical trials and training data are, the broader the scope of use will be for the AI model, as fewer patients will be classified as OOD during the model deployment phase. In addition, the OOD approach can help identify gaps in data inclusivity and offer valuable insights for improving future data collection and enhancing the diversity in clinical trial datasets.
Looking ahead, the integration of OOD detection into medical AI systems can be an important step toward responsible AI deployment. By explicitly addressing the limitations of our training data and our model capabilities, we can build more trustworthy AI systems that align with the rigorous standards of medical practice and the fundamental principle of “first, do no harm.”
The authors declare no conflicts of interest.
This article reflects the views of the authors and should not be construed to represent the FDA's views or policies. As an Associate Editor for Clinical and Translational Science, Qi Liu was not involved in the review or decision process for this paper.