Artificial intelligence in the life sciences: latest articles (page 2)

LAGOM: A transformer-based chemical language model for drug metabolite prediction

IF 5.4

Artificial intelligence in the life sciences

Pub Date : 2025-09-17 DOI: 10.1016/j.ailsci.2025.100142

Sofia Larsson , Miranda Carlsson , Richard Beckmann , Filip Miljković , Rocío Mercado

Metabolite identification studies are an essential but costly and time-consuming component of drug development. Computational methods have the potential to accelerate early-stage drug discovery, and recent advances in deep learning offer new opportunities to speed up metabolite prediction in particular. We present LAGOM (Language-model Assisted Generation Of Metabolites), a Transformer-based approach built upon the Chemformer architecture and designed to predict likely metabolic transformations of drug candidates. Our results show that LAGOM performs competitively with, and in some cases surpasses, existing state-of-the-art metabolite prediction tools, demonstrating the potential of language-model-based architectures in chemoinformatics. By integrating diverse data sources and employing data augmentation strategies, we further improve the model’s generalisation and predictive accuracy. The implementation of LAGOM is publicly available at github.com/tsofiac/LAGOM.

Citations: 0

MegaEye: Applying multiple machine learning approaches to identify oral compounds with ocular bioactivity

IF 5.4

Artificial intelligence in the life sciences

Pub Date : 2025-09-15 DOI: 10.1016/j.ailsci.2025.100143

Fabio Urbina , Scott H. Greenwald , Patricia A. Vignaux , Thomas R. Lane , Joshua S. Harris , Mayssa Attar , Keith Luhrs , Sean Ekins

The eye is a complex organ with the critical role of mediating the optical and initial signal processing steps of vision. As such, the eye has multiple physiological and dynamic barriers to protect ocular tissues and compartments. Oral administration of pharmacological agents to treat ocular diseases has often failed to demonstrate efficacy in clinical trials. The ability of a molecule to reach a specific target in the eye (e.g. cells in the anterior versus posterior segment) is largely determined by whether its physicochemical properties permit passage across the various ocular barriers (e.g. cornea, sclera, tear dilution, blood-retinal barrier, lymphatic outflow) that are relevant to the route of administration and the target location. The use of machine learning to predict the ocular bioactivity of molecules is underexplored. We now describe the curation of several datasets, generated by a wide array of computational approaches, that are used to identify drugs predicted to reach the eye following oral delivery. These datasets included simple molecular properties (e.g. molecular weight), the blood-brain barrier MPO score, and machine learning models built on transporter and other relevant literature datasets as a proxy for the blood-retinal barrier. FDA-approved drugs with reported ocular activity were used to validate the models’ ability to identify additional molecules not in the models. Finally, we used a large language model to rank over 400,000 natural compounds by potential activity in the eye. In summary, we illustrate machine learning model applications that can be expanded in the future to repurpose molecules for ocular applications.

Volume 8, Article 100143

Citations: 0

From task-specific language models to research agents

IF 5.4

Artificial intelligence in the life sciences

Pub Date : 2025-08-26 DOI: 10.1016/j.ailsci.2025.100141

Jürgen Bajorath

Citations: 0

A crossover-enhanced Marine Predators Algorithm for gene selection in microarray-based cancer classification

IF 5.4

Artificial intelligence in the life sciences

Pub Date : 2025-08-23 DOI: 10.1016/j.ailsci.2025.100140

Sharif Naser Makhadmeh , Yousef Sanjalawe , Mohammed Azmi Al-Betar , Ahmad Nasayreh , Mohammad Aladaileh

The DNA microarray technique uses a chip embedded with numerous DNA sequences to estimate the expression of many genes simultaneously. These data, laid out in table format, are vital for pattern recognition algorithms that distinguish samples of healthy individuals from samples of individuals with cancer. However, identifying useful biomarkers within microarray data is challenging because of its vast dimensionality and the inclusion of noisy, irrelevant genes. To address these challenges, this paper introduces a gene selection method that combines a robust filter, minimum redundancy maximum relevance (mRMR), with a novel hybrid optimization algorithm. This algorithm integrates the improved Marine Predators Algorithm (MPA) with a crossover operator to form the MPAC method. MPAC specifically aims to identify a concise set of biomarker genes that substantially improves cancer classification performance, and it employs the k-nearest neighbor algorithm for the classification tasks. The innovation in MPAC lies in its ability to significantly enhance the performance of the MPA’s search agents. It seeks the most effective gene subsets for cancer biomarkers and is designed to balance the depth (exploitation) and breadth (exploration) of the search. The effectiveness of this hybrid approach is tested on nine well-known microarray datasets, and its performance is compared against other base and advanced optimization algorithms. These comparisons show that the proposed MPAC approach excels on most of the datasets and remains highly competitive on the others.
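
As a rough illustration of two ingredients named in the abstract, a crossover operator acting on gene-selection masks and a k-nearest neighbor classifier used to score a gene subset, the following Python sketch may help. It is not the authors' MPAC implementation: the function names, the uniform crossover variant, the leave-one-out scoring, and the toy data are all assumptions made for illustration, and the mRMR filter and the Marine Predators search itself are omitted.

```python
import math
import random


def uniform_crossover(parent_a, parent_b, rng):
    """Build a child mask by copying each bit from one of the two parents.

    Hypothetical helper: the abstract does not say which crossover variant
    MPAC uses, so uniform crossover is assumed here.
    """
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


def loo_knn_accuracy(samples, labels, mask, k=1):
    """Leave-one-out k-NN accuracy using only the genes switched on in mask.

    Assumed fitness function: a gene subset scores well when the samples
    are classified correctly from those genes alone.
    """
    selected = [i for i, keep in enumerate(mask) if keep]
    if not selected:
        return 0.0  # an empty gene subset cannot classify anything
    correct = 0
    for i, query in enumerate(samples):
        q = [query[g] for g in selected]
        # Distance from the held-out sample to every other sample.
        neighbours = sorted(
            (math.dist(q, [other[g] for g in selected]), labels[j])
            for j, other in enumerate(samples)
            if j != i
        )
        votes = [label for _, label in neighbours[:k]]
        predicted = max(set(votes), key=votes.count)
        correct += predicted == labels[i]
    return correct / len(samples)


# Toy data (made up): gene 0 separates the classes, gene 1 is noise.
toy_samples = [[0.1, 5.0], [0.2, 1.0], [0.9, 4.9], [1.0, 1.1]]
toy_labels = [0, 0, 1, 1]

rng = random.Random(0)
child = uniform_crossover([1, 0], [0, 1], rng)
print("child mask:", child)
print("fitness with gene 0 only:", loo_knn_accuracy(toy_samples, toy_labels, [1, 0]))
print("fitness with gene 1 only:", loo_knn_accuracy(toy_samples, toy_labels, [0, 1]))
```

In a full optimizer, a fitness of this kind would typically also penalise large gene subsets so that the search favours concise biomarker sets; that weighting is left out of the sketch.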

Volume 8, Article 100140

Citations: 0

A machine learning framework for the prediction and analysis of bacterial antagonism in biofilms using morphological descriptors

IF 5.4

Artificial intelligence in the life sciences

Pub Date : 2025-08-20 DOI: 10.1016/j.ailsci.2025.100137

Raphaël Rubrice , Virgile Gueneau , Romain Briandet , Antoine Cornuejols , Vincent Guigue

Biofilms are structured microbial communities that promote cell interactions through close spatial organization, leading to cooperative or competitive behaviors. Predicting microbial interactions in biofilms could aid in developing innovative strategies to prevent the colonization of undesirable bacteria. Here, we present a machine learning approach to predict the antagonistic effects of candidate beneficial bacteria (Bacillus and Paenibacillus species) against undesirable bacteria (Staphylococcus aureus, Enterococcus cecorum, Escherichia coli and Salmonella enterica), based on the morphological descriptors of single-species biofilms. We trained the models using quantitative features (e.g. biofilm volume, thickness, roughness, or substratum coverage). As a proxy for antagonism, an exclusion score was used as the supervised training target. This score was calculated from the ratio of biofilm volume between the undesirable bacteria and the beneficial strain. We subsequently applied diverse explainability methods to analyze the resulting model, and the results highlight the importance of the biofilm formation context when predicting antagonism. Our results demonstrate that machine learning offers an efficient, data-driven tool to predict microbial interactions within biofilms and to support the selection of competitive beneficial strains against pathogens. This approach enables scalable screening of microbial interactions, making it applicable to both research and biotechnology.

Volume 8, Article 100137

Citations: 0

A dystopian nightmare for science and how to survive it

IF 5.4

Artificial intelligence in the life sciences

Pub Date : 2025-08-05 DOI: 10.1016/j.ailsci.2025.100139

Sean Ekins

"Let me assert my firm belief that the only thing we have to fear is fear itself." (Franklin D. Roosevelt, FDR). This message of hope was written by FDR for his inaugural address in 1933, amid the Great Depression. One could argue that we are now witnessing a great depression in science in the USA.

Citations: 0

Unraveling the co-morbidity between COVID-19 and neurodegenerative diseases through multi-scale graph analysis: A systematic investigation of biological databases and text mining

IF 5.4

Artificial intelligence in the life sciences

Pub Date : 2025-07-28 DOI: 10.1016/j.ailsci.2025.100138

Negin Sadat Babaiha , Stefan Geissler , Vincent Nibart , Heval Atas Güvenilir , Vinay Srinivas Bharadhwaj , Alpha Tom Kodamullil , Juergen Klein , Marc Jacobs , Martin Hofmann-Apitius

The COVID-19 pandemic has generated a vast volume of research, yet much of it focuses on individual diseases, overlooking complex comorbidity relationships. While extensive literature exists on both neurodegenerative diseases (NDDs), such as Alzheimer’s and Parkinson’s, and COVID-19, their intersection remains underexplored. Co-morbidity modeling is crucial, particularly for hospitalized patients often presenting with multiple conditions. This study investigates the interplay between COVID-19 and NDDs by integrating knowledge graphs (KGs) built from curated biomedical datasets and text mining tools. We performed comprehensive analyses—including path analysis, phenotype coverage, and mapping of cellular and genetic factors—across multiple KGs, such as PrimeKG, DrugBank, OpenTargets, and those generated via natural language processing (NLP) methods. Our findings reveal notable variability in graph density and connectivity, with each KG offering unique insights into molecular and phenotypic links between COVID-19 and NDDs. Key genetic and inflammatory markers, especially immune response genes, consistently appeared across graphs, suggesting a shared pathogenic basis. By unifying structured biological data with unstructured textual evidence, we enhance co-morbidity modeling and improve recall in identifying mechanisms underlying COVID-19–NDD interactions. This integrative framework supports the development of a co-morbidity hypothesis database aimed at facilitating therapeutic target discovery. All data, methods, and instructions for accessing the co-morbidity hypothesis database are publicly available at: https://github.com/SCAI-BIO/covid-NDD-comorbidity-NLP.

Volume 8, Article 100138

Citations: 0

Temporal distribution shift in real-world pharmaceutical data: Implications for uncertainty quantification in QSAR models

Artificial intelligence in the life sciences

Pub Date : 2025-07-10 DOI: 10.1016/j.ailsci.2025.100132

Hannah Rosa Friesacher , Emma Svensson , Susanne Winiwarter , Lewis Mervin , Adam Arany , Ola Engkvist

The estimation of uncertainties associated with predictions from quantitative structure–activity relationship (QSAR) models can accelerate the drug discovery process by identifying promising experiments and allowing an efficient allocation of resources. Several computational tools exist that estimate the predictive uncertainty in machine learning models. However, deviations from the i.i.d. setting have been shown to impair the performance of these uncertainty quantification methods. We use a real-world pharmaceutical dataset to address the pressing need for a comprehensive, large-scale evaluation of uncertainty quantification approaches in the context of realistic distribution shifts over time. We investigate the performance of several popular uncertainty estimation methods for classification models, including ensemble-based and Bayesian approaches. Furthermore, we use this real-world setting to systematically assess the distribution shifts in label and descriptor space and their impact on the capability of the uncertainty quantification methods. Our study reveals significant shifts over time in both label and descriptor space and a clear connection between the magnitude of the shift and the nature of the assay. Moreover, we show that pronounced distribution shifts impair the performance of popular uncertainty quantification methods used in QSAR models. This work highlights the challenges of identifying uncertainty quantification techniques that remain reliable under distribution shifts introduced by real-world data.
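
To make the idea of ensemble-based uncertainty estimation for a classifier more concrete, here is a minimal Python sketch. It is not taken from the paper: it assumes a binary QSAR classifier whose ensemble members each output a probability of activity, and it uses the common entropy-based split of total predictive uncertainty into an aleatoric part (average member entropy) and an epistemic part (disagreement between members). The function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli distribution with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def ensemble_uncertainty(member_probs):
    """Summarise the predictions of an ensemble for a single compound.

    member_probs: one predicted probability of the active class per
    ensemble member (assumed input format).
    """
    mean_p = sum(member_probs) / len(member_probs)
    total = binary_entropy(mean_p)  # entropy of the averaged prediction
    aleatoric = sum(binary_entropy(p) for p in member_probs) / len(member_probs)
    return {
        "mean": mean_p,
        "total": total,
        "aleatoric": aleatoric,
        "epistemic": total - aleatoric,  # mutual information between members and label
    }


# Members agree that the label is a coin flip: uncertainty is all aleatoric.
print(ensemble_uncertainty([0.5, 0.5]))
# Members are individually confident but contradict each other: all epistemic.
print(ensemble_uncertainty([0.0, 1.0]))
```

Under a temporal distribution shift of the kind studied in the abstract, one would hope to see the epistemic term grow for compounds that differ from the training data, although whether a given method actually behaves this way is exactly what such an evaluation has to check.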

Volume 8, Article 100132

Citations: 0

Artificial intelligence methods and approaches to improve data quality in healthcare data

Artificial intelligence in the life sciences

Pub Date : 2025-07-04 DOI: 10.1016/j.ailsci.2025.100135

Jarmakoviča Agate

This study explores artificial intelligence (AI) methods and approaches used to improve data quality, with a particular focus on healthcare data. Applying a systematic literature review based on the PRISMA framework, the research examines publications from 2020 to 2025 that analyze AI applications across key data quality dimensions—accuracy, completeness, consistency, timeliness, uniqueness, and validity. The study aims to identify which AI methods are most commonly employed and how they align with these quality attributes. A conceptual map was developed to visualize the relationships between dimensions and AI techniques such as deep learning, federated learning, data-centric AI, and ontology-based data governance. Findings reveal that accuracy and consistency are the most emphasized dimensions in the literature, with methods like supervised learning, NLP, and isolation forest frequently applied. In contrast, dimensions like timeliness and validity receive comparatively limited attention. The study concludes that certain AI methods—particularly data-centric and cross-cutting approaches—are effective in addressing multiple data quality challenges simultaneously. These insights offer practical guidance for selecting AI strategies in healthcare data quality improvement and highlight areas for future research.
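
Two of the data quality dimensions listed in the abstract, completeness and uniqueness, can be measured directly before any AI method is applied. The Python sketch below shows one simple way to do so on a list of records. The record layout, field names, and the exact definitions (share of non-empty cells, share of distinct keys) are assumptions chosen for illustration and are not taken from the reviewed literature.

```python
def completeness(records, fields):
    """Share of (record, field) cells that hold a non-empty value."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    filled = sum(
        1
        for record in records
        for field in fields
        if record.get(field) not in (None, "")
    )
    return filled / total


def uniqueness(records, key_fields):
    """Share of records whose key is distinct (1.0 means no duplicate keys)."""
    if not records:
        return 1.0
    keys = {tuple(record.get(f) for f in key_fields) for record in records}
    return len(keys) / len(records)


# Hypothetical patient records with one duplicate and two missing values.
patients = [
    {"id": 1, "dob": "1980-01-01"},
    {"id": 1, "dob": "1980-01-01"},
    {"id": 2, "dob": ""},
    {"id": 3},
]

print("completeness:", completeness(patients, ["id", "dob"]))
print("uniqueness:", uniqueness(patients, ["id"]))
```

Scores like these give a baseline against which an AI-based cleaning step, such as imputation or duplicate detection, can be compared.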

Volume 8, Article 100135

Citations: 0

Co-folding, the future of docking – prediction of allosteric and orthosteric ligands

Artificial intelligence in the life sciences

Pub Date : 2025-06-29 DOI: 10.1016/j.ailsci.2025.100136

Eva Nittinger , Özge Yoluk , Alessandro Tibo , Gustav Olanders , Christian Tyrchan

In drug discovery, understanding protein structures is essential for comprehending their functions and interactions with drugs. Traditional methods such as X-ray crystallography and cryo-electron microscopy have been used to solve these structures. Recently, computational biology has seen a breakthrough with deep learning algorithms capable of predicting protein structures from amino acid sequences. These methods have now evolved into co-folding methods, which predict protein-ligand interactions from sequence. Despite great advances in the field during the last year, open challenges remain. Here, we focus on the prediction of allosteric binding sites, using a dataset of 17 orthosteric/allosteric ligand sets. Three different co-folding methods (NeuralPLexer, RoseTTAFold All-Atom and Boltz-1/Boltz-1x) were used to predict both allosteric and orthosteric ligands. Ligand quality was checked with PoseBusters: >90% of ligands predicted by Boltz-1x passed the default quality criteria, whereas Boltz-1, NeuralPLexer and RoseTTAFold All-Atom still showed substantial quality drawbacks. The orthosteric ligands were well placed. However, these deep learning approaches generally favor the orthosteric site, which is the one most represented in the training data, over the allosteric pocket.
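
The pass-rate figure quoted in the abstract can be illustrated with a minimal sketch. It assumes each predicted pose has already been reduced to a set of boolean check outcomes, and it counts a pose as passing only if every check is true. The function name `pass_rate`, the check names and the example values are hypothetical; they are not taken from the paper and are not the PoseBusters API or its actual check list.

```python
from typing import Dict, List


def pass_rate(results: List[Dict[str, bool]]) -> float:
    """Return the fraction of poses for which every quality check is True."""
    if not results:
        raise ValueError("no poses to evaluate")
    # A pose passes only if all of its individual checks passed.
    passed = sum(1 for checks in results if all(checks.values()))
    return passed / len(results)


# Hypothetical check outcomes for four predicted poses (illustrative only;
# the check names are placeholders, not PoseBusters' real test names).
example = [
    {"bond_lengths": True, "bond_angles": True, "no_clashes": True},
    {"bond_lengths": True, "bond_angles": False, "no_clashes": True},
    {"bond_lengths": True, "bond_angles": True, "no_clashes": True},
    {"bond_lengths": True, "bond_angles": True, "no_clashes": True},
]

print(f"{pass_rate(example):.0%} of poses pass all checks")
```

With these made-up values, three of the four poses pass every check, so the script reports 75%. Requiring all checks to pass is the strict reading of "passing the default quality criteria"; a per-check breakdown would need a different aggregation.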



Citations: 0