
Artificial intelligence in the life sciences: latest articles

Deep metric learning for the classification of MALDI-TOF spectral signatures from multiple species of neotropical disease vectors
Pub Date : 2023-04-06 DOI: 10.1016/j.ailsci.2023.100071
Fernando Merchan , Kenji Contreras , Rolando A. Gittens , Jose R. Loaiza , Javier E. Sanchez-Galan

Deep Learning techniques have significant advantages for mass spectral classification, such as parallelized signal correction and feature extraction. Deep Metric Learning models combine Metric Learning, which determines the degree of similarity or difference between a set of mass spectra, with the generalization power of Deep Learning to improve feature extraction even further. The two most popular of these models combine multiple neural networks with identical architectures and are commonly called Siamese (SNN) and Triplet Neural Networks (TNN). Herein, using both SNNs and TNNs, we aimed to taxonomically categorize two sets of previously validated mass spectra that corresponded to 30 species of Neotropical arthropods in the Culicidae and Ixodidae families, some of which are disease vectors. SNNs and TNNs were highly effective at correctly classifying 826 spectra from 12 mosquito species and 310 spectra from 18 species of hard ticks, with both algorithms showing minimal average loss during cross-validation. SNNs produced accuracy rates for ticks and mosquitoes of 91.22% and 94.46%, respectively, while accuracy rates of 93% and 99% were obtained with TNNs. Our results indicate that Deep Metric Learning is a practical machine learning tool for quickly and precisely classifying MALDI-TOF-generated mass spectra of Neotropical and public-health-relevant arthropod species.
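To make the metric-learning objective behind a triplet network concrete, the sketch below computes a standard triplet margin loss on toy embedding vectors. The embeddings, the margin value, and the function names are assumptions for illustration only; the abstract does not state which loss or architecture the authors used.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy 2-D embeddings (hypothetical values, not real spectra).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # stands in for a spectrum of the same species
negative = [3.0, 4.0]   # stands in for a spectrum of a different species

# The positive is already closer than the negative by more than the margin.
loss_easy = triplet_loss(anchor, positive, negative)
# Swapping the roles violates the margin and yields a positive loss.
loss_hard = triplet_loss(anchor, negative, positive)
```

A Siamese network would instead score pairs of embeddings; the distance function can stay the same while the loss changes.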

Journal : Artificial intelligence in the life sciences, Volume 3, Article 100071
Citations: 2
Conformal efficiency as a metric for comparative model assessment befitting federated learning
Pub Date : 2023-04-01 DOI: 10.1016/j.ailsci.2023.100070
Wouter Heyndrickx , Adam Arany , Jaak Simm , Anastasia Pentina , Noé Sturm , Lina Humbeck , Lewis Mervin , Adam Zalewski , Martijn Oldenhof , Peter Schmidtke , Lukas Friedrich , Regis Loeb , Arina Afanasyeva , Ansgar Schuffenhauer , Yves Moreau , Hugo Ceulemans

In a drug discovery setting, pharmaceutical companies own substantial but confidential datasets. The MELLODDY project developed a privacy-preserving federated machine learning solution and deployed it at an unprecedented scale. Each partner built models for their own private assays that benefitted from a shared representation. Established predictive performance metrics such as AUC ROC or AUC PR are constrained to unseen labeled chemical space and cannot gauge performance gains in unlabeled chemical space. Federated learning indirectly extends labeled space, but in a privacy-preserving context, a partner cannot use this label extension for performance assessment. Metrics that estimate uncertainty on a prediction can be calculated even where no label is known. Practically, the chemical space covered with predictions above an uncertainty threshold reflects the applicability domain of a model. After establishing a link to established performance metrics, we propose the efficiency from the conformal prediction framework (‘conformal efficiency’) as a proxy for the applicability domain size. A documented extension of the applicability domain would qualify as a tangible benefit from federated learning. In interim assessments, MELLODDY partners reported a median increase in conformal efficiency of the federated over the single-partner model of 5.5% (with increases up to 9.7%). Subject to distributional conditions, that efficiency increase can be directly interpreted as the expected increase in conformal, i.e. low-uncertainty, predictions. In conclusion, we present the first indication that privacy-preserving federated machine learning across massive drug-discovery datasets from ten pharma partners indeed extends the applicability domain of property prediction models.
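As a minimal sketch of how an efficiency figure can be computed without test labels, the code below uses one common definition for binary conformal classification: the fraction of compounds whose prediction set contains exactly one label. The calibration scores, the significance level, and all names are hypothetical, and the abstract does not confirm that this is the exact definition used in MELLODDY.

```python
def p_value(calibration_scores, test_score):
    """Conformal p-value: share of calibration nonconformity scores
    at least as large as the test score (with the +1 correction)."""
    n_at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (n_at_least + 1) / (len(calibration_scores) + 1)

def prediction_set(cal_by_class, test_scores, epsilon):
    """Labels whose p-value exceeds the significance level epsilon."""
    return {label for label, score in test_scores.items()
            if p_value(cal_by_class[label], score) > epsilon}

def conformal_efficiency(cal_by_class, test_items, epsilon=0.2):
    """Fraction of test items that receive exactly one label."""
    singles = sum(1 for scores in test_items
                  if len(prediction_set(cal_by_class, scores, epsilon)) == 1)
    return singles / len(test_items)

# Hypothetical per-class calibration nonconformity scores.
cal_by_class = {
    "active": [0.1, 0.2, 0.3, 0.4],
    "inactive": [0.1, 0.2, 0.3, 0.4],
}
# Each unlabeled test compound carries one nonconformity score per candidate label.
test_items = [
    {"active": 0.05, "inactive": 0.90},  # only "active" is kept
    {"active": 0.25, "inactive": 0.25},  # both labels are kept
    {"active": 0.90, "inactive": 0.90},  # neither label is kept
    {"active": 0.95, "inactive": 0.15},  # only "inactive" is kept
]
efficiency = conformal_efficiency(cal_by_class, test_items)
```

Because no true labels enter the calculation, the same function could be run on unlabeled chemical space, which is the property the abstract relies on.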

Journal : Artificial intelligence in the life sciences, Volume 3, Article 100070
Citations: 2
Pharmaceutical patent landscaping: A novel approach to understand patents from the drug discovery perspective
Pub Date : 2023-03-31 DOI: 10.1016/j.ailsci.2023.100069
Yojana Gadiya , Philip Gribbon , Martin Hofmann-Apitius , Andrea Zaliani

Patents play a crucial role in the drug discovery process by providing legal protection for discoveries and incentivising investments in research and development. By identifying patterns within patent data resources, researchers can gain insight into the market trends and priorities of the pharmaceutical and biotechnology industries, as well as provide additional perspectives on more fundamental aspects such as the emergence of potential new drug targets. In this paper, we used the patent enrichment tool, PEMT, to extract, integrate, and analyse patent literature for rare diseases (RD) and Alzheimer's disease (AD). This is followed by a systematic review of the underlying patent landscape to decipher trends and applications in patents for these diseases. To do so, we discuss prominent organisations involved in drug discovery research in AD and RD. This allows us to gain an understanding of the importance of AD and RD from specific organisational (pharmaceutical or university) perspectives. Next, we analyse the historical focus of patents in relation to individual therapeutic targets and correlate them with market scenarios, allowing the identification of prominent targets for a disease. Lastly, we identified drug repurposing activities within the two diseases with the help of patents. This resulted in identifying existing repurposed drugs and novel potential therapeutic approaches applicable to the indication areas. The study demonstrates the expanded applicability of patent documents from legal uses to drug discovery, design, and research, thus providing a valuable resource for future drug discovery efforts. Moreover, this study is an attempt to understand the importance of the data underlying patent documents and highlights the need to prepare these data for machine learning-based applications.

Journal : Artificial intelligence in the life sciences, Volume 3, Article 100069
Citations: 0
Elucidating dynamic cell lineages and gene networks in time-course single cell differentiation
Pub Date : 2023-03-25 DOI: 10.1016/j.ailsci.2023.100068
Mengrui Zhang , Yongkai Chen , Dingyi Yu , Wenxuan Zhong , Jingyi Zhang , Ping Ma

Single cell RNA sequencing (scRNA-seq) technologies provide researchers with an unprecedented opportunity to exploit cell heterogeneity. For example, the sequenced cells belong to various cell lineages, which may have different cell fates in stem and progenitor cells. Those cells may differentiate into various mature cell types in a cell differentiation process. To trace the behavior of cell differentiation, researchers reconstruct cell lineages and predict cell fates by ordering cells chronologically into a trajectory with a pseudo-time. However, in scRNA-seq experiments, there are no cell-to-cell correspondences across time points from which to reconstruct the cell lineages, which creates a significant challenge for cell lineage tracing and cell fate prediction. Therefore, methods that can accurately reconstruct the dynamic cell lineages and predict cell fates are highly desirable.

In this article, we develop an innovative machine-learning framework called Cell Smoothing Transformation (CellST) to elucidate the dynamic cell fate paths and construct gene networks in cell differentiation processes. Unlike the existing methods that construct one single bulk cell trajectory, CellST builds cell trajectories and tracks behaviors for each individual cell. Additionally, CellST can predict cell fates even for less frequent cell types. Based on the individual cell fate trajectories, CellST can further construct dynamic gene networks to model gene-gene relationships along the cell differentiation process and discover critical genes that potentially regulate cells into various mature cell types.

Journal : Artificial intelligence in the life sciences, Volume 3, Article 100068
Open access PDF : https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10328540/pdf/
Citations: 0
Data science and data analytics in life science research
Pub Date : 2023-02-27 DOI: 10.1016/j.ailsci.2023.100067
Jürgen Bajorath
Journal : Artificial intelligence in the life sciences, Volume 3, Article 100067
Citations: 1
Natural products subsets: Generation and characterization
Pub Date : 2023-02-26 DOI: 10.1016/j.ailsci.2023.100066
Ana L. Chávez-Hernández, José L. Medina-Franco

Natural products are attractive for drug discovery applications because of their distinctive chemical structures, such as an overall large fraction of sp3 carbon atoms, chiral centers (both features associated with structural complexity), large chemical scaffolds, and diversity of functional groups. Furthermore, natural products are used in de novo design and have inspired the development of pseudo-natural products using generative models. Public databases such as the Collection of Open NatUral ProdUcTs and the Universal Natural Product database (UNPD) are rich sources of structures to be used in generative models and other applications. In this work, we report the selection and characterization of the most diverse compounds of natural products from the UNPD using the MaxMin algorithm. The subsets generated with 14,994, 7,497, and 4,998 compounds are publicly available at https://github.com/DIFACQUIM/Natural-products-subsets-generation. We anticipate that the subsets will be particularly useful in building generative models based on natural products by research groups, particularly those with limited access to extensive supercomputer resources.
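The MaxMin selection named above can be sketched in a few lines: starting from a seed compound, repeatedly add the candidate whose minimum distance to the already selected compounds is largest. The toy fingerprints (sets of on-bit indices), the Tanimoto distance, and the seed choice are assumptions for illustration; the abstract does not describe the fingerprints or the implementation the authors used.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for fingerprints stored as sets of on-bit indices."""
    union = len(a | b)
    similarity = len(a & b) / union if union else 1.0
    return 1.0 - similarity

def maxmin_pick(fingerprints, n_pick, seed_index=0):
    """Greedy MaxMin diversity selection; returns indices in pick order."""
    picked = [seed_index]
    # Minimum distance from every compound to the current picked set.
    min_dist = [tanimoto_distance(fp, fingerprints[seed_index]) for fp in fingerprints]
    while len(picked) < min(n_pick, len(fingerprints)):
        candidates = (i for i in range(len(fingerprints)) if i not in picked)
        # Ties are resolved in favour of the lowest index.
        next_index = max(candidates, key=lambda i: min_dist[i])
        picked.append(next_index)
        for i, fp in enumerate(fingerprints):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fingerprints[next_index]))
    return picked

# Hypothetical fingerprints for five compounds.
fingerprints = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}, {7, 8, 10}]
diverse_subset = maxmin_pick(fingerprints, n_pick=3)
```

Caching the minimum distances keeps each round linear in the number of compounds, which matters when the source database is large.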

Journal : Artificial intelligence in the life sciences, Volume 3, Article 100066
Citations: 2
An improved 3D quantitative structure-activity relationships (QSAR) of molecules with CNN-based partial least squares model
Pub Date : 2023-02-24 DOI: 10.1016/j.ailsci.2023.100065
Xuxiang Huo , Jun Xu , Mingyuan Xu , Hongming Chen

Ligand-based virtual screening plays an important role in cases where protein structures are not available. Among ligand-based methods, accurate and fast prediction of protein-ligand binding affinity is crucial for reducing computational cost and exploring the chemical search space efficiently. Here we propose a CNN-based method, termed L3D-PLS, for building quantitative structure-activity relationships without target structures. In L3D-PLS, a CNN module was designed to extract the key interaction features from the grids around aligned ligands, and a partial least squares (PLS) model fits the binding affinity to the features extracted by the pre-trained CNN module. On 30 publicly available pre-aligned molecular datasets, L3D-PLS outperformed the traditional CoMFA method. These results highlight that L3D-PLS can be useful for lead optimization based on small datasets, which is often the situation in drug discovery campaigns.

Journal : Artificial intelligence in the life sciences, Volume 3, Article 100065
Citations: 1
Combining molecular and cell painting image data for mechanism of action prediction
Pub Date : 2023-02-17 DOI: 10.1016/j.ailsci.2023.100060
Guangyan Tian , Philip J Harrison , Akshai P Sreenivasan , Jordi Carreras-Puigvert , Ola Spjuth

The mechanism of action (MoA) of a compound describes the biological interaction through which it produces a pharmacological effect. Multiple data sources can be used to predict MoA, including compound structural information and various assays, such as those based on cell morphology, transcriptomics and metabolomics. In the present study we explored the benefits and potential additive/synergistic effects of combining structural information, in the form of Morgan fingerprints, and morphological information, in the form of five-channel Cell Painting image data. For a set of 10 well-represented MoA classes, we compared the performance of deep learning models trained on the two datasets separately versus a model trained on both datasets simultaneously. On a held-out test set we obtained a macro-averaged F1 score of 0.58 when training on only the structural data, 0.81 when training on only the image data, and 0.92 when training on both together. This indicates clear additive/synergistic effects and highlights the benefit of integrating multiple data sources for MoA prediction.
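For readers unfamiliar with the metric reported above, the sketch below computes a macro-averaged F1 score: the per-class F1 is calculated for each class and the unweighted mean is taken, so rare classes count as much as frequent ones. The class labels and predictions are made-up toy values, not data from the study.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy labels for three hypothetical MoA classes.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
score = macro_f1(y_true, y_pred)  # per-class F1: a = 0.5, b = 0.8, c = 2/3
```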

Journal : Artificial intelligence in the life sciences, Volume 3, Article 100060
Citations: 0
AI4DR: Development and implementation of an annotation system for high-throughput dose-response experiments
Pub Date : 2023-02-06 DOI: 10.1016/j.ailsci.2023.100063
Marc Bianciotto, Lionel Colliandre, Kun Mi, Isabelle Schreiber, Cécile Delorme, Stéphanie Vougier, Hervé Minoux

A common strategy for identifying novel chemical matter in drug discovery is high-throughput screening (HTS). However, the large amount of data generated at the dose-response (DR) step of an HTS campaign requires careful analysis to detect artifacts and correct erroneous data points before the experiments are validated. This step, which requires reviewing each DR experiment, can be time-consuming and prone to human error or inconsistency. AI4DR is a system developed to classify DR curves using a Convolutional Neural Network (CNN) that acts on normalized images of the curves. AI4DR annotates thousands of curves in minutes, assigning each to one of 14 categories, to help HTS biologists in their analyses. Several categories are associated with active and inactive compounds; the others correspond to features of interest such as the presence of noise, a weaker effect at high doses, or a suspiciously weak or strong slope at the inflection point of the DR curves of actives. The classifier was trained on an algorithmically generated dataset curated and refined by experts, tested on real screening campaigns, and improved using thousands of expert annotations. The solution is deployed using an MLflow model server interfaced with the Genedata Screener data analysis software used by the end users. AI4DR improves the consistency, robustness, and speed of HTS data analysis and reduces the human effort required, helping to identify new medicines for patients faster.
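The abstract does not say how DR curves become "normalized images", so the following is only a rough sketch of one plausible reading: each curve is min-max scaled on both axes and drawn onto a fixed-size pixel grid that a CNN could take as input. The four-parameter Hill model, the 32 x 32 grid, and the function names `hill` and `rasterize` are illustrative assumptions, not details of AI4DR.

```python
import math

def hill(dose, bottom, top, ec50, slope):
    """Four-parameter logistic response (assumed curve model, not from the paper)."""
    return bottom + (top - bottom) / (1.0 + (ec50 / dose) ** slope)

def rasterize(doses, responses, size=32):
    """Draw (log-dose, response) points on a size x size grid of 0/1 pixels.

    Both axes are min-max scaled so that every curve fills the same frame.
    """
    xs = [math.log10(d) for d in doses]
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(responses), max(responses)
    x_span = (x_hi - x_lo) or 1.0  # avoid division by zero for flat input
    y_span = (y_hi - y_lo) or 1.0
    image = [[0] * size for _ in range(size)]
    for x, y in zip(xs, responses):
        col = round((x - x_lo) / x_span * (size - 1))
        row = round((1.0 - (y - y_lo) / y_span) * (size - 1))  # row 0 is the top
        image[row][col] = 1
    return image

# Ten hypothetical doses with half-log spacing and a synthetic active curve.
doses = [10 ** (-9 + 0.5 * i) for i in range(10)]
responses = [hill(d, 0.0, 100.0, 1e-6, 1.0) for d in doses]
image = rasterize(doses, responses)
```

Scaling every curve to the same frame removes absolute dose and response units from the input, so a classifier sees only curve shape; whether AI4DR makes the same choice is not stated in the abstract.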

Citations: 0
Exploring chemical space — Generative models and their evaluation
Pub Date : 2023-02-04 DOI: 10.1016/j.ailsci.2023.100064
Martin Vogt

Recent advances in artificial intelligence, specifically in deep learning methods, have invigorated research into novel ways of exploring chemical space. Compared to more traditional methods that rely on chemical fragments and combinatorial recombination, deep generative models generate molecules in a non-transparent way that defies easy rationalization. However, this opaque nature also promises to explore uncharted chemical space in novel ways that do not rely directly on structural similarity. These aspects, together with the complexity of training such models, make the assessment of novelty, uniqueness, and distribution of the generated molecules a central concern. This perspective gives an overview of current methodologies for chemical space exploration, with an emphasis on deep neural network approaches. Key aspects of generative models include the choice of molecular representation, the targeted chemical space, and the methodology for assessing and validating chemical space coverage.
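The abstract names novelty and uniqueness as assessment criteria without defining them. As a minimal sketch under one common convention, uniqueness can be taken as the fraction of generated molecules that are distinct, and novelty as the fraction of distinct generated molecules absent from the training set. The sketch assumes every molecule is already a valid, canonical SMILES string, so that equal molecules compare equal as text; the example strings and function names are hypothetical.

```python
def uniqueness(generated):
    """Fraction of generated molecules that are distinct."""
    return len(set(generated)) / len(generated)

def novelty(generated, training):
    """Fraction of distinct generated molecules not seen in the training set."""
    distinct = set(generated)
    return len(distinct - set(training)) / len(distinct)

# Toy data: assumed to be canonical SMILES strings.
training = ["CCO", "CCN", "c1ccccc1"]
generated = ["CCO", "CCC", "CCC", "CCCl"]

print(uniqueness(generated))         # 3 distinct out of 4 generated
print(novelty(generated, training))  # 2 of the 3 distinct molecules are new
```

Definitions differ between benchmarks, for example in whether invalid structures are filtered out first, so scores are only comparable when computed the same way.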

Citations: 2