Unlocking the Potential of Clustering and Classification Approaches: Navigating Supervised and Unsupervised Chemical Similarity.

IF 10.1 | Zone 1, Environmental Science and Ecology | Q1 Environmental Sciences | Environmental Health Perspectives | Pub Date: 2024-08-01 | Epub Date: 2024-08-06 | DOI: 10.1289/EHP14001

Kamel Mansouri, Kyla Taylor, Scott Auerbach, Stephen Ferguson, Rachel Frawley, Jui-Hua Hsieh, Gloria Jahnke, Nicole Kleinstreuer, Suril Mehta, José T Moreira-Filho, Fred Parham, Cynthia Rider, Andrew A Rooney, Amy Wang, Vicki Sutherland

Environmental Health Perspectives, volume 132, issue 8, pages 85002. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11302584/pdf/

Citations: 0

Abstract

Background: The field of toxicology has witnessed substantial advancements in recent years, particularly with the adoption of new approach methodologies (NAMs) to understand and predict chemical toxicity. Class-based methods such as clustering and classification are key to NAM development and application, aiding the understanding of hazard and risk concerns associated with groups of chemicals without additional laboratory work. Advances in computational chemistry, data generation and availability, and machine learning algorithms represent important opportunities for continued improvement of these techniques to optimize their utility for specific regulatory and research purposes. However, because these methods are intricate, a deep understanding and careful selection are needed to match appropriate methods to their intended applications.

Objectives: This commentary aims to deepen the understanding of class-based approaches by elucidating the pivotal role of chemical similarity (structural and biological) in clustering and classification approaches (CCAs). It addresses the dichotomy between general end point-agnostic similarity, often entailing unsupervised analysis, and end point-specific similarity necessitating supervised learning. The goal is to highlight the nuances of these approaches, their applications, and common misuses.

Discussion: Understanding similarity is pivotal in toxicological research involving CCAs. The effectiveness of these approaches depends on the right definition and measure of similarity, which vary with the context and objectives of the study. This choice is influenced by how chemical structures are represented and, where applicable, by the labels indicating biological activity. The distinction between unsupervised clustering and supervised classification methods is vital, because the former calls for an end point-agnostic definition of similarity and the latter for an end point-specific one. Whether these methods are used separately or in combination, careful consideration is required to prevent bias and ensure relevance to the goal of the study. Unsupervised methods use end point-agnostic similarity measures to uncover general structural patterns and relationships, aiding hypothesis generation and facilitating exploration of datasets without the need for predefined labels or explicit guidance. Conversely, supervised techniques demand end point-specific similarity to group chemicals into predefined classes or to train classification models, allowing accurate predictions for new chemicals. Misuse can arise when unsupervised methods are applied to end point-specific contexts, like analog selection in read-across, leading to erroneous conclusions. This commentary provides insights into the significance of similarity and its role in supervised classification and unsupervised clustering approaches. https://doi.org/10.1289/EHP14001.
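The contrast drawn in the Discussion between end point-agnostic and end point-specific similarity can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the commentary: the fingerprints are invented sets of structural feature IDs, the activity labels are hypothetical, and the supervised step is a deliberately simple label-informed feature selection followed by a nearest-neighbor lookup rather than any method the authors describe.

```python
from typing import Dict, FrozenSet, Optional

Fingerprint = FrozenSet[int]

# Hypothetical structural fingerprints (sets of feature IDs); invented data.
TRAIN: Dict[str, Fingerprint] = {
    "A": frozenset({1, 2, 3, 4}),
    "B": frozenset({1, 2, 3, 5}),
    "D": frozenset({1, 2, 3, 6, 7, 8, 9}),
    "E": frozenset({1, 2, 3, 6, 10, 11, 12}),
}
# Hypothetical labels for a single end point (1 = active, 0 = inactive).
LABELS: Dict[str, int] = {"A": 0, "B": 0, "D": 1, "E": 1}
# The "new" chemical for which an analog is sought.
QUERY: Fingerprint = frozenset({1, 2, 3, 6})


def tanimoto(a: Fingerprint, b: Fingerprint,
             keep: Optional[Fingerprint] = None) -> float:
    """Tanimoto (Jaccard) similarity, optionally restricted to features in `keep`."""
    if keep is not None:
        a, b = a & keep, b & keep
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def discriminating_features(train: Dict[str, Fingerprint],
                            labels: Dict[str, int]) -> Fingerprint:
    """Supervised step: keep features present in every chemical of one class
    and absent from every chemical of the other (assumes both classes exist)."""
    actives = [fp for name, fp in train.items() if labels[name] == 1]
    inactives = [fp for name, fp in train.items() if labels[name] == 0]
    all_features = frozenset().union(*train.values())

    def freq(feature: int, group) -> float:
        return sum(feature in fp for fp in group) / len(group)

    return frozenset(
        f for f in all_features
        if abs(freq(f, actives) - freq(f, inactives)) == 1.0
    )


def nearest_analog(query: Fingerprint, train: Dict[str, Fingerprint],
                   keep: Optional[Fingerprint] = None) -> str:
    """Return the most similar training chemical (ties broken alphabetically)."""
    return max(sorted(train), key=lambda name: tanimoto(query, train[name], keep))


if __name__ == "__main__":
    # End point-agnostic: all structural features count equally, labels unused.
    agnostic = nearest_analog(QUERY, TRAIN)
    # End point-specific: only label-informed features count.
    keep = discriminating_features(TRAIN, LABELS)
    specific = nearest_analog(QUERY, TRAIN, keep)
    print("agnostic analog:", agnostic, "label", LABELS[agnostic])
    print("specific analog:", specific, "label", LABELS[specific])
```

In this toy data the query shares its scaffold (features 1, 2, 3) with every chemical, so the end point-agnostic search picks the inactive chemical A, whereas restricting similarity to the one feature that separates the classes (feature 6) picks the active chemical D. This mirrors, in miniature, the misuse the commentary warns about when end point-agnostic similarity is used for analog selection in read-across.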




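Source journal

Environmental Health Perspectives (Environmental Sciences; Public, Environmental and Occupational Health)

CiteScore: 14.40
Self-citation rate: 2.90%
Articles published: 388
Review time: 6 months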

About the journal: Environmental Health Perspectives (EHP) is a monthly peer-reviewed journal supported by the National Institute of Environmental Health Sciences, part of the National Institutes of Health under the U.S. Department of Health and Human Services. Its mission is to facilitate discussions on the connections between the environment and human health by publishing top-notch research and news. EHP ranks third in Public, Environmental, and Occupational Health, fourth in Toxicology, and fifth in Environmental Sciences.