High-throughput screening (HTS) technologies allow the biological testing of hundreds of thousands of compounds per day. Typically, a substantial proportion of the initial hits obtained by HTS are artifacts caused by assay interference. Therefore, global and technology-specific in silico models for identifying and predicting compounds that interfere with biological assays have been developed. The global models benefit from training on large screening data sets, while the specialized models benefit from training on assay technology-specific experimental data. In this work, we develop and explore strategies for generating better predictors of technology-specific assay interference by utilizing the large bioactivity data matrices on which global models are trained and by employing partially new compound labeling approaches to maintain the assay technology awareness of specialized models. We demonstrate the utility of the statistically derived interference labels in machine learning using fluorescence-based assay interference as a representative example. Our random forest and multi-layer perceptron classifiers showed improved performance compared to existing models, achieving Matthews correlation coefficients (MCCs) of up to 0.47 on holdout data and up to 0.45 on an external test set. These results demonstrate that accurate assay-specific interference labels can be derived from large bioactivity data matrices, enabling the development of new machine-learning models without the need for further experimental data.
{"title":"Statistical approaches enabling technology-specific assay interference prediction from large screening data sets","authors":"Vincenzo Palmacci, Steffen Hirte, Jorge Enrique Hernández González, Floriane Montanari, Johannes Kirchmair","doi":"10.1016/j.ailsci.2024.100099","DOIUrl":"https://doi.org/10.1016/j.ailsci.2024.100099","journal":{"name":"Artificial intelligence in the life sciences","volume":"5","pages":"Article 100099"},"publicationDate":"2024-06-01","publicationTypes":"Journal Article","openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2667318524000060/pdfft?md5=b99d896dcc34d54ad38a7b8ccb52ebda&pid=1-s2.0-S2667318524000060-main.pdf"}
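The abstract above reports performance as Matthews correlation coefficients (MCCs). As a reminder of what that metric measures, here is a minimal sketch that computes the MCC from confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the paper; returning 0.0 when the denominator vanishes is a common convention, assumed here.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (no better than chance)
    to +1 (perfect prediction). Returns 0.0 when the denominator is zero,
    which is a convention assumed for this sketch.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical, imbalanced screening-style counts (not data from the paper):
# few interfering compounds (positives), many clean ones (negatives).
example_mcc = matthews_corrcoef(tp=40, tn=900, fp=30, fn=30)
print(f"MCC = {example_mcc:.3f}")
```

Unlike plain accuracy, the MCC uses all four cells of the confusion matrix, which is why it is often preferred for imbalanced problems such as interference prediction, where interfering compounds are a small minority.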
Pub Date: 2024-05-09 DOI: 10.1016/j.ailsci.2024.100098
Li Ju, Andreas Hellander, Ola Spjuth
Having access to sufficient data is essential for training accurate machine learning models, but much data is not publicly available. In drug discovery this is particularly evident, as much data is withheld by pharmaceutical companies for various reasons. Federated learning (FL) aims to train a joint model across multiple parties without disclosing data between them. In this work, we leverage FL to predict compound mechanism of action (MoA) using fluorescence image data from cell painting. Our study evaluates the effectiveness and efficiency of FL, comparing it with non-collaborative and data-sharing collaborative learning in diverse scenarios. Specifically, we investigate the impact of data heterogeneity across participants on MoA prediction, an essential concern in real-life applications of FL, and demonstrate the benefits for all involved parties. This work highlights the potential of federated learning in multi-institutional collaborative machine learning for drug discovery and assessment of chemicals, offering a promising avenue to overcome data-sharing constraints.
{"title":"Federated learning for predicting compound mechanism of action based on image-data from cell painting","authors":"Li Ju, Andreas Hellander, Ola Spjuth","doi":"10.1016/j.ailsci.2024.100098","DOIUrl":"https://doi.org/10.1016/j.ailsci.2024.100098","journal":{"name":"Artificial intelligence in the life sciences","volume":"5","pages":"Article 100098"},"publicationDate":"2024-05-09","publicationTypes":"Journal Article","openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2667318524000059/pdfft?md5=100e1ed9ac27f95816db906647d11bc0&pid=1-s2.0-S2667318524000059-main.pdf"}
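The federated learning abstract above describes training a joint model without sharing data between parties. One widely used aggregation scheme that fits this description is federated averaging, in which each party trains locally and a server combines the resulting parameters, weighted by local data set size. The sketch below illustrates only that aggregation step. The abstract does not state which algorithm the authors used, so this is an assumption, and all names and numbers are hypothetical.

```python
from typing import List, Sequence


def federated_average(
    client_weights: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Combine per-client parameter vectors into one global vector.

    Each client's parameters are weighted by its share of the total number
    of training samples. Only parameters and sample counts reach the server;
    the raw data stays with each client.
    """
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one non-empty weight vector per client size")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    n_params = len(client_weights[0])
    averaged = [0.0] * n_params
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != n_params:
            raise ValueError("all clients must share the same model shape")
        for i, value in enumerate(weights):
            averaged[i] += value * size / total
    return averaged


# Two hypothetical parties with unequal amounts of data.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [100, 300])
print(global_weights)
```

Weighting by sample count means a party with more data pulls the global model further toward its local solution, which is one reason data heterogeneity across participants, the concern raised in the abstract, matters in practice.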
Pub Date: 2024-04-13 DOI: 10.1016/j.ailsci.2024.100097
Yaroslav Chushak, Rebecca A. Clewell
A variety of environmental and physiological conditions can cause oxidative stress that damages cellular components such as DNA, proteins, and lipids. Oxidative stress is implicated in many human diseases including cancer, cardiovascular diseases, neurological diseases, inflammatory diseases, and aging. The nuclear factor erythroid 2–related factor 2 (NRF2) is a transcription factor that plays a key role in the cellular antioxidant defense system, as it regulates transcription of antioxidant proteins and detoxifying enzymes. There is an urgent need to identify novel compounds that activate NRF2 and enhance antioxidant defense. We collected data from the high-throughput screening of NRF2 activators and identified molecular fragments (structural alerts) associated with the activation of NRF2. We also developed ten classification models using different types of molecular descriptors and machine learning techniques. Two approaches were used to establish the applicability domain of the developed models: the structure-based approach and the distance-to-model approach. The best-performing model, which used a message passing neural network (MPNN), achieved an accuracy of 87% on test-set chemicals within a distance to model of 0.3. An integrative approach combining the generated structural alerts and the MPNN model was used to screen approved drugs collected in DrugBank to identify potential NRF2 activators. Out of 2393 screened chemicals, 138 compounds were predicted as NRF2 activators by both approaches. Analysis of these compounds showed that some drugs are already known activators of NRF2, while others are potentially novel activators.
{"title":"An integrated approach to predict activators of NRF2 - the transcription factor for oxidative stress response","authors":"Yaroslav Chushak, Rebecca A. Clewell","doi":"10.1016/j.ailsci.2024.100097","DOIUrl":"https://doi.org/10.1016/j.ailsci.2024.100097","journal":{"name":"Artificial intelligence in the life sciences","volume":"5","pages":"Article 100097"},"publicationDate":"2024-04-13","publicationTypes":"Journal Article","openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2667318524000047/pdfft?md5=29a2ee24a6813324417f266b95b1e48d&pid=1-s2.0-S2667318524000047-main.pdf"}
Pub Date: 2024-03-21 DOI: 10.1016/j.ailsci.2024.100096
Filip Miljković, José L. Medina-Franco
In chemoinformatics, artificial intelligence (AI) continues to develop a symbiotic relationship with open science (OS). Such a close AI-OS interaction brings substantial practical benefits in research, scientific dissemination, and education, to name a few areas. The AI-OS symbiosis can be further enhanced by combining sufficient substantive expertise, mathematical and statistical knowledge, and coding skills. This Viewpoint discusses the benefits of the smooth and productive interaction between AI, OS, and open data. We also present a short list of misconceptions and pitfalls surrounding AI-OS and propose correct responses and behaviors agreed upon by field experts. In addition, we provide suggestions to continue enhancing the positive contributions of the AI-OS symbiosis towards chemoinformatics.
{"title":"Artificial intelligence-open science symbiosis in chemoinformatics","authors":"Filip Miljković, José L. Medina-Franco","doi":"10.1016/j.ailsci.2024.100096","DOIUrl":"10.1016/j.ailsci.2024.100096","journal":{"name":"Artificial intelligence in the life sciences","volume":"5","pages":"Article 100096"},"publicationDate":"2024-03-21","publicationTypes":"Journal Article","openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2667318524000035/pdfft?md5=15b234d142847a979a68f7886068152e&pid=1-s2.0-S2667318524000035-main.pdf"}
Pub Date: 2024-02-01 DOI: 10.1016/j.ailsci.2024.100095
Negin Sadat Babaiha, Sathvik Guru Rao, Jürgen Klein, Bruce Schultz, Marc Jacobs, Martin Hofmann-Apitius
Biomedical knowledge graphs (KGs) hold valuable information regarding biomedical entities such as genes, diseases, biological processes, and drugs. KGs have been successfully employed in challenging biomedical areas such as the identification of pathophysiology mechanisms or drug repurposing. The creation of high-quality KGs typically requires labor-intensive multi-database integration or substantial human expert curation, both of which take time and contribute to the workload of data processing and annotation. Therefore, the use of automatic systems for KG building and maintenance is a prerequisite for the wide uptake and utilization of KGs. Technologies supporting the automated generation and updating of KGs typically make use of Natural Language Processing (NLP), which is optimized for extracting implicit triples described in relevant biomedical text sources. At the core of this challenge is how to improve the accuracy and coverage of the information extraction module by utilizing different models and tools. The emergence of pre-trained large language models (LLMs), such as ChatGPT, which has rapidly grown in popularity, has revolutionized the field of NLP and made such models potential candidates for text-based graph creation as well. So far, no previous work has investigated the power of LLMs on the generation of cause-and-effect networks and KGs encoded in Biological Expression Language (BEL). In this paper, we present initial studies towards one-shot BEL relation extraction using two different versions of the Generative Pre-trained Transformer (GPT) models and evaluate their performance by comparing the extracted results to a highly accurate BEL KG manually curated by domain experts.
{"title":"Rationalism in the face of GPT hypes: Benchmarking the output of large language models against human expert-curated biomedical knowledge graphs","authors":"Negin Sadat Babaiha, Sathvik Guru Rao, Jürgen Klein, Bruce Schultz, Marc Jacobs, Martin Hofmann-Apitius","doi":"10.1016/j.ailsci.2024.100095","DOIUrl":"https://doi.org/10.1016/j.ailsci.2024.100095","journal":{"name":"Artificial intelligence in the life sciences","volume":"5","pages":"Article 100095"},"publicationDate":"2024-02-01","publicationTypes":"Journal Article","openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2667318524000023/pdfft?md5=9137dd2a207653e4d13cb5b99ca17d48&pid=1-s2.0-S2667318524000023-main.pdf"}
Pub Date: 2024-01-03 DOI: 10.1016/j.ailsci.2024.100094
Jürgen Bajorath
{"title":"Origins and progression of the polypharmacology concept in drug discovery","authors":"Jürgen Bajorath","doi":"10.1016/j.ailsci.2024.100094","DOIUrl":"https://doi.org/10.1016/j.ailsci.2024.100094","journal":{"name":"Artificial intelligence in the life sciences","volume":"5","pages":"Article 100094"},"publicationDate":"2024-01-03","publicationTypes":"Journal Article","openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2667318524000011/pdfft?md5=ef2f5411ede3a24f3429765640c3360c&pid=1-s2.0-S2667318524000011-main.pdf"}
Pub Date: 2023-12-11 DOI: 10.1016/j.ailsci.2023.100093
Jürgen Bajorath
{"title":"Potential inconsistencies or artifacts in deriving and interpreting deep learning models and key criteria for scientifically sound applications in the life sciences","authors":"Jürgen Bajorath","doi":"10.1016/j.ailsci.2023.100093","DOIUrl":"https://doi.org/10.1016/j.ailsci.2023.100093","journal":{"name":"Artificial intelligence in the life sciences","volume":"5","pages":"Article 100093"},"publicationDate":"2023-12-11","publicationTypes":"Journal Article","openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2667318523000375/pdfft?md5=889ff9050b182b6d486b269b3cf0eed4&pid=1-s2.0-S2667318523000375-main.pdf"}
Pub Date: 2023-12-02 DOI: 10.1016/j.ailsci.2023.100089
Zhixiong Li, Yan Xiang, Yujing Wen, Daniel Reker
Active machine learning is an established and increasingly popular experimental design technique where the machine learning model can request additional data to improve the model's predictive performance. It is generally assumed that this data is optimal for the machine learning model since it relies on the model's predictions or model architecture and therefore cannot be transferred to other models. Inspired by research in pedagogy, we here introduce the concept of yoked machine learning where a second machine learning model learns from the data selected by another model. We found that in 48% of the benchmarked combinations, yoked learning performed similarly to or better than active learning. We analyze distinct cases in which yoked learning can improve active learning performance. In particular, we prototype yoked deep learning (YoDeL) where a classic machine learning model provides data to a deep neural network, thereby mitigating challenges of active deep learning such as slow refitting time per learning iteration and poor performance on small datasets. In summary, we expect the new concept of yoked (deep) learning to provide a competitive option to boost the performance of active learning and benefit from distinct capabilities of multiple machine learning models during data acquisition, training, and deployment.
{"title":"Yoked learning in molecular data science","authors":"Zhixiong Li, Yan Xiang, Yujing Wen, Daniel Reker","doi":"10.1016/j.ailsci.2023.100089","DOIUrl":"https://doi.org/10.1016/j.ailsci.2023.100089","journal":{"name":"Artificial intelligence in the life sciences","volume":"5","pages":"Article 100089"},"publicationDate":"2023-12-02","publicationTypes":"Journal Article","openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2667318523000338/pdfft?md5=798e4cffb7539da96cce07297e51e3de&pid=1-s2.0-S2667318523000338-main.pdf"}
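The yoked learning abstract above describes one model (the "teacher") choosing which data to acquire while a second model (the "student") learns from that selection. The toy sketch below shows the control flow only. The teacher (a nearest-centroid classifier using margin-based uncertainty sampling), the student (a 1-nearest-neighbor classifier), the 1-D data, and all function names are stand-ins assumed for illustration; they are not the models or benchmarks used in the paper.

```python
from typing import Callable, List, Tuple

Labeled = List[Tuple[float, int]]


def fit_centroids(labeled: Labeled) -> Tuple[float, float]:
    """Teacher model: mean position of each class (needs both classes)."""
    centroids = []
    for cls in (0, 1):
        xs = [x for x, y in labeled if y == cls]
        centroids.append(sum(xs) / len(xs))
    return centroids[0], centroids[1]


def teacher_margin(x: float, centroids: Tuple[float, float]) -> float:
    """Small margin = the teacher is unsure which class x belongs to."""
    return abs(abs(x - centroids[0]) - abs(x - centroids[1]))


def student_predict(x: float, labeled: Labeled) -> int:
    """Student model: 1-nearest-neighbor on the teacher-selected data."""
    return min(labeled, key=lambda pair: abs(pair[0] - x))[1]


def yoked_learning(
    pool: List[float],
    seed: Labeled,
    oracle: Callable[[float], int],
    rounds: int,
) -> Labeled:
    """Let the teacher pick queries; return the data the student trains on.

    The seed set must contain at least one example of each class.
    """
    labeled = list(seed)
    unlabeled = list(pool)
    for _ in range(min(rounds, len(unlabeled))):
        centroids = fit_centroids(labeled)  # refit the teacher only
        query = min(unlabeled, key=lambda x: teacher_margin(x, centroids))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))  # "run the experiment"
    return labeled


# Hypothetical 1-D task: the true label is 1 when x > 0.5.
selected = yoked_learning(
    pool=[0.1, 0.3, 0.48, 0.6, 0.9],
    seed=[(0.0, 0), (1.0, 1)],
    oracle=lambda x: int(x > 0.5),
    rounds=2,
)
print(selected)
print(student_predict(0.58, selected))
```

Because only the teacher is refit inside the loop, the student can be an expensive model that is trained once at the end, which mirrors the motivation the abstract gives for pairing a classic model with a deep neural network.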
Pub Date: 2023-11-25 DOI: 10.1016/j.ailsci.2023.100090
Linnea K. Andersen, Benjamin J. Reading
Recent technological advancements have revolutionized research capabilities across the biological sciences by enabling the collection of large data that provides a broader picture of systems from the cellular to ecosystem level at a more refined resolution. The rapid rate of generating these data has exacerbated bottlenecks in study design and data analysis approaches, especially as conventional methods that incorporate traditional statistical tests and assumptions are not suitable or sufficient for highly dimensional data (i.e., more than 1,000 variables). The application of machine learning techniques in large data analysis is one promising solution that is increasingly popular. However, a lack of the expertise needed to interpret the results of machine learning models and gain meaningful biological insight poses a great challenge. To address this challenge, we provide a user-friendly machine learning workflow that can be applied to a wide variety of data types to reduce these large data to those variables (attributes) most determinant of experimental and/or observed conditions, as well as a general overview of data analysis and machine learning approaches and related considerations. The workflow presented here has been beta-tested with great success and is recommended for incorporation into analysis pipelines of large data as a standardized approach to reduce data dimensionality. Moreover, the workflow is flexible, and the underlying concepts and steps can be modified to best suit user needs, objectives, and study parameters.
{"title":"A supervised machine learning workflow for the reduction of highly dimensional biological data","authors":"Linnea K. Andersen, Benjamin J. Reading","doi":"10.1016/j.ailsci.2023.100090","DOIUrl":"https://doi.org/10.1016/j.ailsci.2023.100090","journal":{"name":"Artificial intelligence in the life sciences","volume":"5","pages":"Article 100090"},"publicationDate":"2023-11-25","publicationTypes":"Journal Article","openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S266731852300034X/pdfft?md5=c41b31c74fb0a867fbb87db01c8f6190&pid=1-s2.0-S266731852300034X-main.pdf"}
Pub Date: 2023-11-15 DOI: 10.1016/j.ailsci.2023.100088
Jürgen Bajorath, Steve Gardner, Francesca Grisoni, Carolina Horta Andrade, Johannes Kirchmair, Melissa Landon, José L. Medina-Franco, Filip Miljković, Floriane Montanari, Raquel Rodríguez-Pérez
{"title":"First-generation themed article collections","authors":"Jürgen Bajorath, Steve Gardner, Francesca Grisoni, Carolina Horta Andrade, Johannes Kirchmair, Melissa Landon, José L. Medina-Franco, Filip Miljković, Floriane Montanari, Raquel Rodríguez-Pérez","doi":"10.1016/j.ailsci.2023.100088","DOIUrl":"https://doi.org/10.1016/j.ailsci.2023.100088","journal":{"name":"Artificial intelligence in the life sciences","volume":"4","pages":"Article 100088"},"publicationDate":"2023-11-15","publicationTypes":"Journal Article","openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2667318523000326/pdfft?md5=cf39af5e517634e52f28e066ee1fd1da&pid=1-s2.0-S2667318523000326-main.pdf"}