David H. Margarit, Gustavo Paccosi, Marcela V. Reale, Lilia M. Romanelli
This study presents an interdisciplinary approach to analyse the distribution of cancer stem cell markers (CSCMs) across various cancer-affected organs using hypergraphs. Cancer stem cells (CSCs) play a crucial role in cancer initiation, progression, and metastasis. By employing hypergraphs, we model the relationships between CSCM locations and cancerous organs, providing a comprehensive representation of these interactions. Initially, we utilised an unweighted incidence matrix and its Markov transition matrices to gain a dynamic perspective on CSCM distributions. This method allows us to observe how these markers spread and influence cancer progression in a dynamical context. By calculating mutual information for each node and hyperedge, our analysis uncovers complex interaction patterns between CSCMs and organs, highlighting the critical roles of certain markers in cancer progression and metastasis. Our approach offers a detailed representation of cancer stem cell networks, enhancing our understanding of the mechanisms driving cancer heterogeneity and metastasis. By integrating hypergraph theory with cancer biology, this study provides valuable insights for developing targeted cancer therapies.
{"title":"Mapping Cancer Stem Cell Markers Distribution: A Hypergraph Analysis Across Organs","authors":"David H. Margarit, Gustavo Paccosi, Marcela V. Reale, Lilia M. Romanelli","doi":"arxiv-2407.19330","DOIUrl":"https://doi.org/arxiv-2407.19330","journal":{"name":"arXiv - QuanBio - Quantitative Methods"},"publicationDate":"2024-07-27"}
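The hypergraph abstract above mentions an unweighted incidence matrix and Markov transition matrices derived from it, but does not spell out the construction. The following is a minimal sketch of one common choice, a two-step random walk (node to hyperedge to node) with P = Dv^-1 H De^-1 H^T. Treating markers as nodes and organs as hyperedges, the names m1 to m3, and the toy memberships are all assumptions made for illustration, not data from the paper.

```python
def transition_matrix(incidence):
    """Row-stochastic node-to-node matrix for a two-step hypergraph walk.

    incidence[i][e] is 1 if node i belongs to hyperedge e, else 0.
    From node i, pick one of its hyperedges uniformly, then pick a
    node inside that hyperedge uniformly (possibly i itself).
    """
    n_nodes = len(incidence)
    n_edges = len(incidence[0])
    node_degree = [sum(row) for row in incidence]
    edge_size = [sum(incidence[i][e] for i in range(n_nodes)) for e in range(n_edges)]

    P = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        for e in range(n_edges):
            if not incidence[i][e]:
                continue
            p_edge = 1.0 / node_degree[i]  # probability of choosing hyperedge e
            for j in range(n_nodes):
                if incidence[j][e]:
                    P[i][j] += p_edge / edge_size[e]
    return P


# Hypothetical toy hypergraph: 3 markers (rows) x 2 organs (columns).
H = [
    [1, 0],  # marker m1 appears only in organ A
    [1, 1],  # marker m2 appears in organs A and B
    [0, 1],  # marker m3 appears only in organ B
]
P = transition_matrix(H)
for row in P:
    print(row)
```

In this toy case the shared marker m2 is the only route between m1 and m3, which is the kind of bridging role a transition matrix makes visible. Other walk definitions (for example, excluding self-transitions) would give different matrices.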
Recent advancements in large language models have opened new possibilities for generative molecular drug design. We present Chemlactica and Chemma, two language models fine-tuned on a novel corpus of 110M molecules with computed properties, totaling 40B tokens. These models demonstrate strong performance in generating molecules with specified properties and predicting new molecular characteristics from limited samples. We introduce a novel optimization algorithm that leverages our language models to optimize molecules for arbitrary properties given limited access to a black box oracle. Our approach combines ideas from genetic algorithms, rejection sampling, and prompt optimization. It achieves state-of-the-art performance on multiple molecular optimization benchmarks, including an 8% improvement on Practical Molecular Optimization compared to previous methods. We publicly release the training corpus, the language models and the optimization algorithm.
{"title":"Small Molecule Optimization with Large Language Models","authors":"Philipp Guevorguian, Menua Bedrosian, Tigran Fahradyan, Gayane Chilingaryan, Hrant Khachatrian, Armen Aghajanyan","doi":"arxiv-2407.18897","DOIUrl":"https://doi.org/arxiv-2407.18897","journal":{"name":"arXiv - QuanBio - Quantitative Methods"},"publicationDate":"2024-07-26"}
Methods for personalizing medical treatment are the focal point of contemporary biomedical research. In cancer care, we can analyze the effects of therapies at the level of individual cells. Quantitative characterization of treatment efficacy, and evaluation of why some individuals respond to specific regimens whereas others do not, requires approaches beyond genetic sequencing at single time points. Methods for the continuous analysis of changes in phenotype, such as in vivo and ex vivo morphology and motion tracking of cellular proteins and organelles over time frames spanning minutes to hours, can provide important insights into patient treatment options. Despite improvements in the diagnosis and therapy of many types of breast cancer (BC), many aggressive forms, such as receptor triple-negative cancers, are associated with the worst patient outcomes. Although cytotoxic chemotherapy initially reduces tumor burden in some patients, acquired resistance to it is almost universal, and there is no rationale for identifying intrinsically drug-resistant and drug-sensitive patient populations before initiating therapy. During cell division, the mitotic spindles of the receptor triple-negative cell line MDA-MB-231 are larger than those of other BC cell lines. Many of the MDA-MB-231 spindles exhibit rapid lateral twisting during metaphase, which remains unaffected by knockdown of the oncogene Myc and by treatment with inhibitors of the serine/threonine-protein kinase B-Raf and the epidermal growth factor receptor (EGFR), alone or in any combination. In this manuscript, we outline a strategy for selecting the optimal tubulin inhibitor based on its ability to affect microtubule (MT) dynamics.
{"title":"Mitosis, Cytoskeleton Regulation, and Drug Resistance in Receptor Triple Negative Breast Cancer","authors":"Alexandre Matov","doi":"arxiv-2407.19112","DOIUrl":"https://doi.org/arxiv-2407.19112","journal":{"name":"arXiv - QuanBio - Quantitative Methods"},"publicationDate":"2024-07-26"}
Burak Yelmen, Maris Alver, Estonian Biobank Research Team, Flora Jay, Lili Milani
Investigating the genetic architecture of complex diseases is challenging due to the highly polygenic and interactive landscape of genetic and environmental factors. Although genome-wide association studies (GWAS) have identified thousands of variants for multiple complex phenotypes, conventional statistical approaches can be limited by simplifying assumptions such as linearity and the lack of epistasis models. In this work, we trained artificial neural networks to predict complex traits using both simulated and real genotype/phenotype datasets. We extracted feature importance scores via different post hoc interpretability methods to identify potentially associated loci (PAL) for the target phenotype. Simulations with various parameters demonstrated that associated loci can be detected with good precision using strict selection criteria but that, as in conventional GWAS, downstream analyses are required to fine-map the exact variants because of linkage disequilibrium. By applying our approach to the schizophrenia cohort in the Estonian Biobank, we detected multiple PAL related to this highly polygenic and heritable disorder. We also performed enrichment analyses with PAL in genic regions, which predominantly identified terms associated with brain morphology. With further improvements in model optimization and confidence measures, artificial neural networks can enhance the identification of genomic loci associated with complex diseases, providing a more comprehensive approach for GWAS and serving as initial screening tools for subsequent functional studies.
Keywords: Deep learning, interpretability, genome-wide association studies, complex diseases
{"title":"Interpreting artificial neural networks to detect genome-wide association signals for complex traits","authors":"Burak Yelmen, Maris Alver, Estonian Biobank Research Team, Flora Jay, Lili Milani","doi":"arxiv-2407.18811","DOIUrl":"https://doi.org/arxiv-2407.18811","journal":{"name":"arXiv - QuanBio - Quantitative Methods"},"publicationDate":"2024-07-26"}
Sheryl L. Chang, Quang Dang Nguyen, Carl J. E. Suster, Christina M. Jamerlan, Rebecca J. Rockett, Vitali Sintchenko, Tania C. Sorrell, Alexandra Martiniuk, Mikhail Prokopenko
Recurrent waves, which are often observed during long pandemics, typically form as a result of several interrelated dynamics, including public health interventions, population mobility and behaviour, varying disease transmissibility due to pathogen mutations, and changes in host immunity due to recency of vaccination or previous infections. Complex nonlinear dependencies among these dynamics, including feedback between disease incidence and the opinion-driven adoption of social distancing behaviour, remain poorly understood, particularly in scenarios involving heterogeneous populations, partial and waning immunity, and rapidly changing public opinions. This study addressed this challenge by proposing an opinion dynamics model that accounts for changes in social distancing behaviour (i.e., whether to adopt social distancing) by modelling both individual risk perception and peer pressure. The opinion dynamics model was integrated and validated within a large-scale agent-based COVID-19 pandemic simulation that modelled the spread of the Omicron variant of SARS-CoV-2 between December 2021 and June 2022 in Australia. Our study revealed that the fluctuating adoption of social distancing, shaped by individual risk aversion and social peer pressure from both household and workplace environments, may explain the observed pattern of recurrent waves of infections.
{"title":"Impact of opinion dynamics on recurrent pandemic waves: balancing risk aversion and peer pressure","authors":"Sheryl L. Chang, Quang Dang Nguyen, Carl J. E. Suster, Christina M. Jamerlan, Rebecca J. Rockett, Vitali Sintchenko, Tania C. Sorrell, Alexandra Martiniuk, Mikhail Prokopenko","doi":"arxiv-2408.00011","DOIUrl":"https://doi.org/arxiv-2408.00011","journal":{"name":"arXiv - QuanBio - Quantitative Methods"},"publicationDate":"2024-07-26"}
Swati Adhikari (The University of Burdwan), Parthajit Roy (The University of Burdwan)
Cavities on protein structures form through interactions between proteins and small molecules known as ligands; they are the locations where ligands bind to proteins. Accurate detection of these locations is essential to the success of the entire drug design process. This study proposes a novel cavity detection model, based on Voronoi tessellation, for detecting cavities on protein structures. The atom space of a protein structure is dense and large, and the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm handles such data well without requiring the number of clusters (cavities) to be known a priori. This study therefore implements the proposed model with the DBSCAN algorithm.
{"title":"CavDetect: A DBSCAN Algorithm based Novel Cavity Detection Model on Protein Structure","authors":"Swati Adhikari (The University of Burdwan), Parthajit Roy (The University of Burdwan)","doi":"arxiv-2407.18317","DOIUrl":"https://doi.org/arxiv-2407.18317","journal":{"name":"arXiv - QuanBio - Quantitative Methods"},"publicationDate":"2024-07-25"}
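The CavDetect abstract above relies on DBSCAN because it groups dense regions without a preset cluster count. The following is a minimal, brute-force sketch of the DBSCAN step alone, written from the textbook definition of the algorithm. It is not the authors' implementation: how the Voronoi tessellation feeds the clustering is not described in the abstract, and the coordinates, `eps`, and `min_pts` values are hypothetical.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Brute-force range query; a point counts as its own neighbour.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point, keep expanding
                queue.extend(reachable)
    return labels


# Hypothetical 3D coordinates: two dense groups and one isolated point.
atoms = [
    (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
    (10.0, 10.0, 10.0), (11.0, 10.0, 10.0), (10.0, 11.0, 10.0),
    (50.0, 50.0, 50.0),
]
labels = dbscan(atoms, eps=1.5, min_pts=3)
print(labels)
```

The number of clusters falls out of the data and the two density parameters, which is the property the abstract highlights. The brute-force neighbour search is quadratic, so a real protein structure would need a spatial index.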
Pengzhi Zhang, Jules Nde, Yossi Eliaz, Nathaniel Jennings, Piotr Cieplak, Margaret S. Cheung
Protein fuzziness is a feature that communicates changes in cell signaling instigated by binding with secondary messengers, such as calcium ions, which are associated with the coordination of muscle contraction, neurotransmitter release, and gene expression. When binding to the disordered parts of a protein, calcium ions must balance their charge states with the shape of calcium-binding proteins and their versatile pool of partners, depending on the circumstances they transmit. It is unclear, however, whether the limited experimental data available can be used to train models that accurately predict the charges of calcium-binding protein variants. Here, we developed a chemistry-informed machine-learning algorithm that implements a game-theoretic approach to explain the output of a machine-learning model, without requiring an excessively large database for high-performance prediction of atomic charges. We used ab initio electronic structure data representing calcium ions and the structures of the disordered segments of calcium-binding peptides with surrounding water molecules to train several explainable models. Network theory was used to extract the topological features of atomic interactions in the structurally complex data dictated by the coordination chemistry of a calcium ion, a potent indicator of its charge state in a protein. With our designs, we provide a framework of explainable machine-learning models that uses domain knowledge to annotate the atomic charges of calcium ions in calcium-binding proteins in response to chemical changes in the environment, based on the limited scientific data available in a genome space.
{"title":"Chemistry-informed Machine Learning Explains Calcium-binding Proteins Fuzzy Shape for Communicating Changes in the Atomic States of Calcium Ions","authors":"Pengzhi Zhang, Jules Nde, Yossi Eliaz, Nathaniel Jennings, Piotr Cieplak, Margaret S. Cheung","doi":"arxiv-2407.17017","DOIUrl":"https://doi.org/arxiv-2407.17017","journal":{"name":"arXiv - QuanBio - Quantitative Methods"},"publicationDate":"2024-07-24"}
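The calcium-binding abstract above refers to a game-theoretic approach to explaining model output but does not name the method. The sketch below assumes it means Shapley values, the standard game-theoretic attribution, and computes them exactly by averaging marginal contributions over all feature orderings. The toy model, its three unnamed features, and the all-zero baseline are hypothetical stand-ins, not the paper's charge-prediction model.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley attribution of model(x) - model(baseline) to each feature.

    Features absent from a coalition keep their baseline value. The cost is
    factorial in the number of features, so this only suits tiny examples.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = x[i]  # feature i joins the coalition
            value = model(current)
            phi[i] += value - previous  # marginal contribution of feature i
            previous = value
    return [total / len(orderings) for total in phi]


def toy_model(features):
    # Hypothetical model: one additive term plus one pairwise interaction.
    return 2.0 * features[0] + features[1] * features[2]


phi = shapley_values(toy_model, x=[1.0, 2.0, 3.0], baseline=[0.0, 0.0, 0.0])
print(phi)  # the interaction term is split equally between features 1 and 2
```

The attributions sum to the difference between the prediction and the baseline output, which is the property that makes this family of explanations attractive when the training set is small and every prediction needs to be inspected.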
Yufeng Li, Wenchao Zhao, Bo Dang, Xu Yan, Weimin Wang, Min Gao, Mingxuan Xiao
In clinical treatment, identifying the potential adverse reactions of drugs can help doctors make medication decisions. Previous studies suffer from high-dimensional, sparse features, the need to build an independent prediction model for each adverse drug reaction, and low prediction accuracy. To address these problems, this paper develops an adverse drug reaction prediction model based on knowledge graph embedding and deep learning that makes unified predictions for the adverse drug reactions covered by the experiments. Knowledge graph embedding fuses the associated information between drugs and alleviates the high-dimensional sparsity of feature matrices, while the efficient training capabilities of deep learning improve the model's prediction accuracy. This paper builds an adverse drug reaction knowledge graph from drug feature data, compares the embedding quality of the knowledge graph under different embedding strategies, selects the best strategy to obtain sample vectors, and then constructs a convolutional neural network model to predict adverse reactions. The results show that the convolutional neural network performs best with the DistMult embedding model and 400-dimensional embeddings; across repeated experiments, its average accuracy, F1 score, recall, and area under the curve are better than those of methods reported in the literature. The resulting model has good prediction accuracy and stability and can provide an effective reference for later safe medication guidance.
{"title":"Research on Adverse Drug Reaction Prediction Model Combining Knowledge Graph Embedding and Deep Learning","authors":"Yufeng Li, Wenchao Zhao, Bo Dang, Xu Yan, Weimin Wang, Min Gao, Mingxuan Xiao","doi":"arxiv-2407.16715","DOIUrl":"https://doi.org/arxiv-2407.16715","journal":{"name":"arXiv - QuanBio - Quantitative Methods"},"publicationDate":"2024-07-23"}
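The adverse drug reaction abstract above selects DistMult as its embedding model. DistMult scores a (head, relation, tail) triple as the sum over dimensions of the element-wise product of the three embedding vectors. The sketch below shows only that scoring function; the 4-dimensional vectors and the entity names are made up for illustration (the paper reports 400 dimensions), and how the embeddings are trained is not covered here.

```python
def distmult_score(head, relation, tail):
    """DistMult triple score: sum_k head[k] * relation[k] * tail[k]."""
    if not (len(head) == len(relation) == len(tail)):
        raise ValueError("embeddings must share one dimension")
    return sum(h * r * t for h, r, t in zip(head, relation, tail))


# Hypothetical 4-dimensional embeddings, chosen by hand for the example.
drug = [1.0, 0.0, 2.0, 1.0]
causes = [0.5, 1.0, 1.0, 0.0]       # relation embedding
reaction_a = [2.0, 3.0, 1.0, 5.0]
reaction_b = [0.0, 1.0, -1.0, 2.0]

scores = {
    "reaction_a": distmult_score(drug, causes, reaction_a),
    "reaction_b": distmult_score(drug, causes, reaction_b),
}
print(scores)  # a higher score means the triple is judged more plausible
```

Because the product is commutative, DistMult gives a triple and its reverse the same score, so it cannot distinguish the direction of a relation. That is a known limitation of the model rather than a feature of this sketch.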
Sheikh Mohammed Shariful Islam, Moloud Abrar, Teketo Tegegne, Liliana Loranjo, Chandan Karmakar, Md Abdul Awal, Md. Shahadat Hossain, Muhammad Ashad Kabir, Mufti Mahmud, Abbas Khosravi, George Siopis, Jeban C Moses, Ralph Maddison
Machine learning models have the potential to identify cardiovascular diseases (CVDs) early and accurately in primary healthcare settings, which is crucial for delivering timely treatment and management. Although population-based CVD risk models have been used traditionally, these models often do not consider variations in lifestyles, socioeconomic conditions, or genetic predispositions. Therefore, we aimed to develop machine learning models for CVD detection using primary healthcare data, compare the performance of different models, and identify the best models. We used data from the UK Biobank study, which included over 500,000 middle-aged participants from different primary healthcare centers in the UK. Data collected at baseline (2006-2010) and during imaging visits after 2014 were used in this study. Baseline characteristics, including sex, age, and the Townsend Deprivation Index, were included. Participants were classified as having CVD if they reported at least one of the following conditions: heart attack, angina, stroke, or high blood pressure. Cardiac imaging data such as electrocardiogram and echocardiography data, including left ventricular size and function, cardiac output, and stroke volume, were also used. We used nine machine learning models (LSVM, RBFSVM, GP, DT, RF, NN, AdaBoost, NB, and QDA), which are explainable and easily interpretable. We reported accuracy, precision, recall, and F1 scores; confusion matrices; and the area under the curve (AUC).
{"title":"Machine Learning Models for the Identification of Cardiovascular Diseases Using UK Biobank Data","authors":"Sheikh Mohammed Shariful Islam, Moloud Abrar, Teketo Tegegne, Liliana Loranjo, Chandan Karmakar, Md Abdul Awal, Md. Shahadat Hossain, Muhammad Ashad Kabir, Mufti Mahmud, Abbas Khosravi, George Siopis, Jeban C Moses, Ralph Maddison","doi":"arxiv-2407.16721","DOIUrl":"https://doi.org/arxiv-2407.16721","url":null,"abstract":"Machine learning models have the potential to identify cardiovascular diseases (CVDs) early and accurately in primary healthcare settings, which is crucial for delivering timely treatment and management. Although population-based CVD risk models have been used traditionally, these models often do not consider variations in lifestyles, socioeconomic conditions, or genetic predispositions. Therefore, we aimed to develop machine learning models for CVD detection using primary healthcare data, compare the performance of different models, and identify the best models. We used data from the UK Biobank study, which included over 500,000 middle-aged participants from different primary healthcare centers in the UK. Data collected at baseline (2006--2010) and during imaging visits after 2014 were used in this study. Baseline characteristics, including sex, age, and the Townsend Deprivation Index, were included. Participants were classified as having CVD if they reported at least one of the following conditions: heart attack, angina, stroke, or high blood pressure. Cardiac imaging data such as electrocardiogram and echocardiography data, including left ventricular size and function, cardiac output, and stroke volume, were also used. We used 9 machine learning models (LSVM, RBFSVM, GP, DT, RF, NN, AdaBoost, NB, and QDA), which are explainable and easily interpretable. We reported the accuracy, precision, recall, and F-1 scores; confusion matrices; and area under the curve (AUC) curves.","PeriodicalId":501266,"journal":{"name":"arXiv - QuanBio - Quantitative Methods","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141782987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ari Blau, Evan S Schaffer, Neeli Mishra, Nathaniel J Miska, The International Brain Laboratory, Liam Paninski, Matthew R Whiteway
Action segmentation of behavioral videos is the process of labeling each frame as belonging to one or more discrete classes, and is a crucial component of many studies that investigate animal behavior. A wide range of algorithms exists to automatically parse discrete animal behavior, encompassing supervised, unsupervised, and semi-supervised learning paradigms. These algorithms, which include tree-based models, deep neural networks, and graphical models, differ widely in their structure and in the assumptions they make about the data. Using four datasets spanning multiple species (fly, mouse, and human), we systematically study how the outputs of these various algorithms align with manually annotated behaviors of interest. Along the way, we introduce a semi-supervised action segmentation model that bridges the gap between supervised deep neural networks and unsupervised graphical models. We find that fully supervised temporal convolutional networks with the addition of temporal information in the observations perform best on our supervised metrics across all datasets.
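To make the frame-labeling idea above concrete, the sketch below collapses a per-frame label sequence into contiguous segments and measures frame-wise agreement with a manual annotation. It is an illustration under simplifying assumptions, not the method evaluated in the paper: each frame is assumed to carry exactly one label, and the function names and behavior names ("still", "walk", "groom") are hypothetical.

```python
from itertools import groupby
from typing import Hashable, List, Sequence, Tuple

# (label, start_frame, end_frame) with end_frame exclusive.
Segment = Tuple[Hashable, int, int]


def frames_to_segments(labels: Sequence[Hashable]) -> List[Segment]:
    """Collapse runs of identical consecutive frame labels into segments."""
    segments: List[Segment] = []
    start = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        segments.append((label, start, start + length))
        start += length
    return segments


def frame_agreement(predicted: Sequence[Hashable], annotated: Sequence[Hashable]) -> float:
    """Fraction of frames where the predicted label equals the manual annotation."""
    if len(predicted) != len(annotated):
        raise ValueError("label sequences must cover the same frames")
    if not predicted:
        return 0.0
    matches = sum(1 for p, a in zip(predicted, annotated) if p == a)
    return matches / len(predicted)


# Hypothetical seven-frame clip (illustration only, not from the paper's datasets).
annotated_frames = ["still", "still", "walk", "walk", "walk", "groom", "still"]
predicted_frames = ["still", "walk", "walk", "walk", "walk", "groom", "groom"]

annotated_segments = frames_to_segments(annotated_frames)
agreement = frame_agreement(predicted_frames, annotated_frames)
```

Frame-wise agreement is only one possible comparison; the abstract does not specify which supervised metrics were used, so this choice is an assumption of the sketch.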
{"title":"A study of animal action segmentation algorithms across supervised, unsupervised, and semi-supervised learning paradigms","authors":"Ari Blau, Evan S Schaffer, Neeli Mishra, Nathaniel J Miska, The International Brain Laboratory, Liam Paninski, Matthew R Whiteway","doi":"arxiv-2407.16727","DOIUrl":"https://doi.org/arxiv-2407.16727","url":null,"abstract":"Action segmentation of behavioral videos is the process of labeling each frame as belonging to one or more discrete classes, and is a crucial component of many studies that investigate animal behavior. A wide range of algorithms exist to automatically parse discrete animal behavior, encompassing supervised, unsupervised, and semi-supervised learning paradigms. These algorithms -- which include tree-based models, deep neural networks, and graphical models -- differ widely in their structure and assumptions on the data. Using four datasets spanning multiple species -- fly, mouse, and human -- we systematically study how the outputs of these various algorithms align with manually annotated behaviors of interest. Along the way, we introduce a semi-supervised action segmentation model that bridges the gap between supervised deep neural networks and unsupervised graphical models. We find that fully supervised temporal convolutional networks with the addition of temporal information in the observations perform the best on our supervised metrics across all datasets.","PeriodicalId":501266,"journal":{"name":"arXiv - QuanBio - Quantitative Methods","volume":"355 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141782990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}