Pub Date: 2023-12-17 | DOI: 10.1142/9789811286421_0039
Nicolae Sapoval, Marko Tanevski, T. Treangen
The microbes present in the human gastrointestinal tract are regularly linked to human health and disease outcomes. Thanks to technological and methodological advances in recent years, metagenomic sequencing data, and the computational methods designed to analyze them, have contributed to an improved understanding of the link between the human gut microbiome and disease. However, while numerous methods have recently been developed to extract quantitative and qualitative results from host-associated microbiome data, improved computational tools are still needed to track microbiome dynamics with short-read sequencing data. Previously, we proposed KOMB, a de novo tool for identifying copy number variations in metagenomes and characterizing microbial genome dynamics in response to perturbations. In this work, we present KombOver (KO), which includes four key contributions with respect to our previous work: (i) it scales to large microbiome study cohorts, (ii) it includes both k-core- and K-truss-based analysis, (iii) we provide the foundation of a theoretical understanding of the relation between various graph-based metagenome representations, and (iv) we provide an improved user experience with easier-to-run code and more descriptive outputs/results. To highlight the aforementioned benefits, we applied KO to nearly 1000 human microbiome samples, requiring less than 10 minutes and 10 GB of RAM per sample. Furthermore, we highlight how graph-based approaches such as k-core and K-truss can be informative for pinpointing microbial community dynamics within a myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) cohort. KO is open source and available for download/use at: https://github.com/treangenlab/komb
{"title":"KombOver: Efficient k-core and K-truss based characterization of perturbations within the human gut microbiome","authors":"Nicolae Sapoval, Marko Tanevski, T. Treangen","doi":"10.1142/9789811286421_0039","DOIUrl":"https://doi.org/10.1142/9789811286421_0039","url":null,"abstract":"The microbes present in the human gastrointestinal tract are regularly linked to human health and disease outcomes. Thanks to technological and methodological advances in recent years, metagenomic sequencing data, and computational methods designed to analyze metagenomic data, have contributed to improved understanding of the link between the human gut microbiome and disease. However, while numerous methods have been recently developed to extract quantitative and qualitative results from host-associated microbiome data, improved computational tools are still needed to track microbiome dynamics with short-read sequencing data. Previously we have proposed KOMB as a de novo tool for identifying copy number variations in metagenomes for characterizing microbial genome dynamics in response to perturbations. In this work, we present KombOver (KO), which includes four key contributions with respect to our previous work: (i) it scales to large microbiome study cohorts, (ii) it includes both k-core and K-truss based analysis, (iii) we provide the foundation of a theoretical understanding of the relation between various graph-based metagenome representations, and (iv) we provide an improved user experience with easier-to-run code and more descriptive outputs/results. To highlight the aforementioned benefits, we applied KO to nearly 1000 human microbiome samples, requiring less than 10 minutes and 10 GB RAM per sample to process these data. Furthermore, we highlight how graph-based approaches such as k-core and K-truss can be informative for pinpointing microbial community dynamics within a myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) cohort. KO is open source and available for download/use at: https://github.com/treangenlab/komb","PeriodicalId":34954,"journal":{"name":"Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing","volume":"28 4","pages":"506 - 520"},"PeriodicalIF":0.0,"publicationDate":"2023-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139176652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-12-17 | DOI: 10.1142/9789811286421_0020
Rachel A. Hoffing, A. Deaton, Aaron M. Holleman, Lynne Krohn, Philip J. LoGerfo, Mollie E. Plekan, Sebastian Akle Serrano, P. Nioi, Lucas D. Ward
A single gene can produce multiple transcripts with distinct molecular functions. Rare-variant association tests often aggregate all coding variants across individual genes, without accounting for the variants' presence or consequence in resulting transcript isoforms. To evaluate the utility of transcript-aware variant sets, rare predicted loss-of-function (pLOF) variants were aggregated for 17,035 protein-coding genes using 55,558 distinct transcript-specific variant sets. These sets were tested for their association with 728 circulating proteins and 188 quantitative phenotypes across 406,921 individuals in the UK Biobank. The transcript-specific approach resulted in larger estimated effects of pLOF variants decreasing serum cis-protein levels compared to the gene-based approach (p_binom ≤ 2x10^-16). Additionally, 251 quantitative trait associations were identified as being significant using the transcript-specific approach but not the gene-based approach, including PCSK5 transcript ENST00000376752 and standing height (transcript-specific statistic, P = 1.3x10^-16, effect = 0.7 SD decrease; gene-based statistic, P = 0.02, effect = 0.05 SD decrease) and LDLR transcript ENST00000252444 and apolipoprotein B (transcript-specific statistic, P = 5.7x10^-20, effect = 1.0 SD increase; gene-based statistic, P = 3.0x10^-4, effect = 0.2 SD increase). This approach demonstrates the importance of considering the effect of pLOFs on specific transcript isoforms when performing rare-variant association studies.
{"title":"Transcript-aware analysis of rare predicted loss-of-function variants in the UK Biobank elucidate new isoform-trait associations.","authors":"Rachel A. Hoffing, A. Deaton, Aaron M. Holleman, Lynne Krohn, Philip J. LoGerfo, Mollie E. Plekan, Sebastian Akle Serrano, P. Nioi, Lucas D. Ward","doi":"10.1142/9789811286421_0020","DOIUrl":"https://doi.org/10.1142/9789811286421_0020","url":null,"abstract":"A single gene can produce multiple transcripts with distinct molecular functions. Rare-variant association tests often aggregate all coding variants across individual genes, without accounting for the variants' presence or consequence in resulting transcript isoforms. To evaluate the utility of transcript-aware variant sets, rare predicted loss-of-function (pLOF) variants were aggregated for 17,035 protein-coding genes using 55,558 distinct transcript-specific variant sets. These sets were tested for their association with 728 circulating proteins and 188 quantitative phenotypes across 406,921 individuals in the UK Biobank. The transcript-specific approach resulted in larger estimated effects of pLOF variants decreasing serum cis-protein levels compared to the gene-based approach (pbinom ≤ 2x10-16). Additionally, 251 quantitative trait associations were identified as being significant using the transcript-specific approach but not the gene-based approach, including PCSK5 transcript ENST00000376752 and standing height (transcript-specific statistic, P = 1.3x10-16, effect = 0.7 SD decrease; gene-based statistic, P = 0.02, effect = 0.05 SD decrease) and LDLR transcript ENST00000252444 and apolipoprotein B (transcript-specific statistic, P = 5.7x10-20, effect = 1.0 SD increase; gene-based statistic, P = 3.0x10-4, effect = 0.2 SD increase). This approach demonstrates the importance of considering the effect of pLOFs on specific transcript isoforms when performing rare-variant association studies.","PeriodicalId":34954,"journal":{"name":"Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing","volume":"760 ","pages":"247-260"},"PeriodicalIF":0.0,"publicationDate":"2023-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139176712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-12-17 | DOI: 10.1142/9789811286421_0005
Alexis Li, Yi Yang, Hejie Cui, Carl Yang
Functional brain networks represent dynamic and complex interactions among anatomical regions of interest (ROIs), providing crucial clinical insights for neural pattern discovery and disorder diagnosis. In recent years, graph neural networks (GNNs) have proven immensely successful and effective in analyzing structured network data. However, because the high complexity of data acquisition limits the amount of neuroimaging training data, GNNs, like all deep learning models, suffer from overfitting. Moreover, their capability to capture useful neural patterns for downstream prediction is also adversely affected. To address this challenge, this study proposes BrainSTEAM, an integrated framework featuring a spatio-temporal module that consists of an EdgeConv GNN model, an autoencoder network, and a Mixup strategy. In particular, the spatio-temporal module dynamically segments the time series signals of the ROI features for each subject into chunked sequences. We leverage each sequence to construct correlation networks, thereby increasing the training data. Additionally, we employ the EdgeConv GNN to capture ROI connectivity structures, an autoencoder for data denoising, and Mixup for enhancing model training through linear data augmentation. We evaluate our framework on two real-world neuroimaging datasets, ABIDE for autism prediction and HCP for gender prediction. Extensive experiments demonstrate the superiority and robustness of BrainSTEAM when compared to a variety of existing models, showcasing the strong potential of our proposed mechanisms to generalize to other studies of connectome-based fMRI analysis.
{"title":"BrainSTEAM: A Practical Pipeline for Connectome-based fMRI Analysis towards Subject Classification.","authors":"Alexis Li, Yi Yang, Hejie Cui, Carl Yang","doi":"10.1142/9789811286421_0005","DOIUrl":"https://doi.org/10.1142/9789811286421_0005","url":null,"abstract":"Functional brain networks represent dynamic and complex interactions among anatomical regions of interest (ROIs), providing crucial clinical insights for neural pattern discovery and disorder diagnosis. In recent years, graph neural networks (GNNs) have proven immense success and effectiveness in analyzing structured network data. However, due to the high complexity of data acquisition, resulting in limited training resources of neuroimaging data, GNNs, like all deep learning models, suffer from overfitting. Moreover, their capability to capture useful neural patterns for downstream prediction is also adversely affected. To address such challenge, this study proposes BrainSTEAM, an integrated framework featuring a spatio-temporal module that consists of an EdgeConv GNN model, an autoencoder network, and a Mixup strategy. In particular, the spatio-temporal module aims to dynamically segment the time series signals of the ROI features for each subject into chunked sequences. We leverage each sequence to construct correlation networks, thereby increasing the training data. Additionally, we employ the EdgeConv GNN to capture ROI connectivity structures, an autoencoder for data denoising, and mixup for enhancing model training through linear data augmentation. We evaluate our framework on two real-world neuroimaging datasets, ABIDE for Autism prediction and HCP for gender prediction. Extensive experiments demonstrate the superiority and robustness of BrainSTEAM when compared to a variety of existing models, showcasing the strong potential of our proposed mechanisms in generalizing to other studies for connectome-based fMRI analysis.","PeriodicalId":34954,"journal":{"name":"Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing","volume":"418 1","pages":"53-64"},"PeriodicalIF":0.0,"publicationDate":"2023-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139176794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-12-17 | DOI: 10.1142/9789811286421_0023
Armand Ovanessians, Carson Snow, Thomas Jennewein, Susanta Sarkar, Gil Speyer, Judith Klein-Seetharaman
Assembling an "integrated structural map of the human cell" at atomic resolution will require a complete set of all human protein structures available for interaction with other biomolecules - the human protein structure targetome - and a pipeline of automated tools that allow quantitative analysis of millions of protein-ligand interactions. Toward this goal, we here describe the creation of a curated database of experimentally determined human protein structures. Starting with the sequences of 20,422 human proteins, we selected the most representative structure for each protein (if available) from the protein database (PDB), ranking structures by coverage of sequence by structure, depth (the difference between the final and initial residue number of each chain), resolution, and experimental method used to determine the structure. To enable expansion into an entire human targetome, we docked small molecule ligands to our curated set of protein structures. Using design constraints derived from comparing structure assembly and ligand docking results obtained with challenging protein examples, we here propose to combine this curated database of experimental structures with AlphaFold predictions and multi-domain assembly using DEMO2 in the future. To demonstrate the utility of our curated database in identification of the human protein structure targetome, we used docking with AutoDock Vina and created tools for automated analysis of affinity and binding site locations of the thousands of protein-ligand prediction results. The resulting human targetome, which can be updated and expanded with an evolving curated database and increasing numbers of ligands, is a valuable addition to the growing toolkit of structural bioinformatics.
{"title":"Creation of a Curated Database of Experimentally Determined Human Protein Structures for the Identification of Its Targetome.","authors":"Armand Ovanessians, Carson Snow, Thomas Jennewein, Susanta Sarkar, Gil Speyer, Judith Klein-Seetharaman","doi":"10.1142/9789811286421_0023","DOIUrl":"https://doi.org/10.1142/9789811286421_0023","url":null,"abstract":"Assembling an \"integrated structural map of the human cell\" at atomic resolution will require a complete set of all human protein structures available for interaction with other biomolecules - the human protein structure targetome - and a pipeline of automated tools that allow quantitative analysis of millions of protein-ligand interactions. Toward this goal, we here describe the creation of a curated database of experimentally determined human protein structures. Starting with the sequences of 20,422 human proteins, we selected the most representative structure for each protein (if available) from the protein database (PDB), ranking structures by coverage of sequence by structure, depth (the difference between the final and initial residue number of each chain), resolution, and experimental method used to determine the structure. To enable expansion into an entire human targetome, we docked small molecule ligands to our curated set of protein structures. Using design constraints derived from comparing structure assembly and ligand docking results obtained with challenging protein examples, we here propose to combine this curated database of experimental structures with AlphaFold predictions and multi-domain assembly using DEMO2 in the future. To demonstrate the utility of our curated database in identification of the human protein structure targetome, we used docking with AutoDock Vina and created tools for automated analysis of affinity and binding site locations of the thousands of protein-ligand prediction results. The resulting human targetome, which can be updated and expanded with an evolving curated database and increasing numbers of ligands, is a valuable addition to the growing toolkit of structural bioinformatics.","PeriodicalId":34954,"journal":{"name":"Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing","volume":"350 1","pages":"291-305"},"PeriodicalIF":0.0,"publicationDate":"2023-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139176830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-12-17 | DOI: 10.1142/9789811286421_0013
Michelle Holko, Chris Lunt, Jessilyn P Dunn
Data from digital health technologies (DHT), including wearable sensors like Apple Watch, Whoop, Oura Ring, and Fitbit, are increasingly being used in biomedical research. Research and development of DHT-related devices, platforms, and applications is happening rapidly and with significant private-sector involvement, with new biotech companies and large tech companies (e.g. Google, Apple, Amazon, Uber) investing heavily in technologies to improve human health. Many academic institutions are building capabilities related to DHT research, often in cross-sector collaboration with technology companies and other organizations, with the goal of generating clinically meaningful evidence to improve patient care, to identify users at an earlier stage of disease presentation, and to support health preservation and disease prevention. Large research consortia, cross-sector partnerships, and individual research labs are all represented in the current corpus of published studies. Some of the large research studies, like NIH's All of Us Research Program, make data sets from wearable sensors available to the research community, while the vast majority of data from wearable sensors and other DHTs are held by private-sector organizations and are not readily available to the research community. As data are unlocked from the private sector and made available to the academic research community, there is an opportunity to develop innovative analytics and methods through expanded access. This is the second year for this session, which solicited research results leveraging digital health technologies (including wearable sensor data), descriptions of novel analytical methods, and work on issues related to diversity, equity, and inclusion (DEI) in the research, the data, and the community of researchers working in this area. We particularly encouraged submissions describing opportunities for expanding and democratizing academic research using data from wearable sensors and related digital health technologies.
{"title":"Session Introduction: Digital health technology data in biocomputing: Research efforts and considerations for expanding access (PSB2024).","authors":"Michelle Holko, Chris Lunt, Jessilyn P Dunn","doi":"10.1142/9789811286421_0013","DOIUrl":"https://doi.org/10.1142/9789811286421_0013","url":null,"abstract":"Data from digital health technologies (DHT), including wearable sensors like Apple Watch, Whoop, Oura Ring, and Fitbit, are increasingly being used in biomedical research. Research and development of DHT-related devices, platforms, and applications is happening rapidly and with significant private-sector involvement with new biotech companies and large tech companies (e.g. Google, Apple, Amazon, Uber) investing heavily in technologies to improve human health. Many academic institutions are building capabilities related to DHT research, often in cross-sector collaboration with technology companies and other organizations with the goal of generating clinically meaningful evidence to improve patient care, to identify users at an earlier stage of disease presentation, and to support health preservation and disease prevention. Large research consortia, cross-sector partnerships, and individual research labs are all represented in the current corpus of published studies. Some of the large research studies, like NIH's All of Us Research Program, make data sets from wearable sensors available to the research community, while the vast majority of data from wearable sensors and other DHTs are held by private sector organizations and are not readily available to the research community. As data are unlocked from the private sector and made available to the academic research community, there is an opportunity to develop innovative analytics and methods through expanded access. This is the second year for this Session which solicited research results leveraging digital health technologies, including wearable sensor data, describing novel analytical methods, and issues related to diversity, equity, inclusion (DEI) of the research, data, and the community of researchers working in this area. We particularly encouraged submissions describing opportunities for expanding and democratizing academic research using data from wearable sensors and related digital health technologies.","PeriodicalId":34954,"journal":{"name":"Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing","volume":"47 3","pages":"163-169"},"PeriodicalIF":0.0,"publicationDate":"2023-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139176619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-12-17 | DOI: 10.1142/9789811286421_0052
Mengzhou Hu, Xikun Zhang, Andrew Latham, Andrej Šali, T. Ideker, Emma Lundberg
Cells consist of large components, such as organelles, that recursively factor into smaller systems, such as condensates and protein complexes, forming a dynamic multi-scale structure of the cell. Recent technological innovations have paved the way for systematic interrogation of subcellular structures, yielding unprecedented insights into their roles and interactions. In this workshop, we discuss progress, challenges, and collaboration to marshal various computational approaches toward assembling an integrated structural map of the human cell.
{"title":"Tools for assembling the cell: Towards the era of cell structural bioinformatics.","authors":"Mengzhou Hu, Xikun Zhang, Andrew Latham, Andrej Šali, T. Ideker, Emma Lundberg","doi":"10.1142/9789811286421_0052","DOIUrl":"https://doi.org/10.1142/9789811286421_0052","url":null,"abstract":"Cells consist of large components, such as organelles, that recursively factor into smaller systems, such as condensates and protein complexes, forming a dynamic multi-scale structure of the cell. Recent technological innovations have paved the way for systematic interrogation of subcellular structures, yielding unprecedented insights into their roles and interactions. In this workshop, we discuss progress, challenges, and collaboration to marshal various computational approaches toward assembling an integrated structural map of the human cell.","PeriodicalId":34954,"journal":{"name":"Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing","volume":"794 ","pages":"661-665"},"PeriodicalIF":0.0,"publicationDate":"2023-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139176684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-12-17 | DOI: 10.1142/9789811286421_0009
Charmi Patel, Yiyang Wang, Thiruvarangan Ramaraj, Roselyne B. Tchoua, Jacob Furst, D. Raicu
Classical machine learning and deep learning models for Computer-Aided Diagnosis (CAD) commonly focus on overall classification performance, treating misclassification errors (false negatives and false positives) equally during training. This uniform treatment overlooks the distinct costs associated with each type of error, leading to suboptimal decision-making, particularly in the medical domain, where it is important to improve prediction sensitivity without significantly compromising overall accuracy. This study introduces a novel deep learning-based CAD system that incorporates a cost-sensitive parameter into the activation function. Applying our methodology to two medical imaging datasets, the Lung Image Database Consortium (LIDC) and the Breast Cancer Histological Database (BreakHis), we show statistically significant sensitivity increases of 3.84% and 5.4%, respectively, while maintaining overall accuracy. Our findings underscore the significance of integrating cost-sensitive parameters into future CAD systems to optimize performance and ultimately reduce costs and improve patient outcomes.
{"title":"Optimizing Computer-Aided Diagnosis with Cost-Aware Deep Learning Models.","authors":"Charmi Patel, Yiyang Wang, Thiruvarangan Ramaraj, Roselyne B. Tchoua, Jacob Furst, D. Raicu","doi":"10.1142/9789811286421_0009","DOIUrl":"https://doi.org/10.1142/9789811286421_0009","url":null,"abstract":"Classical machine learning and deep learning models for Computer-Aided Diagnosis (CAD) commonly focus on overall classification performance, treating misclassification errors (false negatives and false positives) equally during training. This uniform treatment overlooks the distinct costs associated with each type of error, leading to suboptimal decision-making, particularly in the medical domain where it is important to improve the prediction sensitivity without significantly compromising overall accuracy. This study introduces a novel deep learning-based CAD system that incorporates a cost-sensitive parameter into the activation function. By applying our methodologies to two medical imaging datasets, our proposed study shows statistically significant increases of 3.84% and 5.4% in sensitivity while maintaining overall accuracy for Lung Image Database Consortium (LIDC) and Breast Cancer Histological Database (BreakHis), respectively. Our findings underscore the significance of integrating cost-sensitive parameters into future CAD systems to optimize performance and ultimately reduce costs and improve patient outcomes.","PeriodicalId":34954,"journal":{"name":"Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing","volume":"44 12","pages":"108-119"},"PeriodicalIF":0.0,"publicationDate":"2023-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139176381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-12-17 | DOI: 10.1142/9789811286421_0033
Brooke Rhead, Paige E. Haffener, Y. Pouliot, Francisco M. De La Vega
The incompleteness of race and ethnicity information in real-world data (RWD) hampers its utility in promoting healthcare equity. This study introduces two methods, one heuristic and the other machine learning-based, to impute race and ethnicity from genetic ancestry using tumor profiling data. Analyzing de-identified data from over 100,000 cancer patients sequenced with the Tempus xT panel, we demonstrate that both methods outperform existing geolocation- and surname-based methods, with the machine learning approach achieving high recall (range: 0.859-0.993) and precision (range: 0.932-0.981) across four mutually exclusive race and ethnicity categories. This work presents a novel pathway to enhance the utility of RWD in studying racial disparities in healthcare.
{"title":"Imputation of race and ethnicity categories using genetic ancestry from real-world genomic testing data.","authors":"Brooke Rhead, Paige E. Haffener, Y. Pouliot, Francisco M. De La Vega","doi":"10.1142/9789811286421_0033","DOIUrl":"https://doi.org/10.1142/9789811286421_0033","url":null,"abstract":"The incompleteness of race and ethnicity information in real-world data (RWD) hampers its utility in promoting healthcare equity. This study introduces two methods-one heuristic and the other machine learning-based-to impute race and ethnicity from genetic ancestry using tumor profiling data. Analyzing de-identified data from over 100,000 cancer patients sequenced with the Tempus xT panel, we demonstrate that both methods outperform existing geolocation and surname-based methods, with the machine learning approach achieving high recall (range: 0.859-0.993) and precision (range: 0.932-0.981) across four mutually exclusive race and ethnicity categories. This work presents a novel pathway to enhance RWD utility in studying racial disparities in healthcare.","PeriodicalId":34954,"journal":{"name":"Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing","volume":"123 ","pages":"433-445"},"PeriodicalIF":0.0,"publicationDate":"2023-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139176482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-12-17 | DOI: 10.1142/9789811286421_0019
Costa Georgantas, Jaume Banus, Roger Hullin, Jonas Richiardi
Drug repurposing (DR) aims to identify new uses for approved medications outside their original indication. Computational methods for finding DR candidates usually rely on prior biological and chemical information on a specific drug or target but rarely utilize real-world observations. In this work, we propose a simple and effective systematic screening approach to measure medication impact on hospitalization risk based on large-scale observational data. We use common classification systems to group drugs and diseases into broader functional categories and test for non-zero effects in each drug-disease category pair. Treatment effects on the hospitalization risk of an individual disease are obtained by combining widely used methods for causal inference and time-to-event modelling. In total, 6468 drug-disease pairs were tested using data from the UK Biobank, focusing on cardiovascular, metabolic, and respiratory diseases. We determined key parameters to reduce the number of spurious correlations and, after correcting for multiple testing, identified 7 statistically significant associations with reduced hospitalization risk. Some of these associations have already been reported in other studies, including new potential applications for cardioselective beta-blockers and thiazides. We also found evidence for proton pump inhibitor side effects and multiple possible associations for anti-diabetic drugs. Our work demonstrates the applicability of the present screening approach and the utility of real-world data for identifying potential DR candidates.
{"title":"Systematic Estimation of Treatment Effect on Hospitalization Risk as a Drug Repurposing Screening Method.","authors":"Costa Georgantas, Jaume Banus, Roger Hullin, Jonas Richiardi","doi":"10.1142/9789811286421_0019","DOIUrl":"https://doi.org/10.1142/9789811286421_0019","url":null,"abstract":"Drug repurposing (DR) intends to identify new uses for approved medications outside their original indication. Computational methods for finding DR candidates usually rely on prior biological and chemical information on a specific drug or target but rarely utilize real-world observations. In this work, we propose a simple and effective systematic screening approach to measure medication impact on hospitalization risk based on large-scale observational data. We use common classification systems to group drugs and diseases into broader functional categories and test for non-zero effects in each drug-disease category pair. Treatment effects on the hospitalization risk of an individual disease are obtained by combining widely used methods for causal inference and time-to-event modelling. 6468 drug-disease pairs were tested using data from the UK Biobank, focusing on cardiovascular, metabolic, and respiratory diseases. We determined key parameters to reduce the number of spurious correlations and identified 7 statistically significant associations of reduced hospitalization risk after correcting for multiple testing. Some of these associations were already reported in other studies, including new potential applications for cardioselective beta-blockers and thiazides. We also found evidence for proton pump inhibitor side effects and multiple possible associations for anti-diabetic drugs. Our work demonstrates the applicability of the present screening approach and the utility of real-world data for identifying potential DR candidates.","PeriodicalId":34954,"journal":{"name":"Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing","volume":"22 12","pages":"232-246"},"PeriodicalIF":0.0,"publicationDate":"2023-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139176662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-12-17 | DOI: 10.1142/9789811286421_0030
Jacqueline A. Piekos, Jeewoo Kim, Jacob M. Keaton, J. Hellwege, Todd L. Edwards, D. V. Velez Edwards
There is a desire in research to move away from the concept of race as a clinical factor because it is a societal construct used as an imprecise proxy for geographic ancestry. In this study, we leverage the biobank from Vanderbilt University Medical Center, BioVU, to investigate relationships between genetic ancestry proportion and the clinical phenome. For all samples in BioVU, we calculated six ancestry proportions based on 1000 Genomes references: eastern African (EAFR), western African (WAFR), northern European (NEUR), southern European (SEUR), eastern Asian (EAS), and southern Asian (SAS). From PheWAS, we found the phecode category neoplasms significantly enriched for EAFR, WAFR, and SEUR, and pregnancy complications for SEUR, NEUR, SAS, and EAS (p < 0.003). We then selected the phenotypes hypertension (HTN) and atrial fibrillation (AFib) to further investigate the relationships between these phenotypes and EAFR, WAFR, SEUR, and NEUR using logistic regression modeling and non-linear restricted cubic spline (RCS) modeling. For EAS and SAS, we chose renal failure (RF) for further modeling. The relationships between HTN and AFib and the ancestries EAFR, WAFR, and SEUR were best fit by the linear model (beta p < 1x10^-4 for all), while the relationships with NEUR were best fit with RCS (HTN ANOVA p = 0.001, AFib ANOVA p < 1x10^-4). For RF, the relationship with SAS was best fit with a linear model (beta p < 1x10^-4), while the RCS model was a better fit for EAS (ANOVA p < 1x10^-4). In this study, we identify relationships between genetic ancestry and phenotypes that are best fit with non-linear modeling techniques. The assumption of linearity is integral to proper fitting of a regression model, and there is no way of knowing prior to modeling whether the relationship is truly linear.
{"title":"EVALUATING THE RELATIONSHIPS BETWEEN GENETIC ANCESTRY AND THE CLINICAL PHENOME.","authors":"Jacqueline A. Piekos, Jeewoo Kim, Jacob M. Keaton, J. Hellwege, Todd L. Edwards, D. V. Velez Edwards","doi":"10.1142/9789811286421_0030","DOIUrl":"https://doi.org/10.1142/9789811286421_0030","url":null,"abstract":"There is a desire in research to move away from the concept of race as a clinical factor because it is a societal construct used as an imprecise proxy for geographic ancestry. In this study, we leverage the biobank from Vanderbilt University Medical Center, BioVU, to investigate relationships between genetic ancestry proportion and the clinical phenome. For all samples in BioVU, we calculated six ancestry proportions based on 1000 Genomes references: eastern African (EAFR), western African (WAFR), northern European (NEUR), southern European (SEUR), eastern Asian (EAS), and southern Asian (SAS). From PheWAS, we found phecode categories significantly enriched neoplasms for EAFR, WAFR, and SEUR, and pregnancy complication in SEUR, NEUR, SAS, and EAS (p < 0.003). We then selected phenotypes hypertension (HTN) and atrial fibrillation (AFib) to further investigate the relationships between these phenotypes and EAFR, WAFR, SEUR, and NEUR using logistic regression modeling and non-linear restricted cubic spline modeling (RCS). For EAS and SAS, we chose renal failure (RF) for further modeling. The relationships between HTN and AFib and the ancestries EAFR, WAFR, and SEUR were best fit by the linear model (beta p < 1x10-4 for all) while the relationships with NEUR were best fit with RCS (HTN ANOVA p = 0.001, AFib ANOVA p < 1x10-4). For RF, the relationship with SAS was best fit with a linear model (beta p < 1x10-4) while RCS model was a better fit for EAS (ANOVA p < 1x10-4). In this study, we identify relationships between genetic ancestry and phenotypes that are best fit with non-linear modeling techniques. The assumption of linearity for regression modeling is integral for proper fitting of a model and there is no knowing a priori to modeling if the relationship is truly linear.","PeriodicalId":34954,"journal":{"name":"Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing","volume":"82 ","pages":"389-403"},"PeriodicalIF":0.0,"publicationDate":"2023-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139176666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}